# snorTTS-Indic-v0-AWQ-W4A16
This is a quantized version of snorbyte/snorTTS-Indic-v0 using AWQ (Activation-aware Weight Quantization) with W4A16 precision.
## Quantization Details
| Parameter | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Weight Precision | 4-bit |
| Activation Precision | 16-bit |
| Format | compressed-tensors |
| Quantization Tool | llmcompressor |
| Model Size Reduction | ~56% (8 GB → 3.5 GB; weight tensors alone shrink ~75%) |
| Calibration Samples | 512 |
| Calibration Dataset | snorbyte/indic-tts-sample-snac-encoded |
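The table above corresponds to a one-shot quantization run with llmcompressor. The sketch below shows the general shape of such a run; the `AWQModifier` import path, the `oneshot` argument names, and the recipe details (`targets`, `ignore`) are assumptions based on llmcompressor's published AWQ examples, not the exact script used for this model, and should be checked against the installed version.

```python
# Hedged sketch of the AWQ W4A16 quantization step with llmcompressor.
# Import paths and argument names are assumptions from llmcompressor's
# AWQ examples; verify against your installed version.

NUM_CALIBRATION_SAMPLES = 512   # matches "Calibration Samples" above
MAX_SEQ_LEN = 2048


def quantize(output_dir: str = "snorTTS-Indic-v0-AWQ-W4A16") -> None:
    """One-shot AWQ quantization; requires a GPU and the base model weights."""
    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier

    recipe = AWQModifier(
        scheme="W4A16",       # 4-bit weights, 16-bit activations
        targets="Linear",     # quantize linear layers
        ignore=["lm_head"],   # keep the output head in higher precision
    )
    oneshot(
        model="snorbyte/snorTTS-Indic-v0",
        dataset="snorbyte/indic-tts-sample-snac-encoded",
        recipe=recipe,
        max_seq_length=MAX_SEQ_LEN,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        output_dir=output_dir,
    )
```

Calling `quantize()` on a GPU machine writes the compressed-tensors checkpoint to `output_dir`, which vLLM can then load directly.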
## Model Overview
- Architecture: LLaMA-3.2-3B
- Base Model: canopylabs/3b-hi-pretrain-research_release
- Audio Codec: SNAC @ 24 kHz, 3 codebooks
- Languages: Hindi, Gujarati, Marathi, Punjabi, Bengali, Telugu, Kannada, Malayalam, Tamil
## Performance Comparison
| Metric | Original Model | This Model (AWQ) |
|---|---|---|
| Model Size | ~8 GB | ~3.5 GB (~56% reduction) |
| Inference Speed | Baseline | Faster (4-bit weights cut memory bandwidth) |
| Memory Usage | High | Low |
| Audio Quality | Reference | Minimal degradation |
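The reduction figure follows from the sizes in the table; it is smaller than the naive 16-bit → 4-bit ratio (75%) because embeddings, unquantized layers, and the quantization scales remain at higher precision on disk:

```python
# Size reduction implied by the table above (8 GB -> 3.5 GB).
original_gb = 8.0
quantized_gb = 3.5
reduction_pct = (original_gb - quantized_gb) / original_gb * 100
print(f"{reduction_pct:.2f}%")  # 56.25%
```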
## Usage
### With vLLM (Recommended for Production)
Run the model with vLLM's OpenAI-compatible server via Docker:
```bash
docker run \
  --runtime nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -v ~/snor-quant:/models \
  -p 8002:8002 \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  --env "HF_HUB_OFFLINE=1" \
  --ipc=host \
  --shm-size 32g \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  vllm/vllm-openai:latest \
  --port 8002 \
  --model "/models/snorTTS-Indic-v0-AWQ-W4A16" \
  --served-model-name llm \
  --host 0.0.0.0 \
  --max-model-len 2048 \
  --max-num-seqs 5 \
  --gpu-memory-utilization 0.20 \
  --dtype auto \
  --quantization compressed-tensors \
  --trust-remote-code \
  --uvicorn-log-level info
```
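Once the container is up, the server exposes the standard OpenAI completions API on port 8002 under the served model name `llm`. A minimal client sketch follows; the prompt template for producing SNAC audio tokens is model-specific and left as a placeholder here, and `max_tokens`/`temperature` are illustrative values, not tuned settings.

```python
# Minimal client for the vLLM server started above. Only the endpoint and
# served model name come from the docker command; the prompt is a placeholder.
import json
import urllib.request

ENDPOINT = "http://localhost:8002/v1/completions"
payload = {
    "model": "llm",  # matches --served-model-name llm
    "prompt": "<your snorTTS prompt here>",  # model-specific SNAC prompt format
    "max_tokens": 1024,      # illustrative value
    "temperature": 0.7,      # illustrative value
}


def send(url: str = ENDPOINT) -> dict:
    """POST the completion request; requires the container to be running."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The generated token IDs must still be mapped back to SNAC codes and decoded to 24 kHz audio with the SNAC codec, as in the base model's documentation.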