# snorTTS-Indic-v0-AWQ-W4A16
This is a quantized version of snorbyte/snorTTS-Indic-v0 using AWQ (Activation-aware Weight Quantization) with W4A16 precision.
## Quantization Details
| Parameter | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Weight Precision | 4-bit |
| Activation Precision | 16-bit |
| Format | compressed-tensors |
| Quantization Tool | llmcompressor |
| Model Size Reduction | ~56% (8 GB → 3.5 GB; weight tensors alone shrink ~75%) |
| Calibration Samples | 512 |
| Calibration Dataset | snorbyte/indic-tts-sample-snac-encoded |
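The table above corresponds to a one-shot quantization run with llmcompressor. The sketch below shows the general shape of such a run; the `AWQModifier` import path, the `oneshot` argument names, and the recipe details (`targets`, `ignore`) are assumptions based on llmcompressor's published AWQ examples, not the exact script used for this model, and should be checked against the installed version.

```python
# Hedged sketch of the AWQ W4A16 quantization step with llmcompressor.
# Import paths and argument names are assumptions from llmcompressor's
# AWQ examples; verify against your installed version.

NUM_CALIBRATION_SAMPLES = 512   # matches "Calibration Samples" above
MAX_SEQ_LEN = 2048


def quantize(output_dir: str = "snorTTS-Indic-v0-AWQ-W4A16") -> None:
    """One-shot AWQ quantization; requires a GPU and the base model weights."""
    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier

    recipe = AWQModifier(
        scheme="W4A16",       # 4-bit weights, 16-bit activations
        targets="Linear",     # quantize linear layers
        ignore=["lm_head"],   # keep the output head in higher precision
    )
    oneshot(
        model="snorbyte/snorTTS-Indic-v0",
        dataset="snorbyte/indic-tts-sample-snac-encoded",
        recipe=recipe,
        max_seq_length=MAX_SEQ_LEN,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        output_dir=output_dir,
    )
```

Calling `quantize()` on a GPU machine writes the compressed-tensors checkpoint to `output_dir`, which vLLM can then load directly.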
## Model Overview
- Architecture: LLaMA-3.2-3B
- Base Model: canopylabs/3b-hi-pretrain-research_release
- Audio Codec: SNAC @ 24 kHz, 3 codebooks
- Languages: Hindi, Gujarati, Marathi, Punjabi, Bengali, Telugu, Kannada, Malayalam, Tamil
## Performance Comparison
| Metric | Original Model | This Model (AWQ) |
|---|---|---|
| Model Size | ~8 GB | ~3.5 GB (~56% reduction) |
| Inference Speed | Baseline | Faster (4-bit weights cut memory bandwidth) |
| Memory Usage | High | Low |
| Audio Quality | Reference | Minimal degradation |
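The reduction figure follows from the sizes in the table; it is smaller than the naive 16-bit → 4-bit ratio (75%) because embeddings, unquantized layers, and the quantization scales remain at higher precision on disk:

```python
# Size reduction implied by the table above (8 GB -> 3.5 GB).
original_gb = 8.0
quantized_gb = 3.5
reduction_pct = (original_gb - quantized_gb) / original_gb * 100
print(f"{reduction_pct:.2f}%")  # 56.25%
```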
## Usage
### With vLLM (Recommended for Production)
Run the model with vLLM's OpenAI-compatible server via Docker:
```bash
docker run \
  --runtime nvidia \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  -v ~/snor-quant:/models \
  -p 8002:8002 \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  --env "HF_HUB_OFFLINE=1" \
  --ipc=host \
  --shm-size 32g \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  vllm/vllm-openai:latest \
  --port 8002 \
  --model "/models/snorTTS-Indic-v0-AWQ-W4A16" \
  --served-model-name llm \
  --host 0.0.0.0 \
  --max-model-len 2048 \
  --max-num-seqs 5 \
  --gpu-memory-utilization 0.20 \
  --dtype auto \
  --quantization compressed-tensors \
  --trust-remote-code \
  --uvicorn-log-level info
```
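Once the container is up, the server exposes the standard OpenAI completions API on port 8002 under the served model name `llm`. A minimal client sketch follows; the prompt template for producing SNAC audio tokens is model-specific and left as a placeholder here, and `max_tokens`/`temperature` are illustrative values, not tuned settings.

```python
# Minimal client for the vLLM server started above. Only the endpoint and
# served model name come from the docker command; the prompt is a placeholder.
import json
import urllib.request

ENDPOINT = "http://localhost:8002/v1/completions"
payload = {
    "model": "llm",  # matches --served-model-name llm
    "prompt": "<your snorTTS prompt here>",  # model-specific SNAC prompt format
    "max_tokens": 1024,      # illustrative value
    "temperature": 0.7,      # illustrative value
}


def send(url: str = ENDPOINT) -> dict:
    """POST the completion request; requires the container to be running."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The generated token IDs must still be mapped back to SNAC codes and decoded to 24 kHz audio with the SNAC codec, as in the base model's documentation.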