---
language: 
- en
license: apache-2.0
tags:
- quantization
- sinq
- int4
- efficient-inference
- text-generation
- qwen
- llm
- compression
base_model: Qwen/Qwen3-1.7B
base_model_relation: quantized
---

<p align="center">
  <img src="logo.png" alt="Logo" style="max-width: 80%; height: auto;">
</p>

<p align="center">πŸ™ <a href="https://github.com/huawei-csl/SINQ">Github</a>&nbsp;&nbsp; | &nbsp;&nbsp;πŸ“„ <a href="http://arxiv.org/abs/2509.22944">Paper</a></p>


# SINQ 4-bit Quantized Qwen3-1.7B model

This repository contains the official **4-bit quantized** version of the [`Qwen3-1.7B`](https://huggingface.co/Qwen/Qwen3-1.7B) model using the **SINQ (Sinkhorn-Normalized Quantization)** method.  
SINQ is a novel, fast, and high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.

To support the project, please put a star ⭐ on the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.

## Model Details
- **Model Name:** `Qwen3-1.7B-4bit-SINQ`
- **Base Model:** [`Qwen/Qwen3-1.7B`](https://huggingface.co/Qwen/Qwen3-1.7B)
- **Task:** Text Generation
- **Framework:** PyTorch / Transformers
- **License:** [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Quantized By:** *Huawei - Computing Systems Lab*


## Quantization Details

- **Quantization Method:**  SINQ (Sinkhorn-Normalized Quantization)
- **Precision:** INT4 
- **Group Size:**  64 
- **Framework:**  PyTorch 
- **Quantization Library:**  `sinq` 

---

# πŸš€ Usage

## Prerequisite
Before running the examples below, make sure the **SINQ** library is installed.
Installation instructions and setup details are available in the [SINQ official github repository](https://github.com/huawei-csl/SINQ).
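
As a quick check that the setup succeeded, the import below should run without errors; the package name `sinq` matches the imports used in the examples that follow.

```python
# Minimal installation check: importing the package should succeed once SINQ
# has been set up following the instructions in the official repository.
try:
    import sinq  # noqa: F401
    print("SINQ is installed.")
except ImportError:
    print("SINQ not found; follow the setup steps in the SINQ GitHub repository.")
```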

## Usage example
You can load and use the model with our wrapper based on the πŸ€— Transformers library:

```python
import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Qwen3-1.7B-4bit-SINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16
)

prompt = "Explain neural network quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    out_ids = sinq_model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))

```
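
For chat-style prompts, the standard πŸ€— Transformers chat template can be applied before generation. The sketch below reuses `sinq_model` and `tokenizer` from the snippet above and assumes the quantized wrapper's `generate` behaves like a regular Transformers model; `enable_thinking` is a Qwen3 chat-template option, not part of the SINQ API.

```python
# Chat-style sketch reusing `sinq_model` and `tokenizer` from the example above.
messages = [{"role": "user", "content": "Summarize what weight quantization does."}]
chat_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3 template option: skip "thinking" traces
)
chat_inputs = tokenizer(chat_text, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    chat_out = sinq_model.generate(**chat_inputs, max_new_tokens=64, do_sample=False)
new_tokens = chat_out[0][chat_inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```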

<details>
<summary><span style="font-size:1.1em; font-weight:bold;">🧩 Quantization Process</span></summary>

The quantized model was obtained using the **SINQ** quantization library, following the steps below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
base_model_name = "Qwen/Qwen3-1.7B"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,            # quantization bit-width
    group_size=64,      # number of weights sharing one scale factor
    tiling_mode="1D",   # tiling strategy
    method="sinq"       # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)
```
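
After quantization, a quick sanity check can be run on the returned model. This is a sketch that assumes `qmodel` exposes the usual Transformers `generate` interface, as in the usage example above.

```python
# Hypothetical sanity check on the quantized model (assumes `qmodel` supports
# the standard Transformers generate() interface).
check_inputs = tokenizer("Quantization reduces model size by", return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    check_ids = qmodel.generate(**check_inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(check_ids[0], skip_special_tokens=True))
```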

> **Reproducibility Note**: This model was quantized using the SINQ implementation from commit [`14ad847`](https://github.com/huawei-csl/SINQ/commit/14ad847d0ab25f1794b8820506f59b5c9c1fc979) of the [SINQ](https://github.com/huawei-csl/SINQ) repository.  

</details>

<br/>

---

# 🧾 How to Cite This Work

If you find **SINQ** useful in your research or applications, please
- Put a star ⭐ on the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.
- Cite our <a href="http://arxiv.org/abs/2509.22944" target="_blank"><strong>paper</strong></a>:

```bibtex
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights}, 
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}
```