AutoRound-INT8-gs128 Collection
A collection of 122 models quantized in AutoRound format using Intel AutoRound: INT8, group size 128.
Quantized version of tiiuae/Falcon3-3B-Instruct, using torch.float32 for quantization tuning.
The W8G128 recipe is fast and memory-efficient, giving a 2-3X speedup with only a slight accuracy drop.
Quantization framework: Intel AutoRound v0.4.5
Note: this INT8 version of Falcon3-3B-Instruct has been quantized to run inference on CPU.
I suggest installing the requirements into a dedicated Python virtualenv or conda environment.
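For example, a minimal setup with venv (the environment name autoround-env is an arbitrary choice; a conda environment works equally well):
python3 -m venv autoround-env
source autoround-env/bin/activate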
wget https://github.com/intel/auto-round/archive/refs/tags/v0.4.5.tar.gz
tar -xvzf v0.4.5.tar.gz
cd auto-round-0.4.5
pip install -r requirements-cpu.txt --upgrade
pip install -vvv --no-build-isolation -e .[cpu]
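As an optional sanity check, you can confirm that the package imports cleanly and that pip reports the expected version:
python -c "import auto_round"
pip show auto-round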
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from auto_round import AutoRound

  model_name = "tiiuae/Falcon3-3B-Instruct"
  model = AutoModelForCausalLM.from_pretrained(model_name)
  tokenizer = AutoTokenizer.from_pretrained(model_name)

  # INT8, group size 128, asymmetric quantization, tuned on CPU without AMP
  bits, group_size, sym, device, amp = 8, 128, False, 'cpu', False
  autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4,
                        bits=bits, group_size=group_size, sym=sym, device=device, amp=amp)
  autoround.quantize()

  # Export in the auto_round format
  output_dir = "./AutoRound/tiiuae_Falcon3-3B-Instruct-autoround-int8-gs128-asym"
  autoround.save_quantized(output_dir, format='auto_round', inplace=True)
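Once saved, the INT8 model can be loaded back for CPU inference through transformers. Below is a minimal sketch: the AutoRoundConfig import registers the auto_round format with transformers, and the prompt and generation settings are only illustrative.
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from auto_round import AutoRoundConfig  # registers the auto_round quantization format

  quantized_model_path = "./AutoRound/tiiuae_Falcon3-3B-Instruct-autoround-int8-gs128-asym"
  model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="cpu")
  tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)

  prompt = "Explain weight quantization in one sentence."  # illustrative prompt
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  output = model.generate(**inputs, max_new_tokens=64)
  print(tokenizer.decode(output[0], skip_special_tokens=True))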
This quantized model comes with no warranty. It has been developed only for research purposes.