MISHANM/google-gemma-3-12b-it-fp8

This model is an fp8-quantized version of google/gemma-3-12b-it. Quantizing the weights to fp8 roughly halves their memory footprint compared with 16-bit formats and speeds up inference on hardware with fp8 support, while aiming to preserve the quality of the original model. It is intended for deployments that need high throughput and low latency, such as serving workloads on accelerators with limited memory.
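
As a rough illustration of where the memory saving comes from (independent of this checkpoint's exact quantization scheme), casting a 16-bit tensor to PyTorch's float8_e4m3fn dtype halves its storage; the sizes below are for a hypothetical 4096x4096 weight matrix:

import torch

# Hypothetical 4096x4096 weight matrix: bf16 vs. fp8 (e4m3) storage
w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)

print(f"bf16: {w_bf16.nelement() * w_bf16.element_size() / 1e6:.1f} MB")  # ~33.6 MB
print(f"fp8:  {w_fp8.nelement() * w_fp8.element_size() / 1e6:.1f} MB")    # ~16.8 MB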

Model Details

  1. Tasks: Causal Language Modeling, Text Generation
  2. Base Model: google/gemma-3-12b-it
  3. Quantization Format: fp8

Device Used

  1. GPUs: 1 x AMD Instinct™ MI210 Accelerator
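
On a ROCm build of PyTorch, the MI210 is exposed through the familiar torch.cuda API, so a quick visibility check (a minimal sketch, assuming a ROCm-enabled PyTorch install) looks the same as on NVIDIA hardware:

import torch

# ROCm-enabled PyTorch reports AMD accelerators through the CUDA device API
print(torch.cuda.is_available())      # True if the MI210 is visible
print(torch.cuda.get_device_name(0))  # e.g. "AMD Instinct MI210"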

Transformers library

 pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
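
To confirm that a Gemma 3-capable build is active (a minimal check; the exact version string depends on the branch or release installed):

import transformers

print(transformers.__version__)  # expect the Gemma 3 preview branch or transformers >= 4.50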

Inference with Transformers


from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "MISHANM/google-gemma-3-12b-it-fp8"

# Load the quantized model; bfloat16 matches the dtype the inputs are cast to below
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

processor = AutoProcessor.from_pretrained(model_id)

# Define chat messages for inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Prepare inputs for the model
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

# Generate the model output and keep only the newly generated tokens
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

# Decode the generated output
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)  
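
For a quick sanity check without the vision input, the same model and processor also handle text-only chat. This is a minimal sketch reusing the objects loaded above; the prompt text is illustrative:

# Text-only prompt: omit the image entry from the user turn
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize what fp8 quantization does in one sentence."}]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=60, do_sample=False)

print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))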


Citation Information

@misc{MISHANM/google-gemma-3-12b-it-fp8,
  author = {Mishan Maurya},
  title = {Introducing fp8 quantized version of google/gemma-3-12b-it},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository}
}