Apriel-1.5-15b-Thinker-FP8-W8A8
This is the FP8 W8A8 quantized version of ServiceNow-AI/Apriel-1.5-15b-Thinker, optimized for inference with vLLM.
Model Description
Apriel-1.5-15b-Thinker is a multimodal reasoning model that combines text and image understanding capabilities. This FP8-quantized version provides:
- ~50% memory reduction (15GB vs 30GB VRAM for BF16)
- Maintained accuracy with dynamic FP8 quantization
- Faster inference with optimized FP8 compute on supported GPUs
- Full multimodal support for both text and image inputs
Key Features
- 15 billion parameters
- MIT License
- Supports both text-only and image+text interactions
- Optimized for instruction-following tasks
- FP8 W8A8 dynamic quantization for efficient inference
Hardware Requirements
- GPU: NVIDIA GPU with FP8 support (Hopper, Ada Lovelace, or Blackwell architecture)
- Examples: RTX 4090, RTX 5090, H100, L4, L40S
- VRAM: ~15-16GB (vs 30GB for original BF16 model)
- CUDA: 12.0 or newer recommended
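If you are unsure whether a given GPU has native FP8 support, a minimal check using PyTorch's reported compute capability (Ada Lovelace is 8.9, Hopper is 9.0) is sketched below. This helper is illustrative only and is not part of vLLM.

import torch

def has_native_fp8() -> bool:
    """Illustrative helper: True if the current GPU's compute capability
    is at least 8.9 (Ada Lovelace); Hopper (9.0) and newer also qualify."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(), "- native FP8:", has_native_fp8())
else:
    print("No CUDA GPU detected")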
Installation
pip install "vllm>=0.11.0"

(The quotes are needed so the shell does not interpret `>=` as a redirection.)
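As a quick sanity check (optional), the installed vLLM version can be printed from Python:

import vllm
print(vllm.__version__)  # expect 0.11.0 or newer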
Usage
Using vLLM (Recommended)
from vllm import LLM, SamplingParams

# Load model with FP8 quantization
model = LLM(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    quantization="fp8",            # Enable FP8 dynamic quantization
    trust_remote_code=True,
    runner="generate",
    max_model_len=2048,
    gpu_memory_utilization=0.9
)

# Text-only generation
prompts = ["Explain quantum computing in simple terms."]
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=256,
    top_p=0.95
)

outputs = model.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Command Line Interface
# Start vLLM server with FP8 quantization
vllm serve Apriel-1.5-15b-Thinker-FP8-W8A8 \
  --quantization fp8 \
  --trust-remote-code \
  --max-model-len 2048

# Query the server
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Apriel-1.5-15b-Thinker-FP8-W8A8",
    "prompt": "Explain machine learning",
    "max_tokens": 100,
    "temperature": 0.6
  }'
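Because the server exposes an OpenAI-compatible API, the official openai Python client can be used instead of curl. The snippet below is a sketch that assumes the server was started with the command above and that openai>=1.0 is installed; vLLM does not check the API key by default, so any placeholder string works.

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    prompt="Explain machine learning",
    max_tokens=100,
    temperature=0.6,
)
print(response.choices[0].text)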
Multimodal Usage
# For image+text inputs, follow the original model's format
# See: https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker
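As a rough sketch of image+text usage (not taken from the original model card), an image URL can be passed through the running server's OpenAI-compatible chat endpoint. The image URL below is a placeholder, and the exact prompt format should follow the original model card linked above.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Placeholder URL; replace with a real, reachable image
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)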
Quantization Details
This model uses dynamic FP8 W8A8 quantization via vLLM:
- Method: Dynamic per-token quantization at inference time
- Precision: FP8 for both weights (W8) and activations (A8)
- Implementation: Quantization is applied on-the-fly by vLLM's inference engine
- No offline preprocessing: The model weights are stored in their original format and quantized dynamically
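The snippet below illustrates what per-token dynamic quantization means numerically: each activation row gets its own scale so that its largest magnitude maps onto the FP8 E4M3 range (±448). This is a simplified PyTorch sketch for intuition, not vLLM's fused kernel, and it assumes a PyTorch build with float8 dtypes.

import torch

FP8_E4M3_MAX = 448.0  # largest finite value in float8_e4m3fn

def quantize_per_token(x: torch.Tensor):
    """Simplified per-token (per-row) dynamic FP8 quantization."""
    # One scale per token, computed from the data at inference time
    scales = x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    scales = scales.clamp(min=1e-12)              # guard against all-zero rows
    x_fp8 = (x / scales).to(torch.float8_e4m3fn)  # cast to 8-bit float
    return x_fp8, scales

x = torch.randn(4, 8, dtype=torch.bfloat16)       # stand-in for activations
x_fp8, scales = quantize_per_token(x)
x_dequant = x_fp8.to(torch.bfloat16) * scales     # round-trip to inspect error
print("max abs error:", (x - x_dequant).abs().max().item())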
Performance Benchmarks
Tested on NVIDIA GeForce RTX 5090:
| Metric | FP8 W8A8 | BF16 (Original) |
|---|---|---|
| VRAM Usage | ~15 GB | ~30 GB |
| Memory Savings | 50% | - |
| KV Cache Size | 66,256 tokens | ~33,000 tokens |
| Max Concurrency (2K ctx) | 32.35x | ~16x |
| Inference Speed | ~4.9 tok/s | ~4.5 tok/s |
| First Token Latency | Similar | Similar |
Note: Performance may vary based on hardware, prompt length, and configuration.
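The table above comes from an ad-hoc run; the sketch below shows one rough way to measure end-to-end generation throughput on your own hardware (it is not the script used to produce those numbers).

import time
from vllm import LLM, SamplingParams

model = LLM(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    quantization="fp8",
    trust_remote_code=True,
    max_model_len=2048,
)
params = SamplingParams(temperature=0.6, max_tokens=256)

start = time.perf_counter()
outputs = model.generate(["Explain quantum computing in simple terms."], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s (prefill + decode)")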
Recommended Settings
# For optimal performance
model = LLM(
model="Apriel-1.5-15b-Thinker-FP8-W8A8",
quantization="fp8",
trust_remote_code=True,
runner="generate",
max_model_len=2048, # Adjust based on your needs
gpu_memory_utilization=0.9, # Use 90% of available VRAM
enforce_eager=False, # Enable CUDA graphs
enable_prefix_caching=True, # Enable KV cache reuse
enable_chunked_prefill=True # Enable chunked prefill
)
sampling_params = SamplingParams(
temperature=0.6, # Recommended by original model
max_tokens=256,
top_p=0.95,
repetition_penalty=1.0
)
Limitations
- FP8 Hardware Required: Requires GPU with native FP8 support for optimal performance
- Dynamic Quantization: activations are quantized on the fly during each forward pass, which adds a small runtime overhead (compiled kernels are cached after the first run)
- Accuracy: Minor numerical differences may occur compared to BF16 (typically negligible)
- First Run: Initial compilation takes ~30-50 seconds (cached for subsequent runs)
Model Card Contact
For questions about this FP8 quantized version, please open an issue on the vLLM GitHub repository.
For questions about the base model, refer to the original model card: ServiceNow-AI/Apriel-1.5-15b-Thinker
Citation
If you use this model, please cite both the original Apriel model and vLLM:
@misc{apriel2025,
  title={Apriel-1.5-15b-Thinker},
  author={ServiceNow AI},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker}
}

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
License
This quantized model inherits the MIT License from the original Apriel-1.5-15b-Thinker model.
MIT License
Copyright (c) 2025 ServiceNow AI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Acknowledgments
- Original model by ServiceNow AI
- FP8 quantization support by vLLM team
- Quantization performed using vLLM v0.11.0
Model Type: Multimodal Language Model (Quantized)
Base Model: Apriel-1.5-15b-Thinker
Quantization: FP8 W8A8 Dynamic
Inference Engine: vLLM
Last Updated: October 2025