Apriel-1.5-15b-Thinker-FP8-W8A8
This is the FP8 W8A8 quantized version of ServiceNow-AI/Apriel-1.5-15b-Thinker, optimized for inference with vLLM.
Model Description
Apriel-1.5-15b-Thinker is a multimodal reasoning model that combines text and image understanding capabilities. This FP8-quantized version provides:
- ~50% memory reduction (15GB vs 30GB VRAM for BF16)
- Maintained accuracy with dynamic FP8 quantization
- Faster inference with optimized FP8 compute on supported GPUs
- Full multimodal support for both text and image inputs
Key Features
- 15 billion parameters
- MIT License
- Supports both text-only and image+text interactions
- Optimized for instruction-following tasks
- FP8 W8A8 dynamic quantization for efficient inference
Hardware Requirements
- GPU: NVIDIA GPU with FP8 support (Hopper, Ada Lovelace, or Blackwell architecture)
- Examples: RTX 4090, RTX 5090, H100, L4, L40S
- VRAM: ~15-16GB (vs 30GB for original BF16 model)
- CUDA: 12.0 or newer recommended
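If you are unsure whether a given GPU has native FP8 support, a minimal check using PyTorch's reported compute capability (Ada Lovelace is 8.9, Hopper is 9.0) is sketched below. This helper is illustrative only and is not part of vLLM.

import torch

def has_native_fp8() -> bool:
    """Illustrative helper: True if the current GPU's compute capability
    is at least 8.9 (Ada Lovelace); Hopper (9.0) and newer also qualify."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(), "- native FP8:", has_native_fp8())
else:
    print("No CUDA GPU detected")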
Installation
pip install "vllm>=0.11.0"

(The quotes are needed so the shell does not interpret `>=` as a redirection.)
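As a quick sanity check (optional), the installed vLLM version can be printed from Python:

import vllm
print(vllm.__version__)  # expect 0.11.0 or newer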
Usage
Using vLLM (Recommended)
from vllm import LLM, SamplingParams

# Load model with FP8 quantization
model = LLM(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    quantization="fp8",            # Enable FP8 dynamic quantization
    trust_remote_code=True,
    runner="generate",
    max_model_len=2048,
    gpu_memory_utilization=0.9
)

# Text-only generation
prompts = ["Explain quantum computing in simple terms."]
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=256,
    top_p=0.95
)

outputs = model.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Command Line Interface
# Start vLLM server with FP8 quantization
vllm serve Apriel-1.5-15b-Thinker-FP8-W8A8 \
  --quantization fp8 \
  --trust-remote-code \
  --max-model-len 2048

# Query the server
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Apriel-1.5-15b-Thinker-FP8-W8A8",
    "prompt": "Explain machine learning",
    "max_tokens": 100,
    "temperature": 0.6
  }'
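Because the server exposes an OpenAI-compatible API, the official openai Python client can be used instead of curl. The snippet below is a sketch that assumes the server was started with the command above and that openai>=1.0 is installed; vLLM does not check the API key by default, so any placeholder string works.

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    prompt="Explain machine learning",
    max_tokens=100,
    temperature=0.6,
)
print(response.choices[0].text)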
Multimodal Usage
# For image+text inputs, follow the original model's format
# See: https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker
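As a rough sketch of image+text usage (not taken from the original model card), an image URL can be passed through the running server's OpenAI-compatible chat endpoint. The image URL below is a placeholder, and the exact prompt format should follow the original model card linked above.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Placeholder URL; replace with a real, reachable image
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)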
Quantization Details
This model uses dynamic FP8 W8A8 quantization via vLLM:
- Method: Dynamic per-token quantization at inference time
- Precision: FP8 for both weights (W8) and activations (A8)
- Implementation: Quantization is applied on-the-fly by vLLM's inference engine
- No offline preprocessing: The model weights are stored in their original format and quantized dynamically
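The snippet below illustrates what per-token dynamic quantization means numerically: each activation row gets its own scale so that its largest magnitude maps onto the FP8 E4M3 range (±448). This is a simplified PyTorch sketch for intuition, not vLLM's fused kernel, and it assumes a PyTorch build with float8 dtypes.

import torch

FP8_E4M3_MAX = 448.0  # largest finite value in float8_e4m3fn

def quantize_per_token(x: torch.Tensor):
    """Simplified per-token (per-row) dynamic FP8 quantization."""
    # One scale per token, computed from the data at inference time
    scales = x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    scales = scales.clamp(min=1e-12)              # guard against all-zero rows
    x_fp8 = (x / scales).to(torch.float8_e4m3fn)  # cast to 8-bit float
    return x_fp8, scales

x = torch.randn(4, 8, dtype=torch.bfloat16)       # stand-in for activations
x_fp8, scales = quantize_per_token(x)
x_dequant = x_fp8.to(torch.bfloat16) * scales     # round-trip to inspect error
print("max abs error:", (x - x_dequant).abs().max().item())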
Performance Benchmarks
Tested on NVIDIA GeForce RTX 5090:
| Metric | FP8 W8A8 | BF16 (Original) |
|---|---|---|
| VRAM Usage | ~15 GB | ~30 GB |
| Memory Savings | 50% | - |
| KV Cache Size | 66,256 tokens | ~33,000 tokens |
| Max Concurrency (2K ctx) | 32.35x | ~16x |
| Inference Speed | ~4.9 tok/s | ~4.5 tok/s |
| First Token Latency | Similar | Similar |
Note: Performance may vary based on hardware, prompt length, and configuration.
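The table above comes from an ad-hoc run; the sketch below shows one rough way to measure end-to-end generation throughput on your own hardware (it is not the script used to produce those numbers).

import time
from vllm import LLM, SamplingParams

model = LLM(
    model="Apriel-1.5-15b-Thinker-FP8-W8A8",
    quantization="fp8",
    trust_remote_code=True,
    max_model_len=2048,
)
params = SamplingParams(temperature=0.6, max_tokens=256)

start = time.perf_counter()
outputs = model.generate(["Explain quantum computing in simple terms."], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s (prefill + decode)")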
Recommended Settings
# For optimal performance
model = LLM(
model="Apriel-1.5-15b-Thinker-FP8-W8A8",
quantization="fp8",
trust_remote_code=True,
runner="generate",
max_model_len=2048, # Adjust based on your needs
gpu_memory_utilization=0.9, # Use 90% of available VRAM
enforce_eager=False, # Enable CUDA graphs
enable_prefix_caching=True, # Enable KV cache reuse
enable_chunked_prefill=True # Enable chunked prefill
)
sampling_params = SamplingParams(
temperature=0.6, # Recommended by original model
max_tokens=256,
top_p=0.95,
repetition_penalty=1.0
)
Limitations
- FP8 Hardware Required: Requires GPU with native FP8 support for optimal performance
- Dynamic Quantization: activations are quantized on the fly during each forward pass, which adds a small runtime overhead (compiled kernels are cached after the first run)
- Accuracy: Minor numerical differences may occur compared to BF16 (typically negligible)
- First Run: Initial compilation takes ~30-50 seconds (cached for subsequent runs)
Model Card Contact
For questions about this FP8 quantized version, please open an issue on the vLLM GitHub repository.
For questions about the base model, refer to the original model card: ServiceNow-AI/Apriel-1.5-15b-Thinker
Citation
If you use this model, please cite both the original Apriel model and vLLM:
@misc{apriel2025,
  title={Apriel-1.5-15b-Thinker},
  author={ServiceNow AI},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker}
}

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
License
This quantized model inherits the MIT License from the original Apriel-1.5-15b-Thinker model.
MIT License
Copyright (c) 2025 ServiceNow AI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Acknowledgments
- Original model by ServiceNow AI
- FP8 quantization support by vLLM team
- Quantization performed using vLLM v0.11.0
Model Type: Multimodal Language Model (Quantized)
Base Model: Apriel-1.5-15b-Thinker
Quantization: FP8 W8A8 Dynamic
Inference Engine: vLLM
Last Updated: October 2025