Legion-V2.1-LLaMa-70B — Quantized (compressed-tensors for vLLM)
This repository provides quantized runtime builds of
Tarek07/Legion-V2.1-LLaMa-70B, repackaged for vLLM using the compressed-tensors format.
TL;DR
• Quantized builds are published on two branches: W4A16 (INT4 weights / 16-bit activations) and W8A16 (INT8 weights / 16-bit activations).
• Serve with vLLM using --quantization compressed-tensors.
• One-shot AWQ quantization; calibration uses a chat-formatted dataset (details below).
Revisions & Branches
The main branch is a landing page (model card + links). All runnable artifacts live under the per-revision branches.
- main — placeholder / landing page
- W4A16 — 4-bit weights / 16-bit activations builds and runtime assets
- W8A16 — 8-bit weights / 16-bit activations builds
Quick links
- main: https://huggingface.co/TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors/tree/main
- W4A16: https://huggingface.co/TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors/tree/W4A16
- W8A16: https://huggingface.co/TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors/tree/W8A16
What’s inside (per revision)
- Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
- config.json with compressed-tensors metadata (weight_format, quantization, quantization_config, etc.)
- Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab if applicable)
- Optional: chat_template.jinja (inherits the parent finetune’s chat style)
Exact files can vary by branch; see Files and versions for each revision.
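For example, the artifacts on a given branch can be listed or fetched with huggingface_hub (a minimal sketch; the revision names are the branches above):

# Sketch: inspect and download a specific quantization branch with huggingface_hub.
from huggingface_hub import list_repo_files, snapshot_download

REPO_ID = "TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors"

# List the files that live on the W4A16 revision (main is only a landing page).
for path in list_repo_files(REPO_ID, revision="W4A16"):
    print(path)

# Optionally mirror that revision locally, e.g. to point vLLM at a local path.
local_dir = snapshot_download(REPO_ID, revision="W4A16")
print("Downloaded to:", local_dir)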
Quantization recipe (from the attached script)
- Method: AWQ via llm-compressor (AWQModifier + oneshot), targeting Linear layers only; lm_head is excluded from quantization.
- INT8 build (W8A16): symmetric INT8 weights, group_size=128, weight-only (no activation quantization).
- Calibration dataset: neuralmagic/LLM_compression_calibration (split train), with messages rendered by tokenizer.apply_chat_template.
- Calibration samples: 512 (num_calibration_samples=512).
- Max calibration sequence length: 2048 (max_seq_length=2048).
- Export: saved with save_compressed=True so vLLM reads compressed-tensors metadata.
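A minimal sketch of that recipe, assuming a recent llm-compressor release (import paths and AWQModifier/oneshot arguments follow the public examples and may differ slightly from the exact script used for these branches):

# Sketch only: one-shot AWQ quantization with llm-compressor.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "Tarek07/Legion-V2.1-LLaMa-70B"
SAVE_DIR = "Legion-V2.1-LLaMa-70B_CompressedTensors-W4A16"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQ_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: chat-formatted samples rendered through the model's chat template.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# AWQ on Linear layers only; lm_head stays unquantized. The W8A16 branch swaps in the
# symmetric INT8 weight-only settings described above.
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# save_compressed=True writes the compressed-tensors metadata that vLLM reads at load time.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)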
Notes
• The W4A16 branch follows the same pipeline but with 4-bit weight quantization (A16 at runtime).
• The W8A16 branch mirrors the script above (INT8/A16); it typically trades a bit more VRAM for extra numerical headroom.
Quickstart — vLLM (compressed-tensors)
Install vLLM (recent version recommended):
pip install vllm
Serve (adjust to your hardware; select a weight branch with --revision, since main hosts no weights):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors \
--revision W4A16 \
--quantization compressed-tensors \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.70 \
--dtype bfloat16
Query (OpenAI-compatible Chat Completions):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
"messages": [
{"role":"system","content":"You are Legion — helpful, precise, and safe."},
{"role":"user","content":"Give three robust strategies for long-context retrieval."}
],
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.95
}'
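The same endpoint also works with the official openai Python client (a short sketch; the api_key is a dummy value, since the local server does not check it unless configured to):

# Sketch: query the vLLM OpenAI-compatible server with the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
    messages=[
        {"role": "system", "content": "You are Legion — helpful, precise, and safe."},
        {"role": "user", "content": "Give three robust strategies for long-context retrieval."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)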
Note: compressed-tensors is a vLLM runtime format. Loading it directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible quant (e.g., GPTQ/AWQ export) or full-precision weights.
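For Python-API access without running a server, vLLM's offline LLM class can load the weights directly (a sketch, assuming a recent vLLM; pick the branch via revision and adjust parallelism to your hardware):

# Sketch: offline inference with vLLM's Python API (no HTTP server).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
    revision="W4A16",                    # runnable weights live on the per-revision branches
    quantization="compressed-tensors",
    tensor_parallel_size=8,              # adjust to your GPU count
    max_model_len=32768,
    gpu_memory_utilization=0.70,
    dtype="bfloat16",
)

params = SamplingParams(max_tokens=512, temperature=0.7, top_p=0.95)
outputs = llm.chat(
    [{"role": "user", "content": "Summarize the trade-offs between W4A16 and W8A16."}],
    params,
)
print(outputs[0].outputs[0].text)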
Prompting / Chat Template
This package follows the parent finetune’s chat conventions. If a chat_template.jinja is present, apply_chat_template will pick it up automatically.
Guidelines:
- Keep a concise system message to set behavior/tone.
- Structure user prompts clearly; enumerate steps for multi-part tasks.
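For reference, here is a sketch of how a prompt is rendered through the bundled template using only the tokenizer (the exact special tokens depend on the chat_template.jinja shipped in the branch):

# Sketch: render a chat prompt with the tokenizer's bundled chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
    revision="W4A16",  # tokenizer and chat template live on the weight branches
)

messages = [
    {"role": "system", "content": "You are Legion — helpful, precise, and safe."},
    {"role": "user", "content": "Outline a three-step plan for summarizing a 200-page report."},
]

# add_generation_prompt=True appends the assistant header so generation continues as the assistant.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)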
Intended use
- General instruction-following assistants
- Long-form drafting & summarization
- RAG/agent pipelines (pair with a retriever/tool layer)
Always review the parent/base model’s license and evaluate on your domain before production use.
Lineage
- Finetuned parent: https://huggingface.co/Tarek07/Legion-V2.1-LLaMa-70B
- This repo: Quantized child of the finetune (compressed-tensors for vLLM)
Hardware tips (rule-of-thumb)
- Prefer multi-GPU (tensor parallel) for best throughput on 70B-class models.
- Long contexts are KV-cache heavy; tune --max-model-len and batch size (see the rough estimate after this list).
- Use BF16 on GPUs with native support; otherwise FP16.
- Enable P2P/NVLink where possible; consider CUDA Graphs if stable in your stack.
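As a rough KV-cache estimate (assuming standard Llama-3-70B geometry of 80 layers, 8 KV heads, and head_dim 128 with a BF16 cache; check the branch's config.json for the actual values):

# Back-of-the-envelope KV-cache sizing per sequence; geometry values are assumptions.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2  # BF16 = 2 bytes

def kv_cache_gib(context_len: int, batch_size: int = 1) -> float:
    # Keys and values (x2), per layer, per token.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return batch_size * context_len * per_token / 1024**3

print(f"{kv_cache_gib(32768):.1f} GiB for one 32k-token sequence (split across TP ranks)")
print(f"{kv_cache_gib(32768, batch_size=8):.1f} GiB at batch size 8")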
Changelog
- V2.1 (current) — Initial compressed-tensors release; branches W4A16 and W8A16 published; model card marked Quantized.