Legion-V2.1-LLaMa-70B — Quantized (compressed-tensors for vLLM)
This repository provides quantized runtime builds of
Tarek07/Legion-V2.1-LLaMa-70B, repackaged for vLLM using the compressed-tensors format.
TL;DR
• Quantized builds are published on two branches: W4A16 (INT4 weights / 16-bit activations) and W8A16 (INT8 weights / 16-bit activations).
• Serve with vLLM using --quantization compressed-tensors.
• One-shot AWQ quantization; calibration uses a chat-formatted dataset (details below).
Revisions & Branches
The main branch is a landing page (model card + links). All runnable artifacts live under the per-revision branches.
- main — placeholder / landing page
- W4A16 — 4-bit weights / 16-bit activations builds and runtime assets
- W8A16 — 8-bit weights / 16-bit activations builds
Quick links
- main: https://huggingface.co/TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors/tree/main
- W4A16: https://huggingface.co/TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors/tree/W4A16
- W8A16: https://huggingface.co/TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors/tree/W8A16
What’s inside (per revision)
- Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
- config.json with compressed-tensors metadata (weight_format, quantization, quantization_config, etc.)
- Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab if applicable)
- Optional: chat_template.jinja (inherits the parent finetune’s chat style)
Exact files can vary by branch; see Files and versions for each revision.
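For example, the artifacts on a given branch can be listed or fetched with huggingface_hub (a minimal sketch; the revision names are the branches above):

# Sketch: inspect and download a specific quantization branch with huggingface_hub.
from huggingface_hub import list_repo_files, snapshot_download

REPO_ID = "TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors"

# List the files that live on the W4A16 revision (main is only a landing page).
for path in list_repo_files(REPO_ID, revision="W4A16"):
    print(path)

# Optionally mirror that revision locally, e.g. to point vLLM at a local path.
local_dir = snapshot_download(REPO_ID, revision="W4A16")
print("Downloaded to:", local_dir)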
Quantization recipe (from the attached script)
- Method: AWQ via llm-compressor (AWQModifier + oneshot), targeting Linear layers only; lm_head is excluded from quantization.
- INT8 build (W8A16): symmetric INT8 weights, group_size=128, weight-only (no activation quantization).
- Calibration dataset: neuralmagic/LLM_compression_calibration (split train), with messages rendered by tokenizer.apply_chat_template.
- Calibration samples: 512 (num_calibration_samples=512).
- Max calibration sequence length: 2048 (max_seq_length=2048).
- Export: saved with save_compressed=True so vLLM reads compressed-tensors metadata.
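A minimal sketch of that recipe, assuming a recent llm-compressor release (import paths and AWQModifier/oneshot arguments follow the public examples and may differ slightly from the exact script used for these branches):

# Sketch only: one-shot AWQ quantization with llm-compressor.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "Tarek07/Legion-V2.1-LLaMa-70B"
SAVE_DIR = "Legion-V2.1-LLaMa-70B_CompressedTensors-W4A16"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQ_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: chat-formatted samples rendered through the model's chat template.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# AWQ on Linear layers only; lm_head stays unquantized. The W8A16 branch swaps in the
# symmetric INT8 weight-only settings described above.
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# save_compressed=True writes the compressed-tensors metadata that vLLM reads at load time.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)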
Notes
• The W4A16 branch follows the same pipeline but with 4-bit weight quantization (A16 at runtime).
• The W8A16 branch mirrors the script above (INT8/A16); it typically trades a bit more VRAM for extra numerical headroom.
Quickstart — vLLM (compressed-tensors)
Install vLLM (recent version recommended):
pip install vllm
Serve (adjust to your hardware; select a weight branch with --revision, since main hosts no weights):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors \
--revision W4A16 \
--quantization compressed-tensors \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.70 \
--dtype bfloat16
Query (OpenAI-compatible Chat Completions):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
"messages": [
{"role":"system","content":"You are Legion — helpful, precise, and safe."},
{"role":"user","content":"Give three robust strategies for long-context retrieval."}
],
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.95
}'
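The same endpoint also works with the official openai Python client (a short sketch; the api_key is a dummy value, since the local server does not check it unless configured to):

# Sketch: query the vLLM OpenAI-compatible server with the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
    messages=[
        {"role": "system", "content": "You are Legion — helpful, precise, and safe."},
        {"role": "user", "content": "Give three robust strategies for long-context retrieval."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)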
Note: compressed-tensors is a vLLM runtime format. Loading it directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible quant (e.g., GPTQ/AWQ export) or full-precision weights.
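For Python-API access without running a server, vLLM's offline LLM class can load the weights directly (a sketch, assuming a recent vLLM; pick the branch via revision and adjust parallelism to your hardware):

# Sketch: offline inference with vLLM's Python API (no HTTP server).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
    revision="W4A16",                    # runnable weights live on the per-revision branches
    quantization="compressed-tensors",
    tensor_parallel_size=8,              # adjust to your GPU count
    max_model_len=32768,
    gpu_memory_utilization=0.70,
    dtype="bfloat16",
)

params = SamplingParams(max_tokens=512, temperature=0.7, top_p=0.95)
outputs = llm.chat(
    [{"role": "user", "content": "Summarize the trade-offs between W4A16 and W8A16."}],
    params,
)
print(outputs[0].outputs[0].text)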
Prompting / Chat Template
This package follows the parent finetune’s chat conventions. If a chat_template.jinja is present, apply_chat_template will pick it up automatically.
Guidelines:
- Keep a concise system message to set behavior/tone.
- Structure user prompts clearly; enumerate steps for multi-part tasks.
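For reference, here is a sketch of how a prompt is rendered through the bundled template using only the tokenizer (the exact special tokens depend on the chat_template.jinja shipped in the branch):

# Sketch: render a chat prompt with the tokenizer's bundled chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Legion-V2.1-LLaMa-70B_CompressedTensors",
    revision="W4A16",  # tokenizer and chat template live on the weight branches
)

messages = [
    {"role": "system", "content": "You are Legion — helpful, precise, and safe."},
    {"role": "user", "content": "Outline a three-step plan for summarizing a 200-page report."},
]

# add_generation_prompt=True appends the assistant header so generation continues as the assistant.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)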
Intended use
- General instruction-following assistants
- Long-form drafting & summarization
- RAG/agent pipelines (pair with a retriever/tool layer)
Always review the parent/base model’s license and evaluate on your domain before production use.
Lineage
- Finetuned parent: https://huggingface.co/Tarek07/Legion-V2.1-LLaMa-70B
- This repo: Quantized child of the finetune (compressed-tensors for vLLM)
Hardware tips (rule-of-thumb)
- Prefer multi-GPU (tensor parallel) for best throughput on 70B-class models.
- Long contexts are KV-cache heavy; tune --max-model-len and batch size (see the rough estimate after this list).
- Use BF16 on GPUs with native support; otherwise FP16.
- Enable P2P/NVLink where possible; consider CUDA Graphs if stable in your stack.
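As a rough KV-cache estimate (assuming standard Llama-3-70B geometry of 80 layers, 8 KV heads, and head_dim 128 with a BF16 cache; check the branch's config.json for the actual values):

# Back-of-the-envelope KV-cache sizing per sequence; geometry values are assumptions.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2  # BF16 = 2 bytes

def kv_cache_gib(context_len: int, batch_size: int = 1) -> float:
    # Keys and values (x2), per layer, per token.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return batch_size * context_len * per_token / 1024**3

print(f"{kv_cache_gib(32768):.1f} GiB for one 32k-token sequence (split across TP ranks)")
print(f"{kv_cache_gib(32768, batch_size=8):.1f} GiB at batch size 8")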
Changelog
- V2.1 (current) — Initial compressed-tensors release; branches W4A16 and W8A16 published; model card marked Quantized.