metadata
pipeline_tag: text-generation
tags:
  - text-generation
  - conversational
  - compressed-tensors
  - awq
  - w4a16
  - quantized
base_model: TheDrummer/Behemoth-R1-123B-v2
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
inference: false
model-index:
  - name: Behemoth-R1-123B-v2_Compressed-Tensors (AWQ W4A16)
    results: []
    metadata:
      base_model: TheDrummer/Behemoth-R1-123B-v2
      quantized: true
      quantization: awq
      weight_format: compressed-tensors
license: cc-by-nc-4.0
language:
  - en

Behemoth-R1-123B-v2 — Quantized (compressed-tensors for vLLM)

Revisions & Branches

  • main — placeholder landing branch. The canonical README lives here; model files may be minimal.
  • W4A16 — symmetric AWQ 4‑bit weights / 16‑bit activations builds and related assets are published under this revision. (Use this branch for the Marlin kernel in vLLM.)
  • W4A16-ASYM — asymmetric AWQ 4‑bit weights / 16‑bit activations builds and related assets are published under this revision.
  • INT8-W8A16 — INT8 8‑bit weights / 16‑bit activations builds are published under this revision.

🔗 Quick links:
Browse main · Browse W4A16 · Browse W4A16-ASYM · Browse INT8-W8A16

This repository hosts multiple quantizations of TheDrummer/Behemoth-R1-123B-v2 (a finetune of mistralai/Mistral-Large-Instruct-2411), packaged for vLLM in the compressed-tensors runtime format.

TL;DR

  • This repo hosts quantized builds of the parent model (AWQ W4A16, AWQ W4A16_ASYM, and INT8 W8A16) for vLLM.
  • Load with vLLM using --quantization compressed-tensors (select the branch with your desired quant).
  • Typical AWQ recipe: group_size=128, keep lm_head in higher precision; uses the upstream Mistral‑Instruct chat template.
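
To confirm the exact settings a given branch was built with (group size, symmetry, which modules were left in higher precision), you can read the quantization section of that branch's config.json. A minimal sketch using huggingface_hub; the quantization_config key is where compressed-tensors checkpoints normally record this, but verify against the actual file:

# Sketch: inspect the quantization settings recorded in a branch's config.json.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download(
    repo_id="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    filename="config.json",
    revision="W4A16",  # or "W4A16-ASYM" / "INT8-W8A16"
)
with open(cfg_path) as f:
    cfg = json.load(f)

# Expected (not guaranteed) to show group_size, ignored modules such as lm_head, etc.
print(json.dumps(cfg.get("quantization_config", {}), indent=2))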

Repository Contents

  • Quantized weights in sharded .safetensors (model-00001-of-XXXXX.safetensors + model.safetensors.index.json)
  • config.json with compressed-tensors metadata
  • Tokenizer artifacts (e.g., tokenizer.json, tokenizer.model)
  • (If present) chat_template.jinja
  • This README.md

Exact file list may vary by release; see Files and versions.
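
Because each quantization lives on its own branch, downloads should pin a revision. A minimal sketch with huggingface_hub (the allow_patterns filter is just an example):

# Sketch: fetch one quantization branch locally.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",  # pick the branch you want
    allow_patterns=["*.safetensors", "*.json", "tokenizer.*", "*.jinja"],
)
print(local_dir)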


Lineage

mistralai/Mistral-Large-Instruct-2411 → TheDrummer/Behemoth-R1-123B-v2 (finetune) → TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors (this repo, quantized)

Quickstart — vLLM (compressed-tensors)

Install vLLM (use a recent version):

pip install vllm

Serve the quantized model (adjust parallelism to your hardware):

# Example: tensor parallel across 8 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16   # or float16 on GPUs without strong BF16 support
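
If you prefer offline (non-server) inference, the same weights can be loaded through vLLM's Python API. A minimal sketch, assuming a recent vLLM (llm.chat requires a reasonably new release) and explicitly pinning the W4A16 branch via revision; the CLI example above loads the default branch:

# Sketch: offline inference with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",                  # select the quant branch
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    max_model_len=32768,
    gpu_memory_utilization=0.70,
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
messages = [
    {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
    {"role": "user", "content": "Outline a retrieval pipeline for legal documents."},
]
outputs = llm.chat(messages, params)   # applies the repo's chat template
print(outputs[0].outputs[0].text)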

Query via Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Behemoth, helpful, precise, and safe."},
      {"role":"user","content":"Outline a retrieval pipeline for legal documents."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
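
The same request via the openai Python client; the endpoint is OpenAI-compatible, and api_key can be any placeholder unless the server was started with --api-key:

# Sketch: the Chat Completions call above, via the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
        {"role": "user", "content": "Outline a retrieval pipeline for legal documents."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)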

Note: compressed-tensors is a vLLM runtime format. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ .safetensors compatible with Transformers) or full‑precision weights.


Prompting / Chat Template

This package inherits the Mistral‑Instruct chat conventions from its parent finetune. If a chat_template.jinja is present, it is applied automatically by apply_chat_template within serving stacks that support it.
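
To see exactly what the template produces (for example, when building prompts outside a serving stack), the tokenizer can be used on its own. A minimal sketch with 🤗 Transformers' AutoTokenizer; loading the tokenizer does not require the quantized weights:

# Sketch: render the chat template without loading the model weights.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",
)
messages = [
    {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
    {"role": "user", "content": "Summarize the key clauses of this contract."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)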

Tips

  • Provide a concise system role.
  • Structure multi‑step user prompts explicitly.
  • For tool use, include clear schemas and results to minimize hallucinations.

Recommended Generation Settings

Starting points (tune for your latency/quality targets):

  • temperature: 0.4–0.9 (0.6–0.8 common)
  • top_p: 0.9–0.95
  • max_new_tokens: 256–2048+
  • Optional repetition_penalty: 1.05–1.15
  • Enable vLLM batching/scheduling features for throughput.
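
Expressed as vLLM sampling parameters, a middle-of-the-road starting point from the ranges above might look like this (values are examples, not tuned recommendations):

# Sketch: the ranges above as concrete vLLM SamplingParams.
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.7,          # 0.4-0.9 range, 0.6-0.8 common
    top_p=0.95,               # 0.9-0.95
    max_tokens=1024,          # 256-2048+ depending on task
    repetition_penalty=1.1,   # optional, 1.05-1.15
)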

Hardware Guidance

  • 123B is large; multi‑GPU with tensor parallelism is recommended.
  • Quantization reduces weights memory; the KV cache (activations) still dominates at long context. Adjust --max-model-len and batch size accordingly (see the sizing sketch after this list).
  • Use BF16 where supported; otherwise FP16.
  • CUDA Graphs can help if stable in your stack.
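
As a rough way to size --max-model-len and batch counts, the per-token KV-cache footprint can be estimated from the model's config.json. A sketch using the standard formula 2 x layers x kv_heads x head_dim x bytes per token; the config field names below are the usual Mistral/Llama-style keys and should be checked against the actual file:

# Sketch: rough KV-cache sizing from config.json (field names assumed; verify against the file).
import json

cfg = json.load(open("config.json"))
layers   = cfg["num_hidden_layers"]
kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])
head_dim = cfg.get("head_dim", cfg["hidden_size"] // cfg["num_attention_heads"])
bytes_per_elem = 2  # BF16/FP16 KV cache

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
ctx = 32768
print(f"~{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"~{kv_bytes_per_token * ctx / 2**30:.1f} GiB per {ctx}-token sequence")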

Evaluation & Safety

  • No official benchmark set is included; evaluate on your tasks before production.
  • Apply content safety, guardrails, and human review for high‑stakes use cases.

License & Usage

This distribution inherits licenses/restrictions of:

  • mistralai/Mistral-Large-Instruct-2411 (base)
  • TheDrummer/Behemoth-R1-123B-v2 (finetune)

Using this model implies acceptance of the upstream terms.


Changelog

  • v2 (current) — Quantized releases (AWQ W4A16, AWQ W4A16_ASYM, and INT8 W8A16) published under TheHouseOfTheDude.

Links