Helion-V1.5-XL (Preview)

Model Overview

Helion-V1.5-XL is a 16.2-billion-parameter large language model for advanced natural language understanding and generation. Built on the foundation of Helion-V1.5, this XL variant incorporates architectural improvements, expanded training data, and enhanced optimization techniques, and improves on Helion-V1.5 on every benchmark below where both models are reported.

The model employs a decoder-only transformer architecture with Grouped Query Attention (GQA), RoPE positional encodings, and SwiGLU activations. Training utilized 4.5 trillion tokens from curated high-quality sources spanning web text, scientific literature, code repositories, and instruction-following datasets.

Architecture Specifications

Model Type:              Decoder-Only Transformer
Total Parameters:        16,247,832,576
Trainable Parameters:    16,247,832,576
Non-trainable Parameters: 0

Layers:                  48
Attention Heads:         32 (Query)
Key-Value Heads:         8 (GQA)
Hidden Dimension:        6144
Intermediate Dimension:  24576
Head Dimension:          192

Vocabulary Size:         100,000
Maximum Context Length:  16,384 tokens
RoPE Theta:             10,000.0
RoPE Scaling:           Linear (factor: 2.0)

Activation Function:     SwiGLU
Normalization:          RMSNorm (eps: 1e-6)
Attention Mechanism:    Grouped Query Attention
Positional Encoding:    Rotary Position Embedding
Flash Attention:        Enabled (v2)

Precision:              bfloat16
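
For illustration, the Grouped Query Attention layout above can be sketched in a few lines of PyTorch. This is not the model's actual implementation (RoPE, RMSNorm, KV caching, and Flash Attention are omitted); it only shows how the 32 query heads share 8 key-value heads at head dimension 192.

# Minimal GQA sketch using this card's dimensions (32 query heads, 8 KV heads,
# head dim 192, hidden dim 6144). Illustrative only; not the Helion source code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQASketch(nn.Module):
    def __init__(self, hidden=6144, n_q_heads=32, n_kv_heads=8, head_dim=192):
        super().__init__()
        self.n_q_heads, self.n_kv_heads, self.head_dim = n_q_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(hidden, n_q_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * head_dim, hidden, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of 4 query heads (32 / 8) attends with the same key-value head.
        k = k.repeat_interleave(self.n_q_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_q_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))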

Performance Benchmarks

Language Understanding

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | Mistral-7B | GPT-3.5-Turbo |
|---|---|---|---|---|---|---|
| MMLU (5-shot) | Accuracy | 78.9 | 62.3 | 55.8 | 62.5 | 70.0 |
| HellaSwag (10-shot) | Accuracy | 85.7 | 79.1 | 82.3 | 81.3 | 85.5 |
| ARC-Challenge (25-shot) | Accuracy | 82.1 | 71.4 | 78.9 | 79.8 | 85.2 |
| ARC-Easy (25-shot) | Accuracy | 89.6 | 84.2 | 85.3 | 87.1 | 91.3 |
| PIQA (zero-shot) | Accuracy | 83.4 | 79.8 | 80.5 | 81.2 | 84.1 |
| WinoGrande (5-shot) | Accuracy | 77.3 | 72.1 | 73.7 | 74.8 | 78.2 |
| OpenBookQA (zero-shot) | Accuracy | 68.7 | 61.4 | 63.2 | 65.9 | 71.5 |
| BoolQ (zero-shot) | Accuracy | 84.9 | 79.6 | 81.2 | 82.4 | 86.7 |

Reasoning and Common Sense

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | Mistral-7B | GPT-3.5-Turbo |
|---|---|---|---|---|---|---|
| GSM8K (8-shot) | Accuracy | 71.6 | 48.2 | 28.7 | 52.2 | 57.1 |
| MATH (4-shot) | Accuracy | 34.7 | 18.9 | 13.5 | 28.4 | 34.1 |
| BBH (3-shot) | Average | 61.8 | 49.3 | 47.2 | 56.1 | 65.4 |
| DROP (3-shot) | F1 Score | 69.4 | 58.7 | 62.1 | 64.8 | 73.2 |
| CommonsenseQA (7-shot) | Accuracy | 76.9 | 68.4 | 70.1 | 73.2 | 79.1 |

Code Generation and Understanding

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | CodeLLaMA-13B | GPT-3.5-Turbo |
|---|---|---|---|---|---|---|
| HumanEval (pass@1) | Pass Rate | 67.8 | 45.2 | 29.3 | 46.2 | 48.1 |
| HumanEval (pass@10) | Pass Rate | 84.3 | 67.9 | 54.1 | 71.8 | 72.5 |
| MBPP (pass@1) | Pass Rate | 72.4 | 53.8 | 42.7 | 58.3 | 61.2 |
| MBPP (pass@10) | Pass Rate | 87.6 | 74.1 | 68.4 | 79.5 | 81.9 |
| DS-1000 | Pass Rate | 48.9 | 32.1 | 28.4 | 41.7 | 52.3 |
| CodeXGLUE | Average | 81.2 | 69.4 | 65.8 | 74.6 | 83.7 |

Multilingual Performance

| Language | FLORES-101 (BLEU) | XNLI (Accuracy) | XStoryCloze (Accuracy) |
|---|---|---|---|
| English | 100.0 (reference) | 89.4 | 91.2 |
| Spanish | 87.3 | 84.6 | 86.9 |
| French | 86.9 | 83.8 | 85.4 |
| German | 85.1 | 82.7 | 84.1 |
| Chinese (Simplified) | 82.4 | 81.3 | 83.7 |
| Japanese | 81.8 | 79.8 | 82.4 |
| Korean | 80.9 | 78.6 | 81.1 |
| Russian | 79.7 | 80.2 | 82.8 |
| Arabic | 77.3 | 76.4 | 78.9 |
| Hindi | 76.8 | 75.1 | 77.6 |
| Portuguese | 86.1 | 83.2 | 85.7 |
| Italian | 85.4 | 82.9 | 84.8 |

Truthfulness and Safety

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | GPT-3.5-Turbo |
|---|---|---|---|---|---|
| TruthfulQA | MC1 | 61.3 | 45.8 | 50.2 | 47.0 |
| TruthfulQA | MC2 | 73.8 | 62.1 | 65.4 | 64.2 |
| ToxiGen | Toxicity (lower is better) | 2.1% | 3.8% | 4.2% | 1.9% |
| BOLD | Bias Score (lower is better) | 0.34 | 0.47 | 0.51 | 0.29 |

Long Context Understanding

| Benchmark | Context Length | Metric | Helion-V1.5-XL | LLaMA-2-13B | GPT-3.5-Turbo |
|---|---|---|---|---|---|
| SCROLLS (QuALITY) | 4K-6K | F1 | 71.4 | 62.8 | 73.9 |
| SCROLLS (Qasper) | 3K-5K | F1 | 68.7 | 59.3 | 71.2 |
| LongBench (SingleDoc QA) | 8K-12K | Accuracy | 63.2 | 51.7 | 67.8 |
| LongBench (MultiDoc QA) | 10K-16K | Accuracy | 58.9 | 44.3 | 63.4 |

Training Methodology

Dataset Composition

The training corpus consists of 4.5 trillion tokens sampled from the following sources:

| Data Source | Token Count | Percentage | Description |
|---|---|---|---|
| Filtered Web Text | 2.025T | 45% | CommonCrawl filtered for quality, deduplicated |
| Books and Literature | 900B | 20% | Fiction, non-fiction, technical books |
| Code Repositories | 675B | 15% | GitHub, StackOverflow, documentation |
| Scientific Papers | 450B | 10% | ArXiv, PubMed, academic repositories |
| Instruction Data | 360B | 8% | Curated instruction-response pairs |
| Multilingual Corpora | 90B | 2% | Parallel texts, translations, non-English web |
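
The per-source token counts follow directly from the 4.5T total and the listed percentages; the short check below reproduces the table's figures.

# Reproduce the per-source token counts from the 4.5T total and the mixture
# percentages listed in the table above.
TOTAL_TOKENS = 4.5e12

mixture = {
    "Filtered Web Text": 0.45,
    "Books and Literature": 0.20,
    "Code Repositories": 0.15,
    "Scientific Papers": 0.10,
    "Instruction Data": 0.08,
    "Multilingual Corpora": 0.02,
}

assert abs(sum(mixture.values()) - 1.0) < 1e-9  # percentages sum to 100%

for source, frac in mixture.items():
    print(f"{source:22s} {frac * TOTAL_TOKENS / 1e9:8.0f}B tokens")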

Training Infrastructure

Compute Resources:        512x NVIDIA A100 80GB GPUs
Total Training Time:      672 hours (28 days)
Framework:               PyTorch 2.0.1 with FSDP
Distributed Strategy:     Fully Sharded Data Parallel (FSDP)
Mixed Precision:         bfloat16 with stochastic rounding
Communication Backend:    NCCL with InfiniBand

Total FLOPs:             ~8.2e24 FLOPs
GPU Hours:               ~344,064 GPU-hours
Peak Memory per GPU:     72GB
Interconnect Bandwidth:  400 Gbps per GPU

Optimization Configuration

Optimizer:               AdamW
Beta1:                   0.9
Beta2:                   0.95
Epsilon:                 1e-8
Weight Decay:            0.1
Gradient Clipping:       1.0

Learning Rate Schedule:  Cosine with Warmup
Peak Learning Rate:      3.0e-4
Minimum Learning Rate:   3.0e-5
Warmup Steps:            2,000
Total Training Steps:    875,000

Batch Configuration:
  Global Batch Size:     4,194,304 tokens
  Micro Batch Size:      32 samples
  Gradient Accumulation: 8 steps
  Sequence Length:       4,096 tokens

Checkpointing:
  Activation Checkpointing: Enabled
  Checkpoint Interval:      5,000 steps
  Total Checkpoints Saved:  175
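
As a reference, the optimizer and learning-rate schedule listed above can be reconstructed with standard PyTorch components. This is an illustrative sketch using the hyperparameters from this card, not the actual training code.

# Illustrative sketch of the AdamW + warmup-cosine schedule described above.
import math
import torch

PEAK_LR, MIN_LR = 3.0e-4, 3.0e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 875_000

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to the minimum LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at(step) / PEAK_LR
)

# One illustrative optimization step (forward pass and loss.backward() omitted):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()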

Training Stages

Stage 1: Pre-training (3.8T tokens)

  • Duration: 750,000 steps
  • Objective: Next-token prediction
  • Data: General corpus (web, books, code, scientific)
  • Learning Rate: Full cosine schedule

Stage 2: Domain Adaptation (500B tokens)

  • Duration: 80,000 steps
  • Objective: Continued pre-training on specialized domains
  • Data: Enhanced code, mathematics, scientific reasoning
  • Learning Rate: 1.0e-4 constant

Stage 3: Instruction Tuning (200B tokens)

  • Duration: 45,000 steps
  • Objective: Instruction following and task alignment
  • Data: High-quality instruction-response pairs
  • Learning Rate: 5.0e-5 with linear decay

Installation and Usage

Requirements

pip install torch>=2.0.0 transformers>=4.35.0 accelerate>=0.24.0

Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "DeepXR/Helion-V1.5-XL"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

prompt = "Explain the concept of quantum entanglement:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

4-bit Quantization

from transformers import BitsAndBytesConfig
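# 4-bit loading requires the bitsandbytes package (pip install bitsandbytes)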

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

Chat Format

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the implications of the P vs NP problem?"}
]

prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hardware Requirements

Memory Requirements (Inference)

| Precision | Memory Required | Recommended GPU |
|---|---|---|
| FP32 | 64.9 GB | 2x A100 80GB |
| BF16/FP16 | 32.5 GB | A100 40GB, A6000 |
| INT8 | 16.8 GB | RTX 4090, A40 |
| INT4 (NF4) | 9.2 GB | RTX 3090, RTX 4080 |
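
The FP32 and BF16/FP16 figures above track the raw weight storage (parameter count times bytes per parameter); the quantized figures are somewhat higher than the raw estimate because of quantization metadata and layers kept in higher precision. A rough back-of-the-envelope estimate, excluding activations and KV cache:

# Rough weight-memory estimate from the parameter count on this card.
# Excludes activations, KV cache, and quantization overhead (scales, outliers).
N_PARAMS = 16_247_832_576

bytes_per_param = {"FP32": 4.0, "BF16/FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = N_PARAMS * nbytes / 1e9
    print(f"{precision:10s} ~{gb:5.1f} GB of weights")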

Inference Performance

| Hardware | Precision | Tokens/Second | Batch Size |
|---|---|---|---|
| A100 80GB | BF16 | 47.3 | 1 |
| A100 80GB | INT8 | 89.6 | 1 |
| A100 80GB | INT4 | 134.2 | 1 |
| H100 80GB | BF16 | 78.1 | 1 |
| H100 80GB | INT4 | 218.7 | 1 |
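
Figures of this kind are typically obtained by timing a generate call and dividing the number of generated tokens by wall-clock time. A minimal sketch (not the exact benchmarking harness used for this card), assuming a CUDA device and that model and tokenizer are loaded as in the examples above:

# Minimal throughput measurement sketch (batch size 1).
import time
import torch

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/second")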

Limitations and Biases

Known Limitations

  1. Knowledge Cutoff: Training data extends through January 2024. The model lacks awareness of subsequent events.

  2. Hallucination: The model may generate plausible but factually incorrect information with high confidence.

  3. Arithmetic Precision: While improved over baseline, complex multi-step mathematical computations may contain errors.

  4. Context Length Degradation: Performance decreases beyond 12,000 tokens despite 16,384 token capacity.

  5. Specialized Domain Knowledge: May lack depth in highly specialized technical, medical, or legal domains.

  6. Code Execution: Generated code requires validation and testing before deployment.

Bias Analysis

The model has been evaluated for biases across multiple dimensions:

  • Gender Bias: BOLD gender bias score of 0.34 (lower is better)
  • Racial Bias: Demonstrates residual stereotypical associations in certain contexts
  • Geographic Bias: Western-centric knowledge distribution
  • Language Bias: Performance degrades for lower-resource languages

Mitigation strategies include balanced dataset sampling, bias-aware fine-tuning, and constitutional AI principles during alignment.

Evaluation Methodology

All benchmarks were evaluated using the Language Model Evaluation Harness (lm-evaluation-harness) with standardized few-shot settings. Code evaluation used the standard HumanEval and MBPP test suites with temperature 0.2 sampling. Multilingual benchmarks employed zero-shot evaluation for consistency.
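
For reproduction, a typical invocation through the harness's Python API looks roughly like the following; the exact task names, few-shot settings, and harness version used for this card may differ. MMLU at 5-shot is shown as an example.

# Sketch of reproducing an evaluation with lm-evaluation-harness (v0.4+).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=DeepXR/Helion-V1.5-XL,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])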

License

This model is released under the Apache License 2.0.

Copyright 2025 DeepXR

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Citation

@misc{helion-v15-xl-2025,
  title={Helion-V1.5-XL: A 16B Parameter Instruction-Tuned Language Model},
  author={DeepXR Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/DeepXR/Helion-V1.5-XL}
}

Acknowledgments

Training was carried out on the cloud GPU infrastructure described above (512x NVIDIA A100 80GB). Dataset curation benefited from open-source contributions, including The Pile, RedPajama, and community-curated instruction datasets.
