Helion-V1.5-XL (Preview)

Model Overview

Helion-V1.5-XL is a 16.2-billion-parameter large language model for advanced natural language understanding and generation. Built on the foundation of Helion-V1.5, this XL variant incorporates architectural improvements, expanded training data, and enhanced optimization techniques, and improves on Helion-V1.5 on every benchmark below where both models are reported.

The model employs a decoder-only transformer architecture with Grouped Query Attention (GQA), RoPE positional encodings, and SwiGLU activations. Training utilized 4.5 trillion tokens from curated high-quality sources spanning web text, scientific literature, code repositories, and instruction-following datasets.

Architecture Specifications

Model Type:              Decoder-Only Transformer
Total Parameters:        16,247,832,576
Trainable Parameters:    16,247,832,576
Non-trainable Parameters: 0

Layers:                  48
Attention Heads:         32 (Query)
Key-Value Heads:         8 (GQA)
Hidden Dimension:        6144
Intermediate Dimension:  24576
Head Dimension:          192

Vocabulary Size:         100,000
Maximum Context Length:  16,384 tokens
RoPE Theta:             10,000.0
RoPE Scaling:           Linear (factor: 2.0)

Activation Function:     SwiGLU
Normalization:          RMSNorm (eps: 1e-6)
Attention Mechanism:    Grouped Query Attention
Positional Encoding:    Rotary Position Embedding
Flash Attention:        Enabled (v2)

Precision:              bfloat16
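
For illustration, the Grouped Query Attention layout above can be sketched in a few lines of PyTorch. This is not the model's actual implementation (RoPE, RMSNorm, KV caching, and Flash Attention are omitted); it only shows how the 32 query heads share 8 key-value heads at head dimension 192.

# Minimal GQA sketch using this card's dimensions (32 query heads, 8 KV heads,
# head dim 192, hidden dim 6144). Illustrative only; not the Helion source code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQASketch(nn.Module):
    def __init__(self, hidden=6144, n_q_heads=32, n_kv_heads=8, head_dim=192):
        super().__init__()
        self.n_q_heads, self.n_kv_heads, self.head_dim = n_q_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(hidden, n_q_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * head_dim, hidden, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of 4 query heads (32 / 8) attends with the same key-value head.
        k = k.repeat_interleave(self.n_q_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_q_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))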

Performance Benchmarks

Language Understanding

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | Mistral-7B | GPT-3.5-Turbo |
|---|---|---|---|---|---|---|
| MMLU (5-shot) | Accuracy | 78.9 | 62.3 | 55.8 | 62.5 | 70.0 |
| HellaSwag (10-shot) | Accuracy | 85.7 | 79.1 | 82.3 | 81.3 | 85.5 |
| ARC-Challenge (25-shot) | Accuracy | 82.1 | 71.4 | 78.9 | 79.8 | 85.2 |
| ARC-Easy (25-shot) | Accuracy | 89.6 | 84.2 | 85.3 | 87.1 | 91.3 |
| PIQA (zero-shot) | Accuracy | 83.4 | 79.8 | 80.5 | 81.2 | 84.1 |
| WinoGrande (5-shot) | Accuracy | 77.3 | 72.1 | 73.7 | 74.8 | 78.2 |
| OpenBookQA (zero-shot) | Accuracy | 68.7 | 61.4 | 63.2 | 65.9 | 71.5 |
| BoolQ (zero-shot) | Accuracy | 84.9 | 79.6 | 81.2 | 82.4 | 86.7 |

Reasoning and Common Sense

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | Mistral-7B | GPT-3.5-Turbo |
|---|---|---|---|---|---|---|
| GSM8K (8-shot) | Accuracy | 71.6 | 48.2 | 28.7 | 52.2 | 57.1 |
| MATH (4-shot) | Accuracy | 34.7 | 18.9 | 13.5 | 28.4 | 34.1 |
| BBH (3-shot) | Average | 61.8 | 49.3 | 47.2 | 56.1 | 65.4 |
| DROP (3-shot) | F1 Score | 69.4 | 58.7 | 62.1 | 64.8 | 73.2 |
| CommonsenseQA (7-shot) | Accuracy | 76.9 | 68.4 | 70.1 | 73.2 | 79.1 |

Code Generation and Understanding

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | CodeLLaMA-13B | GPT-3.5-Turbo |
|---|---|---|---|---|---|---|
| HumanEval (pass@1) | Pass Rate | 67.8 | 45.2 | 29.3 | 46.2 | 48.1 |
| HumanEval (pass@10) | Pass Rate | 84.3 | 67.9 | 54.1 | 71.8 | 72.5 |
| MBPP (pass@1) | Pass Rate | 72.4 | 53.8 | 42.7 | 58.3 | 61.2 |
| MBPP (pass@10) | Pass Rate | 87.6 | 74.1 | 68.4 | 79.5 | 81.9 |
| DS-1000 | Pass Rate | 48.9 | 32.1 | 28.4 | 41.7 | 52.3 |
| CodeXGLUE | Average | 81.2 | 69.4 | 65.8 | 74.6 | 83.7 |

Multilingual Performance

| Language | FLORES-101 (BLEU) | XNLI (Accuracy) | XStoryCloze (Accuracy) |
|---|---|---|---|
| English | 100.0 (reference) | 89.4 | 91.2 |
| Spanish | 87.3 | 84.6 | 86.9 |
| French | 86.9 | 83.8 | 85.4 |
| German | 85.1 | 82.7 | 84.1 |
| Chinese (Simplified) | 82.4 | 81.3 | 83.7 |
| Japanese | 81.8 | 79.8 | 82.4 |
| Korean | 80.9 | 78.6 | 81.1 |
| Russian | 79.7 | 80.2 | 82.8 |
| Arabic | 77.3 | 76.4 | 78.9 |
| Hindi | 76.8 | 75.1 | 77.6 |
| Portuguese | 86.1 | 83.2 | 85.7 |
| Italian | 85.4 | 82.9 | 84.8 |

Truthfulness and Safety

| Benchmark | Metric | Helion-V1.5-XL | Helion-V1.5 | LLaMA-2-13B | GPT-3.5-Turbo |
|---|---|---|---|---|---|
| TruthfulQA | MC1 | 61.3 | 45.8 | 50.2 | 47.0 |
| TruthfulQA | MC2 | 73.8 | 62.1 | 65.4 | 64.2 |
| ToxiGen | Toxicity (lower is better) | 2.1% | 3.8% | 4.2% | 1.9% |
| BOLD | Bias Score (lower is better) | 0.34 | 0.47 | 0.51 | 0.29 |

Long Context Understanding

| Benchmark | Context Length | Metric | Helion-V1.5-XL | LLaMA-2-13B | GPT-3.5-Turbo |
|---|---|---|---|---|---|
| SCROLLS (QuALITY) | 4K-6K | F1 | 71.4 | 62.8 | 73.9 |
| SCROLLS (Qasper) | 3K-5K | F1 | 68.7 | 59.3 | 71.2 |
| LongBench (SingleDoc QA) | 8K-12K | Accuracy | 63.2 | 51.7 | 67.8 |
| LongBench (MultiDoc QA) | 10K-16K | Accuracy | 58.9 | 44.3 | 63.4 |

Training Methodology

Dataset Composition

The training corpus consists of 4.5 trillion tokens sampled from the following sources:

| Data Source | Token Count | Percentage | Description |
|---|---|---|---|
| Filtered Web Text | 2.025T | 45% | CommonCrawl filtered for quality, deduplicated |
| Books and Literature | 900B | 20% | Fiction, non-fiction, technical books |
| Code Repositories | 675B | 15% | GitHub, StackOverflow, documentation |
| Scientific Papers | 450B | 10% | ArXiv, PubMed, academic repositories |
| Instruction Data | 360B | 8% | Curated instruction-response pairs |
| Multilingual Corpora | 90B | 2% | Parallel texts, translations, non-English web |
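
The per-source token counts follow directly from the 4.5T total and the listed percentages; the short check below reproduces the table's figures.

# Reproduce the per-source token counts from the 4.5T total and the mixture
# percentages listed in the table above.
TOTAL_TOKENS = 4.5e12

mixture = {
    "Filtered Web Text": 0.45,
    "Books and Literature": 0.20,
    "Code Repositories": 0.15,
    "Scientific Papers": 0.10,
    "Instruction Data": 0.08,
    "Multilingual Corpora": 0.02,
}

assert abs(sum(mixture.values()) - 1.0) < 1e-9  # percentages sum to 100%

for source, frac in mixture.items():
    print(f"{source:22s} {frac * TOTAL_TOKENS / 1e9:8.0f}B tokens")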

Training Infrastructure

Compute Resources:        512x NVIDIA A100 80GB GPUs
Total Training Time:      672 hours (28 days)
Framework:               PyTorch 2.0.1 with FSDP
Distributed Strategy:     Fully Sharded Data Parallel (FSDP)
Mixed Precision:         bfloat16 with stochastic rounding
Communication Backend:    NCCL with InfiniBand

Total FLOPs:             ~8.2e24 FLOPs
GPU Hours:               ~344,064 GPU-hours
Peak Memory per GPU:     72GB
Interconnect Bandwidth:  400 Gbps per GPU

Optimization Configuration

Optimizer:               AdamW
Beta1:                   0.9
Beta2:                   0.95
Epsilon:                 1e-8
Weight Decay:            0.1
Gradient Clipping:       1.0

Learning Rate Schedule:  Cosine with Warmup
Peak Learning Rate:      3.0e-4
Minimum Learning Rate:   3.0e-5
Warmup Steps:            2,000
Total Training Steps:    875,000

Batch Configuration:
  Global Batch Size:     4,194,304 tokens
  Micro Batch Size:      32 samples
  Gradient Accumulation: 8 steps
  Sequence Length:       4,096 tokens

Checkpointing:
  Activation Checkpointing: Enabled
  Checkpoint Interval:      5,000 steps
  Total Checkpoints Saved:  175
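
As a reference, the optimizer and learning-rate schedule listed above can be reconstructed with standard PyTorch components. This is an illustrative sketch using the hyperparameters from this card, not the actual training code.

# Illustrative sketch of the AdamW + warmup-cosine schedule described above.
import math
import torch

PEAK_LR, MIN_LR = 3.0e-4, 3.0e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 875_000

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to the minimum LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at(step) / PEAK_LR
)

# One illustrative optimization step (forward pass and loss.backward() omitted):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()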

Training Stages

Stage 1: Pre-training (3.8T tokens)

  • Duration: 750,000 steps
  • Objective: Next-token prediction
  • Data: General corpus (web, books, code, scientific)
  • Learning Rate: Full cosine schedule

Stage 2: Domain Adaptation (500B tokens)

  • Duration: 80,000 steps
  • Objective: Continued pre-training on specialized domains
  • Data: Enhanced code, mathematics, scientific reasoning
  • Learning Rate: 1.0e-4 constant

Stage 3: Instruction Tuning (200B tokens)

  • Duration: 45,000 steps
  • Objective: Instruction following and task alignment
  • Data: High-quality instruction-response pairs
  • Learning Rate: 5.0e-5 with linear decay

Installation and Usage

Requirements

pip install torch>=2.0.0 transformers>=4.35.0 accelerate>=0.24.0

Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "DeepXR/Helion-V1.5-XL"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

prompt = "Explain the concept of quantum entanglement:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

4-bit Quantization

from transformers import BitsAndBytesConfig
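# 4-bit loading requires the bitsandbytes package (pip install bitsandbytes)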

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

Chat Format

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the implications of the P vs NP problem?"}
]

prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hardware Requirements

Memory Requirements (Inference)

| Precision | Memory Required | Recommended GPU |
|---|---|---|
| FP32 | 64.9 GB | 2x A100 80GB |
| BF16/FP16 | 32.5 GB | A100 40GB, A6000 |
| INT8 | 16.8 GB | RTX 4090, A40 |
| INT4 (NF4) | 9.2 GB | RTX 3090, RTX 4080 |
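
The FP32 and BF16/FP16 figures above track the raw weight storage (parameter count times bytes per parameter); the quantized figures are somewhat higher than the raw estimate because of quantization metadata and layers kept in higher precision. A rough back-of-the-envelope estimate, excluding activations and KV cache:

# Rough weight-memory estimate from the parameter count on this card.
# Excludes activations, KV cache, and quantization overhead (scales, outliers).
N_PARAMS = 16_247_832_576

bytes_per_param = {"FP32": 4.0, "BF16/FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = N_PARAMS * nbytes / 1e9
    print(f"{precision:10s} ~{gb:5.1f} GB of weights")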

Inference Performance

| Hardware | Precision | Tokens/Second | Batch Size |
|---|---|---|---|
| A100 80GB | BF16 | 47.3 | 1 |
| A100 80GB | INT8 | 89.6 | 1 |
| A100 80GB | INT4 | 134.2 | 1 |
| H100 80GB | BF16 | 78.1 | 1 |
| H100 80GB | INT4 | 218.7 | 1 |
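
Figures of this kind are typically obtained by timing a generate call and dividing the number of generated tokens by wall-clock time. A minimal sketch (not the exact benchmarking harness used for this card), assuming a CUDA device and that model and tokenizer are loaded as in the examples above:

# Minimal throughput measurement sketch (batch size 1).
import time
import torch

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/second")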

Limitations and Biases

Known Limitations

  1. Knowledge Cutoff: Training data extends through January 2024. The model lacks awareness of subsequent events.

  2. Hallucination: The model may generate plausible but factually incorrect information with high confidence.

  3. Arithmetic Precision: While improved over baseline, complex multi-step mathematical computations may contain errors.

  4. Context Length Degradation: Performance decreases beyond 12,000 tokens despite 16,384 token capacity.

  5. Specialized Domain Knowledge: May lack depth in highly specialized technical, medical, or legal domains.

  6. Code Execution: Generated code requires validation and testing before deployment.

Bias Analysis

The model has been evaluated for biases across multiple dimensions:

  • Gender Bias: BOLD gender bias score of 0.34 (lower is better)
  • Racial Bias: Demonstrates residual stereotypical associations in certain contexts
  • Geographic Bias: Western-centric knowledge distribution
  • Language Bias: Performance degrades for lower-resource languages

Mitigation strategies include balanced dataset sampling, bias-aware fine-tuning, and constitutional AI principles during alignment.

Evaluation Methodology

All benchmarks were evaluated using the Language Model Evaluation Harness (lm-evaluation-harness) with standardized few-shot settings. Code evaluation used the standard HumanEval and MBPP test suites with temperature 0.2 sampling. Multilingual benchmarks employed zero-shot evaluation for consistency.
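
For reproduction, a typical invocation through the harness's Python API looks roughly like the following; the exact task names, few-shot settings, and harness version used for this card may differ. MMLU at 5-shot is shown as an example.

# Sketch of reproducing an evaluation with lm-evaluation-harness (v0.4+).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=DeepXR/Helion-V1.5-XL,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])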

License

This model is released under the Apache License 2.0.

Copyright 2025 DeepXR

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Citation

@misc{helion-v15-xl-2025,
  title={Helion-V1.5-XL: A 16B Parameter Instruction-Tuned Language Model},
  author={DeepXR Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/DeepXR/Helion-V1.5-XL}
}

Acknowledgments

Training was carried out on the cloud GPU infrastructure described above (512x NVIDIA A100 80GB). Dataset curation benefited from open-source contributions, including The Pile, RedPajama, and community-curated instruction datasets.
