Zen Max - Kimi K2 Thinking Architecture

Organization: Zen LM (Hanzo AI × Zoo Labs Foundation)
Base Model: Moonshot AI Kimi K2 Thinking (DeepseekV3ForCausalLM)
Parameters: 671B total (384 experts × ~1.75B each, 8 active per token = ~14B)
License: Apache 2.0
Context Window: 256K tokens
Thinking Capacity: 96K-128K thinking tokens per step
Architecture: DeepseekV3 MoE (Mixture of Experts)
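
The parameter figures above follow from simple arithmetic. A minimal sketch, assuming every expert has the same approximate size and ignoring non-expert weights (attention, embeddings), which the summary does not break out:

```python
# Figures quoted in the model summary; the per-expert size is approximate.
NUM_EXPERTS = 384
ACTIVE_EXPERTS_PER_TOKEN = 8
PARAMS_PER_EXPERT_B = 1.75  # billions of parameters, approximate

total_b = NUM_EXPERTS * PARAMS_PER_EXPERT_B
active_b = ACTIVE_EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT_B
active_fraction = ACTIVE_EXPERTS_PER_TOKEN / NUM_EXPERTS

print(f"total  ~{total_b:.0f}B parameters")
print(f"active ~{active_b:.0f}B parameters per token ({active_fraction:.1%} of experts)")
```

The product comes out at roughly 672B, which matches the quoted 671B within the rounding of the per-expert size.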

Model Overview

Zen Max is a reasoning-first language model built on Moonshot AI's Kimi K2 Thinking architecture, designed for test-time scaling through extended thinking and tool-calling capabilities.

Built as a thinking agent, Zen Max reasons step by step while using tools. It can execute 200-300 sequential tool calls without human intervention and remain coherent across hundreds of steps when solving complex problems.

Note: This repository contains configuration files and documentation for Zen Max. The full model weights (~1TB) are available from the base model: moonshotai/Kimi-K2-Thinking. Zen-specific fine-tuning instructions and adapters will be provided in future releases.

Key Capabilities

1. Agentic Reasoning (HLE: 44.9%)

  • Extended chain-of-thought reasoning with <think> tags
  • Multi-step planning and execution
  • Adaptive reasoning with hypothesis generation and refinement
  • Think → search → code → verify → think cycles
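
Because reasoning is emitted inside <think> tags, a client usually needs to separate the trace from the final answer. A minimal sketch, assuming the raw completion is plain text with the reasoning wrapped in `<think>...</think>`; `split_thinking` is a hypothetical helper, not part of any Zen or Kimi library:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion that uses <think> tags."""
    traces = [m.strip() for m in THINK_RE.findall(completion)]
    answer = THINK_RE.sub("", completion).strip()
    return "\n".join(traces), answer

sample = "<think>2 + 2 is basic arithmetic.</think>The answer is 4."
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```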

2. Agentic Search & Browsing (BrowseComp: 60.2%)

  • Goal-directed web-based reasoning
  • 200-300 sequential tool calls for information gathering
  • Real-world information collection and synthesis
  • Dynamic search → browser → reasoning loops

3. Agentic Coding (SWE-Bench Verified: 71.3%)

  • Multi-language support (100+ languages)
  • Agentic coding workflows with tool integration
  • Component-heavy web development (React, HTML)
  • Terminal automation (Terminal-Bench: 47.1%)

4. Mathematical Reasoning

  • AIME 2025: 99.1% (with Python)
  • HMMT 2025: 95.1% (with Python)
  • IMO-AnswerBench: 78.6%
  • GPQA-Diamond: 84.5%

Architecture Features

Test-Time Scaling

  • Thinking Tokens: 96K-128K per reasoning step
  • Extended Context: 256K tokens
  • Sequential Tool Calls: 200-300 without human intervention
  • Parallel Rollouts: Heavy mode with 8 simultaneous trajectories

INT4 Quantization-Aware Training

  • Native INT4 inference support
  • 2x generation speed improvement
  • State-of-the-art performance at INT4 precision
  • Optimized for low-bit quantization during post-training

Inference Efficiency

  • Quantization-aware training (QAT) for MoE components
  • INT4 weight-only quantization
  • ~50% latency reduction
  • Minimal performance degradation
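
To make "INT4 weight-only quantization" concrete, here is a minimal pure-Python sketch of symmetric per-group quantization. It assumes a signed 4-bit range of [-8, 7] and one scale per group; the group size and rounding scheme actually used for this model are not stated above, so these choices are illustrative only.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization of one weight group to signed 4-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest weight maps to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]

group = [0.70, -0.30, 0.10, -0.70]
codes, scale = quantize_int4(group)
restored = dequantize_int4(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(scale, 4), max_error)
```

The rounding error per weight is bounded by half the scale, which is why quantization-aware training matters: the model learns weights that tolerate this grid.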

Benchmark Performance

Reasoning Tasks

| Benchmark | Score | Notes |
|---|---|---|
| HLE (with tools) | 44.9% | vs. human baseline 29.2% |
| AIME 2025 (with Python) | 99.1% | 75.2% without tools |
| HMMT 2025 (with Python) | 95.1% | 70.4% without tools |
| IMO-AnswerBench | 78.6% | Mathematical olympiad |
| GPQA-Diamond | 84.5% | Expert-level questions |

Agentic Search

| Benchmark | Score | Notes |
|---|---|---|
| BrowseComp | 60.2% | vs. human 29.2% |
| BrowseComp-ZH | 62.3% | Chinese browsing |
| Seal-0 | 56.3% | Real-world info |
| FinSearchComp-T3 | 47.4% | Financial search |
| Frames | 87.0% | Multi-step search |

Coding

| Benchmark | Score | Notes |
|---|---|---|
| SWE-Bench Verified | 71.3% | Software engineering |
| SWE-Multilingual | 61.1% | Multi-language coding |
| Multi-SWE-Bench | 41.9% | Multiple repositories |
| LiveCodeBench v6 | 83.1% | Competitive programming |
| Terminal-Bench | 47.1% | Shell automation |

General Capabilities

| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro | 84.6% | Professional knowledge |
| MMLU-Redux | 94.4% | General knowledge |
| Longform Writing | 73.8% | Creative writing |
| HealthBench | 58.0% | Medical knowledge |

Training Approach

Base Architecture

  • Kimi K2 Thinking foundation
  • Mixture of Experts (MoE) components
  • Extended thinking token support
  • Multi-modal reasoning capabilities

Zen Identity Fine-Tuning

  1. Constitutional AI Training: Hanzo AI principles and values
  2. Tool-Calling Specialization: 200-300 step sequences
  3. Thinking Mode Optimization: Extended reasoning patterns
  4. Multi-Agent Workflows: Coordinated task execution

Optimization

  • INT4 quantization-aware training
  • MoE component optimization
  • Context management strategies
  • Parallel trajectory aggregation (Heavy Mode)

Usage Examples

1. Extended Reasoning with Tools

from transformers import AutoModelForCausalLM, AutoTokenizer

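# trust_remote_code is assumed to be required because the base model ships custom model and tokenizer code.
# device_map="auto" (needs the accelerate package) spreads the weights across available devices.
model = AutoModelForCausalLM.from_pretrained("zenlm/zen-max", trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("zenlm/zen-max", trust_remote_code=True)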

# Enable thinking mode with tool access
messages = [
    {
        "role": "user",
        "content": "Research and analyze the latest developments in quantum computing, then write a comprehensive report."
    }
]

# Model will:
# 1. Think about search strategy
# 2. Execute 50+ web searches
# 3. Browse relevant pages
# 4. Synthesize information
# 5. Generate structured report
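# model.chat() and its keyword arguments are not part of the standard transformers API.
# generate() only emits text, so running the searches needs an agent loop that executes tool calls.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128000)  # budget for thinking plus answer
response = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)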

2. Agentic Coding Workflow

# Component-heavy web development
messages = [
    {
        "role": "user",
        "content": "Build a fully functional Word clone with React, including document editing, formatting, and export features."
    }
]

# Model will:
# 1. Plan component architecture
# 2. Generate HTML/React code
# 3. Implement styling and interactions
# 4. Test and debug iteratively
# 5. Deliver production-ready application
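# Standard transformers call; executing tools is left to the surrounding agent harness.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=96000)
response = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)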

3. Mathematical Problem Solving

# PhD-level mathematics with Python
messages = [
    {
        "role": "user",
        "content": "Solve the hyperbolic space sampling problem involving Lorentz model and Brownian bridge covariance."
    }
]

# Model will:
# 1. Analyze mathematical structure
# 2. Execute Python computations
# 3. Derive closed-form solutions
# 4. Verify results numerically
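# Standard transformers call; a Python sandbox for the model's code must be supplied by the harness.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128000)
response = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)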

4. Heavy Mode (Parallel Reasoning)

# 8 parallel trajectories with reflective aggregation
messages = [
    {
        "role": "user",
        "content": "Comprehensive analysis of climate change solutions across economics, technology, and policy."
    }
]

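# Sample 8 independent trajectories. Aggregating them (reflection, consensus)
# is left to the caller because generate() has no built-in "heavy" mode.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128000, do_sample=True, num_return_sequences=8)
prompt_len = inputs["input_ids"].shape[-1]
responses = tokenizer.batch_decode(output_ids[:, prompt_len:], skip_special_tokens=True)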

Configuration

Thinking Budget

  • Low: 32K thinking tokens (fast responses)
  • Medium: 96K thinking tokens (balanced)
  • High: 128K thinking tokens (complex reasoning)
  • Heavy Mode: 8 × 128K parallel trajectories
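
The budget levels can be captured in a small lookup. A minimal sketch, assuming "K" means 1,000 tokens (consistent with the `128000` used in the usage examples); `total_thinking_tokens` is a hypothetical helper name:

```python
# Budgets from the list above, in tokens (K = 1,000 is an assumption).
THINKING_BUDGETS = {"low": 32_000, "medium": 96_000, "high": 128_000}
HEAVY_TRAJECTORIES = 8

def total_thinking_tokens(level: str, heavy: bool = False) -> int:
    """Upper bound on thinking tokens for one request at the given level."""
    per_trajectory = THINKING_BUDGETS[level]
    return per_trajectory * (HEAVY_TRAJECTORIES if heavy else 1)

print(total_thinking_tokens("medium"))
print(total_thinking_tokens("high", heavy=True))
```

Heavy Mode multiplies the worst-case token spend by eight, so it is worth reserving for problems where a single trajectory is unreliable.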

Tool Configuration

tools = {
    "search": True,          # Web search
    "browser": True,         # Page browsing
    "python": True,          # Code execution
    "bash": True,            # Shell commands
    "file_operations": True, # File I/O
}
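
One way a harness might consume such a dictionary is to filter a registry of callables down to the enabled tools. A minimal sketch: the registry, the stub handlers, and the helper names are hypothetical placeholders rather than a real Zen API, and two tools are switched off here only to show the filtering.

```python
from typing import Callable

tools = {"search": True, "browser": True, "python": True,
         "bash": False, "file_operations": False}

# Hypothetical stub handlers; a real harness would call actual services.
REGISTRY: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[search results for {q!r}]",
    "browser": lambda url: f"[page content of {url}]",
    "python": lambda code: f"[output of running {len(code)} chars of code]",
    "bash": lambda cmd: f"[output of {cmd!r}]",
    "file_operations": lambda path: f"[contents of {path}]",
}

def enabled_tools(config: dict[str, bool]) -> dict[str, Callable[[str], str]]:
    """Keep only the handlers whose flag is set in the config."""
    return {name: fn for name, fn in REGISTRY.items() if config.get(name, False)}

def call_tool(name: str, argument: str, config: dict[str, bool]) -> str:
    """Dispatch one tool call, refusing tools that are disabled or unknown."""
    available = enabled_tools(config)
    if name not in available:
        raise PermissionError(f"tool {name!r} is disabled or unknown")
    return available[name](argument)

print(sorted(enabled_tools(tools)))
print(call_tool("search", "quantum error correction", tools))
```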

Context Management

  • Context Window: 256K tokens
  • Auto-hiding: Tool outputs hidden when exceeding context
  • Smart truncation: Preserves reasoning chain and key results
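
A minimal sketch of how auto-hiding might work. It assumes messages are dicts with `role` and `content`, that the oldest tool outputs are hidden first, and that token counts are approximated by whitespace-separated words; the actual policy and tokenizer are not described above.

```python
HIDDEN = "[tool output hidden]"

def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())

def hide_tool_outputs(messages: list[dict], limit: int) -> list[dict]:
    """Hide the oldest tool outputs until the conversation fits the limit."""
    result = [dict(m) for m in messages]  # copy so the input is untouched
    total = sum(count_tokens(m["content"]) for m in result)
    for m in result:
        if total <= limit:
            break
        if m["role"] == "tool" and m["content"] != HIDDEN:
            total -= count_tokens(m["content"]) - count_tokens(HIDDEN)
            m["content"] = HIDDEN
    return result

history = [
    {"role": "user", "content": "summarize the three candidate papers"},
    {"role": "tool", "content": "word " * 50},
    {"role": "assistant", "content": "<think>the page lists three papers</think>"},
    {"role": "tool", "content": "word " * 10},
]
trimmed = hide_tool_outputs(history, limit=40)
for message in trimmed:
    print(message["role"], count_tokens(message["content"]))
```

User and assistant turns, including the reasoning chain, are never touched in this sketch; only tool outputs are replaced by a short marker.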

Hardware Requirements

Inference (INT4 from HuggingFace)

  • Model Size: ~370GB (62 safetensors shards, INT4 quantized)
  • Minimum: 247GB combined RAM+VRAM+Disk
  • Optimal: 370GB+ RAM+VRAM for 5+ tokens/s
  • Budget Setup: 1x 24GB GPU + 256GB RAM (~1-2 tokens/s)
  • High Performance: 4x A100 80GB or 8x A100 40GB
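
The thresholds above can be turned into a quick fit check. A minimal sketch using only the figures quoted in this section (247GB minimum, 370GB for the optimal tier); `classify_setup` is a hypothetical helper, the 128GB of host RAM in the second call is an assumed value, and the check ignores KV-cache and activation memory.

```python
MIN_COMBINED_GB = 247   # minimum combined RAM + VRAM + disk, from the list above
OPTIMAL_FAST_GB = 370   # RAM + VRAM quoted for the "optimal" tier

def classify_setup(vram_gb: float, ram_gb: float, disk_offload_gb: float = 0) -> str:
    """Classify a machine against the inference tiers quoted above."""
    fast_memory = vram_gb + ram_gb
    if fast_memory >= OPTIMAL_FAST_GB:
        return "optimal"
    if fast_memory + disk_offload_gb >= MIN_COMBINED_GB:
        return "minimum"
    return "insufficient"

print(classify_setup(vram_gb=24, ram_gb=256))      # the budget setup
print(classify_setup(vram_gb=4 * 80, ram_gb=128))  # 4x A100 80GB plus assumed host RAM
```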

Alternative: GGUF Quantizations (Unsloth)

  • 1.66-bit (UD-TQ1_0): 245GB - fits on 247GB combined RAM+VRAM
  • 2.71-bit (UD-Q2_K_XL): 381GB - recommended for accuracy
  • 4.5-bit (UD-Q4_K_XL): 588GB - near full precision

QLoRA Training

  • VRAM: ~500GB total (370GB model + 130GB activations)
  • GPUs: 4x A100 80GB or 8x A100 40GB
  • Training Time: 4-8 hours for 1000 steps
  • Output: LoRA adapters (~100MB)

Format Availability

Current

  • ✅ SafeTensors (BF16, full precision)
  • ✅ INT4 Quantized (native QAT)

Coming Soon

  • 🔄 GGUF quantizations (Q4_K_M, Q5_K_M, Q8_0)
  • 🔄 MLX optimized formats (4-bit, 8-bit for Apple Silicon)
  • 🔄 ONNX export for edge deployment

Special Features

1. Thinking Mode

  • Chain-of-thought reasoning with <think> tags
  • Explicit reasoning traces
  • Up to 128K thinking tokens per step
  • Adaptive depth based on problem complexity

2. Tool-Calling Agent

  • 200-300 sequential tool invocations
  • No human intervention required
  • Dynamic tool selection
  • Error recovery and retry logic
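
The loop below sketches how a harness might enforce the step limit and retry behaviour listed above. Everything here is illustrative: `run_agent`, the `next_action` callback (standing in for the model), and the action format are assumptions, not the Zen Max runtime.

```python
from typing import Callable, Optional

Action = Optional[tuple[str, str]]  # (tool_name, argument), or None when done

def run_agent(next_action: Callable[[list[str]], Action],
              tools: dict[str, Callable[[str], str]],
              max_tool_calls: int = 300,
              max_retries: int = 2) -> list[str]:
    """Run tool calls until the policy stops or the call budget is spent."""
    observations: list[str] = []
    calls = 0
    while calls < max_tool_calls:
        action = next_action(observations)
        if action is None:  # the policy decided it is finished
            break
        name, argument = action
        for attempt in range(max_retries + 1):
            calls += 1  # every attempt, including retries, uses budget
            try:
                observations.append(tools[name](argument))
                break
            except Exception as exc:
                if attempt == max_retries or calls >= max_tool_calls:
                    # give up and report the failure back to the policy
                    observations.append(f"error from {name}: {exc}")
                    break
    return observations

# Demo: a tool that fails once, then succeeds on the retry.
attempts = {"n": 0}

def flaky_fetch(url: str) -> str:
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("timeout")
    return f"ok:{url}"

def policy(observations: list[str]) -> Action:
    return ("fetch", "a") if not observations else None

out = run_agent(policy, {"fetch": flaky_fetch})
print(out, attempts["n"])
```

Counting retries against the same budget keeps the total number of tool invocations bounded even when a tool fails repeatedly.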

3. Parallel Reasoning (Heavy Mode)

  • 8 simultaneous reasoning trajectories
  • Reflective aggregation of outputs
  • Consensus-based answer selection
  • 2-3x accuracy improvement on hard problems
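
Consensus-based selection can be as simple as a majority vote over the final answers of the parallel trajectories. A minimal sketch, assuming each trajectory yields a short final answer string; the reflective aggregation step mentioned above is richer than this and is not reproduced here.

```python
from collections import Counter

def select_by_consensus(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of trajectories agreeing."""
    if not answers:
        raise ValueError("need at least one trajectory answer")
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

rollouts = ["42", "42", "41", "42 ", "42", "40", "42", "41"]  # 8 trajectories
answer, agreement = select_by_consensus(rollouts)
print(answer, agreement)
```

The agreement ratio doubles as a rough confidence signal: a low value suggests the trajectories diverged and the answer deserves another look.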

4. Multi-Modal Extensions

  • Vision-language understanding (future)
  • Audio processing (future)
  • Code → execution → analysis loops

Limitations

  1. Thinking Token Overhead: Extended reasoning increases latency
  2. Tool Call Limits: 300 steps may not suffice for extremely complex tasks
  3. Context Management: Auto-hiding may lose important intermediate results
  4. Quantization: INT4 optimized, but BF16 still preferred for maximum accuracy

Training Data

  • Base Training: Kimi K2 Thinking pre-training corpus
  • Zen Fine-Tuning:
    • Zoo-Gym framework with RAIS technology
    • Constitutional AI alignment data
    • Multi-turn tool-calling trajectories
    • Agentic workflow demonstrations
  • Verification: Human expert validation on HLE, AIME, coding tasks

Citation

@misc{zenmax2025,
  title={Zen Max: Reasoning-First Language Model with Test-Time Scaling},
  author={Hanzo AI and Zoo Labs Foundation},
  year={2025},
  url={https://zenlm.org},
  note={Based on Moonshot AI Kimi K2 Thinking architecture}
}

Acknowledgments

  • Moonshot AI: K2 Thinking architecture and training methodology
  • Hanzo AI: Constitutional AI training and Zen identity
  • Zoo Labs Foundation: Open AI research and community governance


Zen AI: Clarity Through Intelligence
Now with reasoning at test-time
