ARIES-RLVR: Advanced Reasoning Model

A Qwen3-4B model enhanced with Tree-of-Thoughts (ToT) reasoning and RLVR-inspired training techniques for improved multi-step problem solving.

🎯 Overview

This model extends the base Qwen3-4B with advanced reasoning capabilities through a custom training pipeline that combines tree-based exploration with reward-guided learning. The goal is to improve performance on complex reasoning tasks that require multi-step thinking and logical deduction.

🧠 Training Methodology

Core Components

1. Tree-of-Thoughts (ToT) Generation

  • Multi-branch exploration of reasoning paths
  • Adaptive depth based on problem complexity
  • Parallel evaluation of alternative solutions
  • Best-path selection using learned heuristics
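
The exploration code itself is not published; the sketch below is a minimal illustration of the loop described above, where propose_thoughts (the model's branch generator) and score_path (the learned heuristic) are hypothetical stand-ins supplied by the caller:

def tot_search(problem, propose_thoughts, score_path, max_depth=3, branching=3):
    # Each frontier entry is a partial reasoning chain (a list of thought strings)
    frontier = [[]]
    best_path, best_score = None, float("-inf")
    for _ in range(max_depth):
        # Expand every partial chain into several alternative next thoughts
        candidates = [path + [thought]
                      for path in frontier
                      for thought in propose_thoughts(problem, path, n=branching)]
        if not candidates:
            break
        # Score candidates with the learned heuristic and keep the strongest ones
        candidates.sort(key=score_path, reverse=True)
        if score_path(candidates[0]) > best_score:
            best_path, best_score = candidates[0], score_path(candidates[0])
        frontier = candidates[:branching]  # beam-style pruning
    return best_path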

2. Multi-Agent Policy System

Four specialized reasoning agents work in parallel:

  • Conservative Agent: Focuses on proven, reliable reasoning patterns
  • Exploratory Agent: Seeks novel solution approaches
  • Balanced Agent: Optimizes between exploration and exploitation
  • Reflective Agent: Validates and self-corrects reasoning chains
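
The agent implementation is not released; one plausible reading, sketched below, is that each "agent" is the same model sampled under a different policy configuration, with a scorer selecting the best candidate (all names and values here are illustrative assumptions):

from dataclasses import dataclass

@dataclass
class AgentPolicy:
    name: str
    temperature: float  # how freely the agent explores
    top_p: float

POLICIES = [
    AgentPolicy("conservative", temperature=0.3, top_p=0.8),
    AgentPolicy("exploratory",  temperature=1.0, top_p=0.95),
    AgentPolicy("balanced",     temperature=0.7, top_p=0.9),
    AgentPolicy("reflective",   temperature=0.5, top_p=0.9),
]

def multi_agent_generate(generate_fn, score_fn, prompt):
    # Each policy proposes one reasoning chain; the highest-scoring one wins.
    # A reflective pass could then validate or revise the winning chain.
    candidates = [generate_fn(prompt, p.temperature, p.top_p) for p in POLICIES]
    return max(candidates, key=score_fn)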

3. Adaptive Complexity Scaling

  • Automatically adjusts search depth (1-3 levels) based on problem difficulty
  • Dynamic branching factor (2-5 branches)
  • Early stopping when high-confidence solutions are found
  • Resource allocation proportional to problem complexity
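
A minimal sketch of how such a budget might be derived from a difficulty estimate in [0, 1] (the mapping and the confidence threshold are illustrative, not the trained values):

def search_budget(difficulty):
    # Harder problems get deeper, wider searches
    depth = 1 + round(2 * difficulty)      # 1-3 levels
    branching = 2 + round(3 * difficulty)  # 2-5 branches
    return depth, branching

def should_stop_early(confidence, threshold=0.95):
    # Stop exploring once a high-confidence solution has been found
    return confidence >= threshold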

4. Hybrid Reward System

Multi-objective reward function combining:

  • Correctness: Answer accuracy
  • Format Quality: Proper structure and presentation
  • Semantic Coherence: Logical consistency in reasoning steps
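
For concreteness, a weighted combination of the three signals could look like the sketch below (the weights are illustrative assumptions, not the values used in training):

def hybrid_reward(is_correct, format_score, coherence_score,
                  w_correct=0.6, w_format=0.2, w_coherence=0.2):
    # is_correct: bool; format_score and coherence_score: floats in [0, 1]
    return (w_correct * float(is_correct)
            + w_format * format_score
            + w_coherence * coherence_score)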

Training Process

  1. Problem Analysis: Difficulty assessment and strategy selection
  2. ToT Exploration: Multi-path reasoning with adaptive depth
  3. Path Evaluation: Quality scoring of reasoning chains
  4. Supervised Fine-Tuning: Training on high-quality reasoning examples
  5. Reward Optimization: RLVR-style reinforcement of effective patterns
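
Putting the five steps together, one iteration of the pipeline might look like the following sketch; every helper is a hypothetical callable supplied by the training harness, and the 0.8 quality threshold is illustrative:

def training_iteration(problem, analyze, explore, evaluate, sft_update, rl_update):
    difficulty = analyze(problem)                        # 1. problem analysis
    paths = explore(problem, difficulty)                 # 2. ToT exploration, budget scaled by difficulty
    scored = [(path, evaluate(path)) for path in paths]  # 3. path evaluation
    high_quality = [path for path, reward in scored if reward > 0.8]
    sft_update(high_quality)                             # 4. supervised fine-tuning on good chains
    rl_update(scored)                                    # 5. RLVR-style reward optimization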

📊 Training Data

Trained on diverse reasoning tasks from multiple domains:

  • HuggingFaceH4/MATH-500: Mathematical problem solving
  • openbmb/RLPR-Train-Dataset: Reward-based reasoning patterns
  • ai2lumos/lumos_multimodal_ground_iterative: Multi-step reasoning
  • SAGI-1/reasoningData_200k: General reasoning tasks

Data selection emphasizes:

  • High variance in difficulty
  • Multiple reasoning types (math, logic, pattern recognition)
  • Problems requiring multi-step solutions

🚀 Usage

Basic Generation

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in bfloat16 and place it automatically across available devices
model = AutoModelForCausalLM.from_pretrained(
    "ziadrone/airesupdated-v3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "ziadrone/airesupdated-v3",
    trust_remote_code=True
)

prompt = "Solve step by step: If 5 machines take 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?"

# Tokenize on the model's device and sample a reasoning trace
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    top_p=0.9
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Recommended Settings

For best reasoning performance:

  • Temperature: 0.6-0.8 (balance between creativity and consistency)
  • Max tokens: 256-512 (allow space for detailed reasoning)
  • Top-p: 0.9 (diverse but focused generation)
  • Do sample: True (enable exploration of reasoning paths)
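
Applied to the snippet above (reusing model, tokenizer, and inputs from the Basic Generation example), these settings look like:

generation_kwargs = dict(
    max_new_tokens=512,  # room for detailed reasoning
    temperature=0.7,     # middle of the 0.6-0.8 range
    top_p=0.9,
    do_sample=True,
)
outputs = model.generate(**inputs, **generation_kwargs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))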

🎓 Key Features

Strengths

  • Complex reasoning: Excels at multi-step problems requiring decomposition
  • Logical deduction: Strong performance on syllogistic reasoning
  • Pattern recognition: Effective at identifying and extending sequences
  • Self-correction: Able to validate and revise reasoning chains

Intended Use Cases

  • Mathematical problem solving
  • Logical reasoning and deduction
  • Pattern recognition and completion
  • Multi-step planning tasks
  • Educational applications requiring detailed explanations

Limitations

  • Still under active evaluation
  • Performance varies with problem domain
  • May occasionally over-explain simple problems
  • Reasoning quality depends on prompt clarity

⚙️ Technical Details

Model Specifications

  • Base Model: Qwen3-4B
  • Parameters: ~4 billion
  • Precision: BFloat16
  • Context Length: Inherited from base (typically 32K tokens)
  • Architecture: Decoder-only transformer

Training Infrastructure

  • Hardware: NVIDIA RTX A6000 (48GB VRAM)
  • Framework: PyTorch + Transformers
  • Optimization: AdamW with gradient accumulation
  • Mixed Precision: Enabled for memory efficiency
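
The training scripts are not released; the following is a generic sketch of AdamW with gradient accumulation under bf16 autocast, as described above (the learning rate, accumulation steps, and the model/dataloader objects are assumptions):

import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)
accum_steps = 8  # effective batch size = per-device batch * accum_steps

for step, batch in enumerate(dataloader):
    # Mixed-precision forward/backward pass for memory efficiency
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()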

Training Configuration

  • Adaptive temperature (0.6-1.0) during exploration
  • Multi-episode training per sample
  • Gradient checkpointing for memory optimization
  • Checkpoint saving at regular intervals
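
A sketch of the first and last items (the linear temperature schedule and the save interval are illustrative; gradient_checkpointing_enable() is the standard Transformers call for the third item):

def exploration_temperature(difficulty, low=0.6, high=1.0):
    # Scale sampling temperature with estimated difficulty during exploration
    return low + (high - low) * difficulty

def maybe_save_checkpoint(model, global_step, every=500, out_dir="checkpoints"):
    # Persist the model at regular intervals
    if global_step % every == 0:
        model.save_pretrained(f"{out_dir}/step-{global_step}")

# Enabled once before training to trade compute for memory:
# model.gradient_checkpointing_enable()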

🔬 Research Background

This model implements techniques inspired by:

  • ARIES (Autonomous Reasoning with Interactive Thought Graphs): Multi-path reasoning exploration
  • RLVR (Reinforcement Learning with Verifiable Rewards): Reward-guided training on reasoning tasks
  • Tree-of-Thoughts: Systematic exploration of reasoning chains

The training methodology represents an experimental approach to improving reasoning capabilities in language models through structured exploration and reward-based learning.

📋 Evaluation

Currently under active evaluation across multiple benchmarks including:

  • Mathematical reasoning (MATH, GSM8K)
  • Logical reasoning (ARC)
  • Commonsense reasoning (HellaSwag)
  • Multi-step problem solving

Results will be updated as evaluation completes.

🔄 Version History

  • v3: Current version with improved ToT exploration and multi-agent policies
  • Active development - check back for updates

⚠️ Important Notes

Development Status

This model is part of ongoing research into reasoning enhancement techniques. Performance characteristics are still being evaluated and may be updated as testing continues.

Reproducibility

The training pipeline is designed to be reproducible. Similar improvements have been observed across multiple training runs, indicating stable methodology.

Comparison with Base

This model is fine-tuned specifically for reasoning tasks. For general-purpose use cases, the base Qwen3-4B may be more appropriate.

📄 License

Apache 2.0 - Inherits license from base Qwen3-4B model.

🙏 Acknowledgments

  • Qwen Team at Alibaba Cloud for the excellent base model
  • ARIES and RLVR research teams for methodological inspiration
  • Open-source reasoning datasets used in training

📧 Contact & Feedback

This is an experimental research model. Feedback on reasoning quality and performance is welcome and will help guide future improvements.


Note: This model is optimized for reasoning tasks. For simple queries or general conversation, standard Qwen3-4B may be more efficient. Performance is best on problems that benefit from structured, multi-step reasoning.

Evaluations, including comparisons against the base model, are in progress.
