Troviku-1.1

Model Card

Model Details

Organization: OpenTrouter
Model Type: Autoregressive Transformer Language Model
Model Version: 1.1.0
Release Date: January 15, 2025
Model License: Apache 2.0
Languages: 25+ programming languages
Model Size: 7 billion parameters
Context Length: 8,192 tokens
Base Model: Llama-2-7b-hf

Model Description

Troviku-1.1 is the inaugural model in the Troviku series, a family of large language models specifically engineered for advanced code generation, analysis, and software development tasks. Built on a transformer architecture with 7 billion parameters, the model has been extensively trained on high-quality code repositories, technical documentation, and algorithmic implementations. Troviku-1.1 is aimed at AI-assisted programming and delivers strong performance for its 7-billion-parameter size across multiple programming languages and software engineering paradigms (see Performance below).

Developed by: OpenTrouter Research Team
Funded by: OpenTrouter Inc., with compute support from cloud infrastructure partners
Model Family: Troviku series
Base Architecture: Transformer decoder with multi-head attention
Training Framework: PyTorch 2.1 with DeepSpeed ZeRO-3
Fine-tuning Methods: Supervised fine-tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)

Intended Use

Primary Use Cases:

  • Code generation and autocomplete in IDE environments
  • Algorithm implementation and optimization
  • Code translation between programming languages
  • Debugging and error resolution assistance
  • Technical documentation generation
  • Code review and quality assessment
  • Test case generation and validation
  • Educational programming assistance

Intended Users:

  • Professional software developers and engineers
  • Computer science students and educators
  • DevOps and infrastructure engineers
  • Data scientists and ML engineers
  • Open-source contributors
  • Technical writers and documentation specialists

Out-of-Scope Uses:

  • Generating malicious code, exploits, or malware
  • Creating code for illegal activities or bypassing security measures
  • Production-critical systems without human review and testing
  • Medical diagnosis or treatment recommendation systems
  • Legal document generation or legal advice
  • Financial trading algorithms without regulatory compliance review
  • Autonomous systems where failures could cause physical harm

Training Data

Data Sources

The model was trained on a carefully curated dataset comprising:

  1. The Stack v2 (50% of training data)

    • Source: bigcode/the-stack-v2
    • Permissively licensed source code from GitHub
    • 3.8 million repositories across 600+ programming languages
    • Focus on top 25 languages with quality filtering
    • License: MIT, Apache 2.0, BSD-3-Clause
  2. GitHub Code Dataset (30% of training data)

    • Source: codeparrot/github-code
    • Curated code snippets and functions
    • High-quality repositories with active maintenance
    • Filtered for code quality and documentation
    • License: Multiple open-source licenses
  3. Technical Documentation (10% of training data)

    • Official language documentation (Python, JavaScript, Java, C++, etc.)
    • API references and SDK documentation
    • Framework and library documentation
    • License: CC BY 4.0, MIT, Apache 2.0
  4. Benchmark Datasets (5% of training data)

    • HumanEval: openai/humaneval
    • MBPP: google-research-datasets/mbpp
    • CodeContests: deepmind/code_contests
    • License: MIT, Apache 2.0
  5. Educational Content (5% of training data)

    • Programming tutorials and guides
    • Algorithm explanations and implementations
    • Stack Overflow posts under CC BY-SA 4.0
    • License: CC BY-SA 4.0

Total Training Tokens: 500 billion tokens
Training Duration: 45 days on 512 NVIDIA A100 GPUs
Dataset Size: Approximately 2.3 TB of text data
Languages Covered: Python, JavaScript, TypeScript, Java, C, C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, R, SQL, HTML, CSS, Bash, PowerShell, Lua, Perl, Haskell, Julia, MATLAB

Data Preprocessing

Quality Filtering:

  • Removed repositories with fewer than 10 stars or with no activity in the past two years
  • Filtered out code with syntax errors or poor quality metrics
  • Removed duplicates and near-duplicates using MinHash LSH (see the sketch after this list)
  • Excluded code containing profanity, hate speech, or toxic content
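
The near-duplicate removal step above is typically done by indexing MinHash signatures in a locality-sensitive hash table. Below is a minimal sketch using the datasketch library; the actual pipeline is not published, so the 0.85 similarity threshold and the toy corpus are assumptions for illustration only.

from datasketch import MinHash, MinHashLSH

def signature(tokens, num_perm=128):
    """Build a MinHash signature from a token sequence."""
    m = MinHash(num_perm=num_perm)
    for tok in tokens:
        m.update(tok.encode("utf-8"))
    return m

# Toy corpus standing in for source files; b.py duplicates a.py.
corpus = [
    ("a.py", "def add(a, b): return a + b"),
    ("b.py", "def add(a, b): return a + b"),
    ("c.py", "print('hello world')"),
]

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # Jaccard threshold (assumed value)
kept = []
for path, code in corpus:
    sig = signature(code.split())
    if lsh.query(sig):          # a (near-)duplicate is already indexed, so drop this file
        continue
    lsh.insert(path, sig)
    kept.append(path)

print(kept)  # -> ['a.py', 'c.py']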

Privacy Protection:

  • Scanned for and removed personally identifiable information (PII)
  • Filtered out API keys, passwords, and credentials
  • Removed private email addresses and phone numbers
  • Excluded internal company code and proprietary information
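
As a rough illustration of this kind of scrubbing: the production filters are not published, the patterns below are simplified assumptions, and real pipelines rely on dedicated secret scanners (e.g. detect-secrets) with far larger rule sets.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")  # AWS access key id pattern
SECRET = re.compile(r"(?i)(api[_-]?key|password|token)\s*[:=]\s*['\"][^'\"]+['\"]")

def scrub(text: str) -> str:
    """Replace emails and credential-like strings with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = AWS_KEY.sub("<AWS_KEY>", text)
    text = SECRET.sub(r"\1=<REDACTED>", text)
    return text

print(scrub('db_password = "hunter2"  # contact: dev@example.com'))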

License Compliance:

  • Verified all source code adheres to permissive open-source licenses
  • Excluded GPL and other copyleft-licensed code to prevent license contamination
  • Maintained attribution records for all training sources
  • Regular audits to ensure compliance with license terms
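
In practice this usually amounts to filtering repositories against an SPDX license allowlist. A minimal sketch follows; the exact allowlist used for training is not published, so the set below is an assumption based on the licenses named above.

ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "cc-by-4.0"}

def is_permissive(spdx_id: str) -> bool:
    """Keep only permissively licensed sources; copyleft (e.g. GPL) is excluded."""
    return spdx_id.lower() in ALLOWED_LICENSES

print(is_permissive("Apache-2.0"))  # True
print(is_permissive("GPL-3.0"))     # False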

Bias Mitigation:

  • Balanced representation across programming languages
  • Included code from diverse geographic regions and communities
  • Filtered out code with discriminatory variable names or comments
  • Ensured representation of different coding styles and paradigms

Training Procedure

Phase 1: Pretraining (35 days)

  • Objective: Causal language modeling on code corpus
  • Batch size: 4 million tokens
  • Learning rate: 3e-4 with cosine decay
  • Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8)
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Mixed precision: bfloat16
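
The optimizer settings above map directly onto a standard PyTorch/transformers setup. A minimal sketch is shown below; the tiny stand-in model, the warmup step count, and the total step count are placeholders, not the values used in training.

import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(4096, 4096)  # stand-in for the 7B model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=100_000,  # placeholder schedule
)

# One illustrative optimization step with a dummy loss (bf16 mixed precision omitted)
loss = model(torch.randn(8, 4096)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()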

Phase 2: Supervised Fine-tuning (7 days)

  • Dataset: 150,000 high-quality code examples with human annotations
  • Focus areas: Code quality, security, best practices
  • Task types: Generation, completion, translation, debugging
  • Evaluation: Held-out validation set with expert review

Phase 3: RLHF (3 days)

  • Reward model trained on 50,000 human preference comparisons
  • PPO optimization with KL penalty (β=0.01)
  • Focus: Code correctness, safety, and alignment with user intent
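
The KL penalty keeps the PPO policy close to the supervised fine-tuned model: the shaped reward is roughly r_total = r - beta * (log pi(y|x) - log pi_ref(y|x)). A minimal sketch of that shaping, for illustration only (not the training code):

def kl_shaped_reward(reward, logprob_policy, logprob_ref, beta=0.01):
    """Penalize divergence from the reference (SFT) policy during PPO."""
    kl = logprob_policy - logprob_ref  # per-sequence log-probability difference
    return reward - beta * kl

# e.g. a response the reward model likes (+1.2) but that drifts from the SFT policy
print(kl_shaped_reward(reward=1.2, logprob_policy=-40.0, logprob_ref=-55.0))  # 1.05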

Performance

Benchmark Results

Benchmark    | Dataset                 | Metric  | Score
HumanEval    | openai/humaneval        | pass@1  | 72.0%
HumanEval    | openai/humaneval        | pass@10 | 89.0%
MBPP         | mbpp                    | pass@1  | 68.0%
MBPP         | mbpp                    | pass@10 | 84.0%
CodeContests | deepmind/code_contests  | pass@1  | 45.0%
MultiPL-E    | Python                  | pass@1  | 72.0%
MultiPL-E    | JavaScript              | pass@1  | 68.0%
MultiPL-E    | Java                    | pass@1  | 65.0%
MultiPL-E    | C++                     | pass@1  | 61.0%
DS-1000      | Data Science            | pass@1  | 58.0%
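
pass@k in the tables above and below is the standard functional-correctness metric, conventionally computed with the unbiased estimator from the HumanEval paper (n samples per problem, c of them passing the tests). The sample counts in the sketch below are illustrative, not the ones used for this card.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n generations, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=200, c=144, k=1), 2))  # 0.72, i.e. a 72.0% pass@1
print(pass_at_k(n=200, c=144, k=10))           # much higher: any of 10 samples may pass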

Performance by Language

Language   | Pass@1 | Pass@10 | Notes
Python     | 72.0%  | 88.0%   | Strongest performance
JavaScript | 68.0%  | 85.0%   | Web development focused
TypeScript | 67.0%  | 84.0%   | Type-safe JS variant
Java       | 65.0%  | 82.0%   | Enterprise applications
C++        | 61.0%  | 78.0%   | System programming
Rust       | 58.0%  | 75.0%   | Memory safety focused
Go         | 64.0%  | 80.0%   | Concurrent programming
Ruby       | 59.0%  | 74.0%   | Web frameworks
PHP        | 60.0%  | 76.0%   | Web development
Swift      | 56.0%  | 72.0%   | iOS development

Comparison to Other Models

Model             | HumanEval Pass@1 | MBPP Pass@1 | Parameters
GPT-4-turbo       | 84.0%            | 80.0%       | Unknown
Claude-3.5-Sonnet | 82.0%            | 78.0%       | Unknown
Troviku-1.1       | 72.0%            | 68.0%       | 7B
CodeLlama-34B     | 68.0%            | 62.0%       | 34B
StarCoder2-15B    | 66.0%            | 60.0%       | 15B
WizardCoder-15B   | 64.0%            | 58.0%       | 15B

Quick Start

Installation

pip install troviku-client transformers torch

Using Transformers Library

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Complete a function body from a short prompt
prompt = "def calculate_fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)  # cap the generated continuation
code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)

Using Troviku Client

from troviku_client import TrovikuClient, Language

client = TrovikuClient(api_key="your_api_key")

response = client.generate(
    prompt="Create a binary search tree implementation with insert and search methods",
    language=Language.PYTHON,
    max_tokens=1024
)

print(response.code)

API Integration

import requests

url = "https://api.opentrouter.ai/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "OpenTrouter/Troviku-1.1",
    "messages": [
        {"role": "user", "content": "Write a function to calculate Fibonacci numbers"}
    ],
    "temperature": 0.7
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Model Architecture

Architecture Type: Transformer Decoder
Number of Layers: 32
Hidden Size: 4096
Attention Heads: 32
Key-Value Heads: 8 (Grouped Query Attention)
Intermediate Size: 14336
Activation Function: SiLU (Swish)
Vocabulary Size: 32,768 tokens
Positional Encoding: RoPE (Rotary Position Embedding)
Normalization: RMSNorm
Precision: bfloat16
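
Since the card lists Llama-2-7b-hf as the base model, the specification above corresponds to a Llama-style configuration in transformers. The sketch below is an illustration of the listed hyperparameters, not necessarily the shipped config.json; the RMSNorm epsilon is an assumed value.

from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32_768,
    hidden_size=4096,
    intermediate_size=14_336,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,          # grouped-query attention
    hidden_act="silu",
    max_position_embeddings=8_192,  # 8,192-token context window
    rms_norm_eps=1e-5,              # assumed value
)
print(config)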

Hardware Requirements

Minimum Requirements

  • GPU: 16GB VRAM (e.g., NVIDIA RTX 4080, RTX A4000)
  • RAM: 32GB system memory
  • Storage: 20GB for model weights

Recommended Requirements

  • GPU: 24GB+ VRAM (e.g., NVIDIA A100, RTX 6000 Ada)
  • RAM: 64GB system memory
  • Storage: 50GB for model, cache, and datasets

Quantization Support

  • int8: 8GB VRAM, 2x faster inference
  • int4: 4GB VRAM, 4x faster inference
  • GPTQ: Optimized 4-bit quantization
  • AWQ: Activation-aware quantization
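
Assuming the published weights are standard transformers checkpoints, 4-bit loading can be done with the bitsandbytes integration via BitsAndBytesConfig, as sketched below (load_in_8bit=True covers the int8 case; GPTQ and AWQ variants would ship as separately quantized checkpoints).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
    bnb_4bit_quant_type="nf4",
)

model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # requires the bitsandbytes package
    device_map="auto",
)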

Limitations

Technical Limitations

  • Context window limited to 8,192 tokens
  • May generate syntactically correct but logically flawed code
  • Performance degrades on very specialized or proprietary frameworks
  • Limited understanding of complex multi-file codebases
  • May not always follow organization-specific coding standards

Language-Specific Limitations

  • Stronger performance on popular languages (Python, JavaScript, Java)
  • Weaker performance on rare or legacy languages
  • Limited knowledge of cutting-edge language features released after training cutoff
  • May struggle with highly domain-specific DSLs

Safety Considerations

  • Generated code should always be reviewed by experienced developers
  • Security-critical code requires thorough security audits
  • May inadvertently suggest vulnerable code patterns
  • Not suitable for safety-critical systems without extensive testing

Bias Considerations

  • May reflect biases present in training data (e.g., over-representation of certain coding styles)
  • Training data predominantly from English-language repositories
  • Potential underrepresentation of non-Western coding conventions
  • May perpetuate historical biases in variable naming and comments

Ethical Considerations

Environmental Impact

  • Training Emissions: Approximately 25 tons CO2 equivalent
  • Mitigation: Used renewable energy data centers, carbon offset programs
  • Inference Efficiency: Optimized for low-latency, energy-efficient deployment

Attribution and Licensing

  • All training data sourced from permissively licensed repositories
  • Respects original authors' licensing terms
  • Provides attribution capabilities in generated code comments
  • Excludes copyleft-licensed code to prevent license contamination

Dual-Use Concerns

The model could potentially be misused for:

  • Generating malicious code or exploits
  • Automating spam or phishing campaigns
  • Creating code to circumvent security measures

Mitigation Strategies:

  • Refusal training for malicious code generation requests
  • Usage monitoring and rate limiting
  • Terms of service enforcement
  • Community reporting mechanisms
  • Collaboration with security researchers

License

This model is released under the Apache License 2.0.

License Terms Summary

  • Commercial Use: Permitted
  • Modification: Permitted
  • Distribution: Permitted
  • Patent Use: Permitted
  • Private Use: Permitted

Conditions:

  • License and copyright notice must be included
  • State changes made to the code
  • Provide attribution to original authors

Limitations:

  • No trademark use
  • No liability or warranty

See the LICENSE file for full details.

Citation

If you use Troviku-1.1 in your research or projects, please cite:

@misc{troviku2025,
  title={Troviku-1.1: A Specialized Code Generation Model},
  author={OpenTrouter Research Team},
  year={2025},
  publisher={OpenTrouter},
  howpublished={\url{https://github.com/OpenTrouter/Troviku-1.1}},
  note={Apache License 2.0}
}

Support and Community

Acknowledgments

The Troviku team acknowledges:

  • The open-source community for providing training data
  • BigCode project for The Stack v2 dataset
  • Hugging Face for infrastructure and hosting
  • NVIDIA for compute support
  • All contributors who helped with model evaluation and testing

Version History

v1.1.0 (Current - January 15, 2025)

  • Initial release of the Troviku series
  • Support for 25+ programming languages
  • Optimized inference performance
  • Enhanced code quality and safety features
  • RLHF alignment for improved code generation

Upcoming Features (v1.2.0)

  • Extended context window to 16,384 tokens
  • Improved multi-file code understanding
  • Enhanced support for rare programming languages
  • Better handling of code comments and documentation
  • Integration with popular IDEs