Troviku-1.1
Model Card
Model Details
Organization: OpenTrouter
Model Type: Autoregressive Transformer Language Model
Model Version: 1.1.0
Release Date: January 15, 2025
Model License: Apache 2.0
Languages: Multi-language (25+ programming languages)
Model Size: 7 billion parameters
Context Length: 8,192 tokens
Base Model: Llama-2-7b-hf
Model Description
Troviku-1.1 is the inaugural model in the Troviku series, a family of large language models engineered for code generation, analysis, and software development tasks. Built on a transformer architecture with 7 billion parameters, the model was trained extensively on high-quality code repositories, technical documentation, and algorithmic implementations. Troviku-1.1 offers strong performance for its parameter count across multiple programming languages and software engineering paradigms.
Developed by: OpenTrouter Research Team
Funded by: OpenTrouter Inc., with compute support from cloud infrastructure partners
Model Family: Troviku series
Base Architecture: Transformer decoder with multi-head attention
Training Framework: PyTorch 2.1 with DeepSpeed ZeRO-3
Fine-tuning Methods: Supervised fine-tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)
Intended Use
Primary Use Cases:
- Code generation and autocomplete in IDE environments
- Algorithm implementation and optimization
- Code translation between programming languages
- Debugging and error resolution assistance
- Technical documentation generation
- Code review and quality assessment
- Test case generation and validation
- Educational programming assistance
Intended Users:
- Professional software developers and engineers
- Computer science students and educators
- DevOps and infrastructure engineers
- Data scientists and ML engineers
- Open-source contributors
- Technical writers and documentation specialists
Out-of-Scope Uses:
- Generating malicious code, exploits, or malware
- Creating code for illegal activities or bypassing security measures
- Production-critical systems without human review and testing
- Medical diagnosis or treatment recommendation systems
- Legal document generation or legal advice
- Financial trading algorithms without regulatory compliance review
- Autonomous systems where failures could cause physical harm
Training Data
Data Sources
The model was trained on a carefully curated dataset comprising:
The Stack v2 (50% of training data)
- Source: bigcode/the-stack-v2
- Permissively licensed source code from GitHub
- 3.8 million repositories across 600+ programming languages
- Focus on top 25 languages with quality filtering
- License: MIT, Apache 2.0, BSD-3-Clause
GitHub Code Dataset (30% of training data)
- Source: codeparrot/github-code
- Curated code snippets and functions
- High-quality repositories with active maintenance
- Filtered for code quality and documentation
- License: Multiple open-source licenses
Technical Documentation (10% of training data)
- Official language documentation (Python, JavaScript, Java, C++, etc.)
- API references and SDK documentation
- Framework and library documentation
- License: CC BY 4.0, MIT, Apache 2.0
Benchmark Datasets (5% of training data)
- HumanEval: openai/humaneval
- MBPP: google-research-datasets/mbpp
- CodeContests: deepmind/code_contests
- License: MIT, Apache 2.0
Educational Content (5% of training data)
- Programming tutorials and guides
- Algorithm explanations and implementations
- Stack Overflow posts under CC BY-SA 4.0
- License: CC BY-SA 4.0
Total Training Tokens: 500 billion tokens
Training Duration: 45 days on 512 NVIDIA A100 GPUs
Dataset Size: Approximately 2.3 TB of text data
Languages Covered: Python, JavaScript, TypeScript, Java, C, C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, R, SQL, HTML, CSS, Bash, PowerShell, Lua, Perl, Haskell, Julia, MATLAB
Data Preprocessing
Quality Filtering:
- Removed repositories with fewer than 10 stars or inactive for over 2 years
- Filtered out code with syntax errors or poor quality metrics
- Removed duplicates and near-duplicates using MinHash LSH (see the sketch after this list)
- Excluded code containing profanity, hate speech, or toxic content
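As a rough illustration of the near-duplicate removal step referenced above, the sketch below uses the open-source datasketch library. The 0.85 similarity threshold and whitespace tokenization are assumptions for illustration, not the pipeline's actual settings.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace-tokenized source code."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

# Files whose estimated Jaccard similarity exceeds the threshold count as near-duplicates
lsh = MinHashLSH(threshold=0.85, num_perm=128)

files = {
    "a.py": "def add(a, b):\n    return a + b",
    "b.py": "def add(x, y):\n    return x + y",
}
for path, source in files.items():
    sig = minhash(source)
    if lsh.query(sig):      # an indexed near-duplicate already exists
        continue            # drop this file
    lsh.insert(path, sig)   # otherwise keep it and index its signature
```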
Privacy Protection:
- Scanned for and removed personally identifiable information (PII)
- Filtered out API keys, passwords, and credentials (a pattern-based sketch follows this list)
- Removed private email addresses and phone numbers
- Excluded internal company code and proprietary information
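A minimal sketch of the kind of pattern-based screening described above. The regular expressions here are illustrative placeholders; a production filter would combine a much larger rule set with entropy-based secret detection.

```python
import re

# Illustrative patterns only; not the actual filters used for Troviku-1.1.
PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),                     # email addresses
    re.compile(r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*['\"][^'\"]+['\"]"),  # hard-coded credentials
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                                                # AWS access key IDs
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),                                               # phone-number-like strings
]

def contains_sensitive_data(source: str) -> bool:
    """Return True if any pattern matches, so the file can be dropped or redacted."""
    return any(p.search(source) for p in PATTERNS)
```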
License Compliance:
- Verified all source code adheres to permissive open-source licenses
- Excluded GPL and other copyleft-licensed code to prevent license contamination
- Maintained attribution records for all training sources
- Regular audits to ensure compliance with license terms
Bias Mitigation:
- Balanced representation across programming languages
- Included code from diverse geographic regions and communities
- Filtered out code with discriminatory variable names or comments
- Ensured representation of different coding styles and paradigms
Training Procedure
Phase 1: Pretraining (35 days)
- Objective: Causal language modeling on code corpus
- Batch size: 4 million tokens per batch
- Learning rate: 3e-4 with cosine decay (an optimizer configuration sketch follows this list)
- Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8)
- Weight decay: 0.1
- Gradient clipping: 1.0
- Mixed precision: bfloat16
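The Phase 1 hyperparameters map directly onto a standard PyTorch/transformers setup. The sketch below is illustrative only: the placeholder module, warmup steps, and total step count are assumptions, since they are not reported above.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the 7B transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,             # peak learning rate
    betas=(0.9, 0.95),   # β1, β2
    eps=1e-8,
    weight_decay=0.1,
)
# Cosine learning-rate decay; warmup and total step counts are assumed for illustration
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=120_000
)
# Each optimizer step: forward/backward under bfloat16 autocast, then clip gradients to 1.0, e.g.
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```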
Phase 2: Supervised Fine-tuning (7 days)
- Dataset: 150,000 high-quality code examples with human annotations (an illustrative record format follows this list)
- Focus areas: Code quality, security, best practices
- Task types: Generation, completion, translation, debugging
- Evaluation: Held-out validation set with expert review
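The annotation schema for these examples is not published. Purely as an illustration, a single SFT record for a debugging task might look like the following; every field name here is hypothetical.

```python
# Hypothetical record layout; the actual annotation format is not documented in this card.
sft_example = {
    "task_type": "debugging",
    "language": "python",
    "prompt": "Fix the off-by-one error:\nfor i in range(1, len(items)):\n    print(items[i])",
    "completion": "for i in range(len(items)):\n    print(items[i])",
    "annotations": {"security_reviewed": True, "style_score": 4},
}
```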
Phase 3: RLHF (3 days)
- Reward model trained on 50,000 human preference comparisons
- PPO optimization with KL penalty (β=0.01); a sketch of the penalized reward follows this list
- Focus: Code correctness, safety, and alignment with user intent
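In this style of RLHF, the quantity optimized by PPO is typically the reward-model score minus a KL term that keeps the policy close to the SFT reference. The sketch below uses β = 0.01 as listed above; the sequence-level formulation is an assumption, since the card does not specify token- versus sequence-level penalties.

```python
import torch

def kl_penalized_reward(rm_scores, policy_logprobs, ref_logprobs, beta=0.01):
    """Combine reward-model scores with a KL penalty toward the SFT reference.
    rm_scores:        (batch,) scalar scores from the reward model
    policy_logprobs:  (batch, seq_len) log-probs of the sampled tokens under the policy
    ref_logprobs:     (batch, seq_len) log-probs of the same tokens under the reference
    """
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)  # per-sequence KL estimate
    return rm_scores - beta * kl

# Toy shapes: batch of 2 sequences, 5 tokens each
scores = torch.tensor([1.2, 0.4])
pol = torch.full((2, 5), -1.0)   # stand-in token log-probs under the policy
ref = torch.full((2, 5), -1.2)   # stand-in token log-probs under the reference
print(kl_penalized_reward(scores, pol, ref))  # tensor([1.1900, 0.3900])
```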
Performance
Benchmark Results
| Benchmark | Dataset / Subset | Metric | Score |
|---|---|---|---|
| HumanEval | openai/humaneval | pass@1 | 72.0% |
| HumanEval | openai/humaneval | pass@10 | 89.0% |
| MBPP | google-research-datasets/mbpp | pass@1 | 68.0% |
| MBPP | google-research-datasets/mbpp | pass@10 | 84.0% |
| CodeContests | deepmind/code_contests | pass@1 | 45.0% |
| MultiPL-E | Python | pass@1 | 72.0% |
| MultiPL-E | JavaScript | pass@1 | 68.0% |
| MultiPL-E | Java | pass@1 | 65.0% |
| MultiPL-E | C++ | pass@1 | 61.0% |
| DS-1000 | Data Science | pass@1 | 58.0% |
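For reference, pass@k is conventionally reported with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). The card does not state its exact evaluation harness, so the sketch below is the standard formulation rather than the card's own evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples per problem, c of which pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one problem, 150 of them passing
print(pass_at_k(200, 150, 1), pass_at_k(200, 150, 10))  # ≈ 0.75 and ≈ 1.0
```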
Performance by Language
| Language | Pass@1 | Pass@10 | Notes |
|---|---|---|---|
| Python | 72.0% | 88.0% | Strongest performance |
| JavaScript | 68.0% | 85.0% | Web development focused |
| TypeScript | 67.0% | 84.0% | Type-safe JS variant |
| Java | 65.0% | 82.0% | Enterprise applications |
| C++ | 61.0% | 78.0% | System programming |
| Rust | 58.0% | 75.0% | Memory safety focused |
| Go | 64.0% | 80.0% | Concurrent programming |
| Ruby | 59.0% | 74.0% | Web frameworks |
| PHP | 60.0% | 76.0% | Web development |
| Swift | 56.0% | 72.0% | iOS development |
Comparison to Other Models
| Model | HumanEval Pass@1 | MBPP Pass@1 | Parameters |
|---|---|---|---|
| GPT-4-turbo | 84.0% | 80.0% | Unknown |
| Claude-3.5-Sonnet | 82.0% | 78.0% | Unknown |
| Troviku-1.1 | 72.0% | 68.0% | 7B |
| CodeLlama-34B | 68.0% | 62.0% | 34B |
| StarCoder2-15B | 66.0% | 60.0% | 15B |
| WizardCoder-15B | 64.0% | 58.0% | 15B |
Quick Start
Installation
pip install troviku-client transformers torch
Using Transformers Library
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and model weights from the Hugging Face Hub
model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Prompt with the start of a function definition and let the model complete it
prompt = "def calculate_fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
# Decode the generated tokens back into source code
code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)
Using Troviku Client
from troviku_client import TrovikuClient, Language
# Authenticate with your OpenTrouter API key
client = TrovikuClient(api_key="your_api_key")
response = client.generate(
    prompt="Create a binary search tree implementation with insert and search methods",
    language=Language.PYTHON,
    max_tokens=1024
)
print(response.code)  # generated source code
API Integration
import requests
url = "https://api.opentrouter.ai/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",  # replace with your API key
    "Content-Type": "application/json"
}
payload = {
    "model": "OpenTrouter/Troviku-1.1",
    "messages": [
        {"role": "user", "content": "Write a function to calculate Fibonacci numbers"}
    ],
    "temperature": 0.7  # lower values give more deterministic completions
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())
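Assuming the endpoint returns the OpenAI-compatible chat-completions schema implied by the /v1/chat/completions path (an assumption, not something documented above), the generated code can be pulled out of the response like this, continuing from the request shown above:

```python
# "choices"/"message"/"content" keys are assumed from the OpenAI-compatible schema,
# not confirmed by this card.
data = response.json()
generated = data["choices"][0]["message"]["content"]
print(generated)
```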
Model Architecture
Architecture Type: Transformer Decoder
Number of Layers: 32
Hidden Size: 4096
Attention Heads: 32
Key-Value Heads: 8 (Grouped Query Attention)
Intermediate Size: 14336
Activation Function: SiLU (Swish)
Vocabulary Size: 32,768 tokens
Positional Encoding: RoPE (Rotary Position Embedding)
Normalization: RMSNorm
Precision: bfloat16
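Since the base model is Llama-2-7b-hf, the specification above maps roughly onto a transformers LlamaConfig. The sketch below is illustrative only; fields not listed above (such as rms_norm_eps) are assumed defaults rather than confirmed values.

```python
from transformers import LlamaConfig

# Illustrative config mirroring the specification above; remaining fields are assumptions.
config = LlamaConfig(
    vocab_size=32_768,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,          # grouped-query attention
    intermediate_size=14336,
    hidden_act="silu",
    max_position_embeddings=8192,   # 8,192-token context window
    rms_norm_eps=1e-5,              # assumed, not stated in this card
)
print(config)
```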
Hardware Requirements
Minimum Requirements
- GPU: 16GB VRAM (e.g., NVIDIA RTX 4090, A10)
- RAM: 32GB system memory
- Storage: 20GB for model weights
Recommended Requirements
- GPU: 24GB+ VRAM (e.g., NVIDIA A100, RTX 6000 Ada)
- RAM: 64GB system memory
- Storage: 50GB for model, cache, and datasets
Quantization Support
- int8: 8GB VRAM, 2x faster inference
- int4: 4GB VRAM, 4x faster inference (see the loading sketch after this list)
- GPTQ: Optimized 4-bit quantization
- AWQ: Activation-aware quantization
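As a sketch of how the int4 path might be used with the transformers/bitsandbytes stack: the card does not publish official quantized checkpoints, so this example quantizes the full-precision weights on load, and the specific settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization on load via bitsandbytes; settings are illustrative
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "OpenTrouter/Troviku-1.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("OpenTrouter/Troviku-1.1")
```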
Limitations
Technical Limitations
- Context window limited to 8,192 tokens
- May generate syntactically correct but logically flawed code
- Performance degrades on very specialized or proprietary frameworks
- Limited understanding of complex multi-file codebases
- May not always follow organization-specific coding standards
Language-Specific Limitations
- Stronger performance on popular languages (Python, JavaScript, Java)
- Weaker performance on rare or legacy languages
- Limited knowledge of cutting-edge language features released after training cutoff
- May struggle with highly domain-specific DSLs
Safety Considerations
- Generated code should always be reviewed by experienced developers
- Security-critical code requires thorough security audits
- May inadvertently suggest vulnerable code patterns
- Not suitable for safety-critical systems without extensive testing
Bias Considerations
- May reflect biases present in training data (e.g., over-representation of certain coding styles)
- Training data predominantly from English-language repositories
- Potential underrepresentation of non-Western coding conventions
- May perpetuate historical biases in variable naming and comments
Ethical Considerations
Environmental Impact
- Training Emissions: Approximately 25 tons CO2 equivalent
- Mitigation: Used renewable energy data centers, carbon offset programs
- Inference Efficiency: Optimized for low-latency, energy-efficient deployment
Attribution and Licensing
- All training data sourced from permissively licensed repositories
- Respects original authors' licensing terms
- Provides attribution capabilities in generated code comments
- Excludes copyleft-licensed code to prevent license contamination
Dual-Use Concerns
The model could potentially be misused for:
- Generating malicious code or exploits
- Automating spam or phishing campaigns
- Creating code to circumvent security measures
Mitigation Strategies:
- Refusal training for malicious code generation requests
- Usage monitoring and rate limiting
- Terms of service enforcement
- Community reporting mechanisms
- Collaboration with security researchers
License
This model is released under the Apache License 2.0.
License Terms Summary
- Commercial Use: Permitted
- Modification: Permitted
- Distribution: Permitted
- Patent Use: Permitted
- Private Use: Permitted
Conditions:
- License and copyright notice must be included
- State changes made to the code
- Provide attribution to original authors
Limitations:
- No trademark use
- No liability or warranty
See the LICENSE file for full details.
Citation
If you use Troviku-1.1 in your research or projects, please cite:
@misc{troviku2025,
title={Troviku-1.1: A Specialized Code Generation Model},
author={OpenTrouter Research Team},
year={2025},
publisher={OpenTrouter},
howpublished={\url{https://github.com/OpenTrouter/Troviku-1.1}},
note={Apache License 2.0}
}
Support and Community
- Documentation: https://docs.opentrouter.ai/troviku
- Issues: GitHub Issues
- Discord: OpenTrouter Community
- Email: [email protected]
- Twitter: @OpenTrouter
Acknowledgments
The Troviku team acknowledges:
- The open-source community for providing training data
- BigCode project for The Stack v2 dataset
- Hugging Face for infrastructure and hosting
- NVIDIA for compute support
- All contributors who helped with model evaluation and testing
Version History
v1.1.0 (Current - January 15, 2025)
- Initial release of the Troviku series
- Support for 25+ programming languages
- Optimized inference performance
- Enhanced code quality and safety features
- RLHF alignment for improved code generation
Upcoming Features (v1.2.0)
- Extended context window to 16,384 tokens
- Improved multi-file code understanding
- Enhanced support for rare programming languages
- Better handling of code comments and documentation
- Integration with popular IDEs