Troviku-1.1

Model Card

Model Details

Organization: OpenTrouter
Model Type: Autoregressive Transformer Language Model
Model Version: 1.1.0
Release Date: January 15, 2025
Model License: Apache 2.0
Languages: 25+ programming languages
Model Size: 7 billion parameters
Context Length: 8,192 tokens
Base Model: Llama-2-7b-hf

Model Description

Troviku-1.1 is the inaugural model in the Troviku series, a family of large language models specifically engineered for advanced code generation, analysis, and software development tasks. Built on a transformer architecture with 7 billion parameters, the model has been extensively trained on high-quality code repositories, technical documentation, and algorithmic implementations. Troviku-1.1 is aimed at AI-assisted programming and delivers strong performance for its 7-billion-parameter size across multiple programming languages and software engineering paradigms (see Performance below).

Developed by: OpenTrouter Research Team
Funded by: OpenTrouter Inc., with compute support from cloud infrastructure partners
Model Family: Troviku series
Base Architecture: Transformer decoder with multi-head attention
Training Framework: PyTorch 2.1 with DeepSpeed ZeRO-3
Fine-tuning Methods: Supervised fine-tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF)

Intended Use

Primary Use Cases:

  • Code generation and autocomplete in IDE environments
  • Algorithm implementation and optimization
  • Code translation between programming languages
  • Debugging and error resolution assistance
  • Technical documentation generation
  • Code review and quality assessment
  • Test case generation and validation
  • Educational programming assistance

Intended Users:

  • Professional software developers and engineers
  • Computer science students and educators
  • DevOps and infrastructure engineers
  • Data scientists and ML engineers
  • Open-source contributors
  • Technical writers and documentation specialists

Out-of-Scope Uses:

  • Generating malicious code, exploits, or malware
  • Creating code for illegal activities or bypassing security measures
  • Production-critical systems without human review and testing
  • Medical diagnosis or treatment recommendation systems
  • Legal document generation or legal advice
  • Financial trading algorithms without regulatory compliance review
  • Autonomous systems where failures could cause physical harm

Training Data

Data Sources

The model was trained on a carefully curated dataset comprising:

  1. The Stack v2 (50% of training data)

    • Source: bigcode/the-stack-v2
    • Permissively licensed source code from GitHub
    • 3.8 million repositories across 600+ programming languages
    • Focus on top 25 languages with quality filtering
    • License: MIT, Apache 2.0, BSD-3-Clause
  2. GitHub Code Dataset (30% of training data)

    • Source: codeparrot/github-code
    • Curated code snippets and functions
    • High-quality repositories with active maintenance
    • Filtered for code quality and documentation
    • License: Multiple open-source licenses
  3. Technical Documentation (10% of training data)

    • Official language documentation (Python, JavaScript, Java, C++, etc.)
    • API references and SDK documentation
    • Framework and library documentation
    • License: CC BY 4.0, MIT, Apache 2.0
  4. Benchmark Datasets (5% of training data)

    • HumanEval: openai/humaneval
    • MBPP: google-research-datasets/mbpp
    • CodeContests: deepmind/code_contests
    • License: MIT, Apache 2.0
  5. Educational Content (5% of training data)

    • Programming tutorials and guides
    • Algorithm explanations and implementations
    • Stack Overflow posts under CC BY-SA 4.0
    • License: CC BY-SA 4.0

Total Training Tokens: 500 billion tokens
Training Duration: 45 days on 512 NVIDIA A100 GPUs
Dataset Size: Approximately 2.3 TB of text data
Languages Covered: Python, JavaScript, TypeScript, Java, C, C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, R, SQL, HTML, CSS, Bash, PowerShell, Lua, Perl, Haskell, Julia, MATLAB

Data Preprocessing

Quality Filtering:

  • Removed repositories with fewer than 10 stars or with no activity in the past two years
  • Filtered out code with syntax errors or poor quality metrics
  • Removed duplicates and near-duplicates using MinHash LSH (see the sketch after this list)
  • Excluded code containing profanity, hate speech, or toxic content
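
The near-duplicate removal step above is typically done by indexing MinHash signatures in a locality-sensitive hash table. Below is a minimal sketch using the datasketch library; the actual pipeline is not published, so the 0.85 similarity threshold and the toy corpus are assumptions for illustration only.

from datasketch import MinHash, MinHashLSH

def signature(tokens, num_perm=128):
    """Build a MinHash signature from a token sequence."""
    m = MinHash(num_perm=num_perm)
    for tok in tokens:
        m.update(tok.encode("utf-8"))
    return m

# Toy corpus standing in for source files; b.py duplicates a.py.
corpus = [
    ("a.py", "def add(a, b): return a + b"),
    ("b.py", "def add(a, b): return a + b"),
    ("c.py", "print('hello world')"),
]

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # Jaccard threshold (assumed value)
kept = []
for path, code in corpus:
    sig = signature(code.split())
    if lsh.query(sig):          # a (near-)duplicate is already indexed, so drop this file
        continue
    lsh.insert(path, sig)
    kept.append(path)

print(kept)  # -> ['a.py', 'c.py']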

Privacy Protection:

  • Scanned for and removed personally identifiable information (PII)
  • Filtered out API keys, passwords, and credentials
  • Removed private email addresses and phone numbers
  • Excluded internal company code and proprietary information
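
As a rough illustration of this kind of scrubbing: the production filters are not published, the patterns below are simplified assumptions, and real pipelines rely on dedicated secret scanners (e.g. detect-secrets) with far larger rule sets.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")  # AWS access key id pattern
SECRET = re.compile(r"(?i)(api[_-]?key|password|token)\s*[:=]\s*['\"][^'\"]+['\"]")

def scrub(text: str) -> str:
    """Replace emails and credential-like strings with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = AWS_KEY.sub("<AWS_KEY>", text)
    text = SECRET.sub(r"\1=<REDACTED>", text)
    return text

print(scrub('db_password = "hunter2"  # contact: dev@example.com'))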

License Compliance:

  • Verified all source code adheres to permissive open-source licenses
  • Excluded GPL and other copyleft-licensed code to prevent license contamination
  • Maintained attribution records for all training sources
  • Regular audits to ensure compliance with license terms
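
In practice this usually amounts to filtering repositories against an SPDX license allowlist. A minimal sketch follows; the exact allowlist used for training is not published, so the set below is an assumption based on the licenses named above.

ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "cc-by-4.0"}

def is_permissive(spdx_id: str) -> bool:
    """Keep only permissively licensed sources; copyleft (e.g. GPL) is excluded."""
    return spdx_id.lower() in ALLOWED_LICENSES

print(is_permissive("Apache-2.0"))  # True
print(is_permissive("GPL-3.0"))     # False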

Bias Mitigation:

  • Balanced representation across programming languages
  • Included code from diverse geographic regions and communities
  • Filtered out code with discriminatory variable names or comments
  • Ensured representation of different coding styles and paradigms

Training Procedure

Phase 1: Pretraining (35 days)

  • Objective: Causal language modeling on code corpus
  • Batch size: 4 million tokens
  • Learning rate: 3e-4 with cosine decay
  • Optimizer: AdamW (β1=0.9, β2=0.95, ε=1e-8)
  • Weight decay: 0.1
  • Gradient clipping: 1.0
  • Mixed precision: bfloat16
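
The optimizer settings above map directly onto a standard PyTorch/transformers setup. A minimal sketch is shown below; the tiny stand-in model, the warmup step count, and the total step count are placeholders, not the values used in training.

import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(4096, 4096)  # stand-in for the 7B model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=100_000,  # placeholder schedule
)

# One illustrative optimization step with a dummy loss (bf16 mixed precision omitted)
loss = model(torch.randn(8, 4096)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping at 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()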

Phase 2: Supervised Fine-tuning (7 days)

  • Dataset: 150,000 high-quality code examples with human annotations
  • Focus areas: Code quality, security, best practices
  • Task types: Generation, completion, translation, debugging
  • Evaluation: Held-out validation set with expert review

Phase 3: RLHF (3 days)

  • Reward model trained on 50,000 human preference comparisons
  • PPO optimization with KL penalty (β=0.01)
  • Focus: Code correctness, safety, and alignment with user intent
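
The KL penalty keeps the PPO policy close to the supervised fine-tuned model: the shaped reward is roughly r_total = r - beta * (log pi(y|x) - log pi_ref(y|x)). A minimal sketch of that shaping, for illustration only (not the training code):

def kl_shaped_reward(reward, logprob_policy, logprob_ref, beta=0.01):
    """Penalize divergence from the reference (SFT) policy during PPO."""
    kl = logprob_policy - logprob_ref  # per-sequence log-probability difference
    return reward - beta * kl

# e.g. a response the reward model likes (+1.2) but that drifts from the SFT policy
print(kl_shaped_reward(reward=1.2, logprob_policy=-40.0, logprob_ref=-55.0))  # 1.05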

Performance

Benchmark Results

Benchmark    | Dataset                 | Metric  | Score
HumanEval    | openai/humaneval        | pass@1  | 72.0%
HumanEval    | openai/humaneval        | pass@10 | 89.0%
MBPP         | mbpp                    | pass@1  | 68.0%
MBPP         | mbpp                    | pass@10 | 84.0%
CodeContests | deepmind/code_contests  | pass@1  | 45.0%
MultiPL-E    | Python                  | pass@1  | 72.0%
MultiPL-E    | JavaScript              | pass@1  | 68.0%
MultiPL-E    | Java                    | pass@1  | 65.0%
MultiPL-E    | C++                     | pass@1  | 61.0%
DS-1000      | Data Science            | pass@1  | 58.0%
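
pass@k in the tables above and below is the standard functional-correctness metric, conventionally computed with the unbiased estimator from the HumanEval paper (n samples per problem, c of them passing the tests). The sample counts in the sketch below are illustrative, not the ones used for this card.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n generations, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=200, c=144, k=1), 2))  # 0.72, i.e. a 72.0% pass@1
print(pass_at_k(n=200, c=144, k=10))           # much higher: any of 10 samples may pass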

Performance by Language

Language   | Pass@1 | Pass@10 | Notes
Python     | 72.0%  | 88.0%   | Strongest performance
JavaScript | 68.0%  | 85.0%   | Web development focused
TypeScript | 67.0%  | 84.0%   | Type-safe JS variant
Java       | 65.0%  | 82.0%   | Enterprise applications
C++        | 61.0%  | 78.0%   | System programming
Rust       | 58.0%  | 75.0%   | Memory safety focused
Go         | 64.0%  | 80.0%   | Concurrent programming
Ruby       | 59.0%  | 74.0%   | Web frameworks
PHP        | 60.0%  | 76.0%   | Web development
Swift      | 56.0%  | 72.0%   | iOS development

Comparison to Other Models

Model             | HumanEval Pass@1 | MBPP Pass@1 | Parameters
GPT-4-turbo       | 84.0%            | 80.0%       | Unknown
Claude-3.5-Sonnet | 82.0%            | 78.0%       | Unknown
Troviku-1.1       | 72.0%            | 68.0%       | 7B
CodeLlama-34B     | 68.0%            | 62.0%       | 34B
StarCoder2-15B    | 66.0%            | 60.0%       | 15B
WizardCoder-15B   | 64.0%            | 58.0%       | 15B

Quick Start

Installation

pip install troviku-client transformers torch

Using Transformers Library

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Complete a function body from a short prompt
prompt = "def calculate_fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)  # cap the generated continuation
code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(code)

Using Troviku Client

from troviku_client import TrovikuClient, Language

client = TrovikuClient(api_key="your_api_key")

response = client.generate(
    prompt="Create a binary search tree implementation with insert and search methods",
    language=Language.PYTHON,
    max_tokens=1024
)

print(response.code)

API Integration

import requests

url = "https://api.opentrouter.ai/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "OpenTrouter/Troviku-1.1",
    "messages": [
        {"role": "user", "content": "Write a function to calculate Fibonacci numbers"}
    ],
    "temperature": 0.7
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Model Architecture

Architecture Type: Transformer Decoder
Number of Layers: 32
Hidden Size: 4096
Attention Heads: 32
Key-Value Heads: 8 (Grouped Query Attention)
Intermediate Size: 14336
Activation Function: SiLU (Swish)
Vocabulary Size: 32,768 tokens
Positional Encoding: RoPE (Rotary Position Embedding)
Normalization: RMSNorm
Precision: bfloat16
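
Since the card lists Llama-2-7b-hf as the base model, the specification above corresponds to a Llama-style configuration in transformers. The sketch below is an illustration of the listed hyperparameters, not necessarily the shipped config.json; the RMSNorm epsilon is an assumed value.

from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32_768,
    hidden_size=4096,
    intermediate_size=14_336,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,          # grouped-query attention
    hidden_act="silu",
    max_position_embeddings=8_192,  # 8,192-token context window
    rms_norm_eps=1e-5,              # assumed value
)
print(config)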

Hardware Requirements

Minimum Requirements

  • GPU: 16GB VRAM (e.g., NVIDIA RTX 4080, RTX A4000)
  • RAM: 32GB system memory
  • Storage: 20GB for model weights

Recommended Requirements

  • GPU: 24GB+ VRAM (e.g., NVIDIA A100, RTX 6000 Ada)
  • RAM: 64GB system memory
  • Storage: 50GB for model, cache, and datasets

Quantization Support

  • int8: 8GB VRAM, 2x faster inference
  • int4: 4GB VRAM, 4x faster inference
  • GPTQ: Optimized 4-bit quantization
  • AWQ: Activation-aware quantization
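
Assuming the published weights are standard transformers checkpoints, 4-bit loading can be done with the bitsandbytes integration via BitsAndBytesConfig, as sketched below (load_in_8bit=True covers the int8 case; GPTQ and AWQ variants would ship as separately quantized checkpoints).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
    bnb_4bit_quant_type="nf4",
)

model_name = "OpenTrouter/Troviku-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # requires the bitsandbytes package
    device_map="auto",
)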

Limitations

Technical Limitations

  • Context window limited to 8,192 tokens
  • May generate syntactically correct but logically flawed code
  • Performance degrades on very specialized or proprietary frameworks
  • Limited understanding of complex multi-file codebases
  • May not always follow organization-specific coding standards

Language-Specific Limitations

  • Stronger performance on popular languages (Python, JavaScript, Java)
  • Weaker performance on rare or legacy languages
  • Limited knowledge of cutting-edge language features released after training cutoff
  • May struggle with highly domain-specific DSLs

Safety Considerations

  • Generated code should always be reviewed by experienced developers
  • Security-critical code requires thorough security audits
  • May inadvertently suggest vulnerable code patterns
  • Not suitable for safety-critical systems without extensive testing

Bias Considerations

  • May reflect biases present in training data (e.g., over-representation of certain coding styles)
  • Training data predominantly from English-language repositories
  • Potential underrepresentation of non-Western coding conventions
  • May perpetuate historical biases in variable naming and comments

Ethical Considerations

Environmental Impact

  • Training Emissions: Approximately 25 tons CO2 equivalent
  • Mitigation: Used renewable energy data centers, carbon offset programs
  • Inference Efficiency: Optimized for low-latency, energy-efficient deployment

Attribution and Licensing

  • All training data sourced from permissively licensed repositories
  • Respects original authors' licensing terms
  • Provides attribution capabilities in generated code comments
  • Excludes copyleft-licensed code to prevent license contamination

Dual-Use Concerns

The model could potentially be misused for:

  • Generating malicious code or exploits
  • Automating spam or phishing campaigns
  • Creating code to circumvent security measures

Mitigation Strategies:

  • Refusal training for malicious code generation requests
  • Usage monitoring and rate limiting
  • Terms of service enforcement
  • Community reporting mechanisms
  • Collaboration with security researchers

License

This model is released under the Apache License 2.0.

License Terms Summary

  • Commercial Use: Permitted
  • Modification: Permitted
  • Distribution: Permitted
  • Patent Use: Permitted
  • Private Use: Permitted

Conditions:

  • License and copyright notice must be included
  • State changes made to the code
  • Provide attribution to original authors

Limitations:

  • No trademark use
  • No liability or warranty

See the LICENSE file for full details.

Citation

If you use Troviku-1.1 in your research or projects, please cite:

@misc{troviku2025,
  title={Troviku-1.1: A Specialized Code Generation Model},
  author={OpenTrouter Research Team},
  year={2025},
  publisher={OpenTrouter},
  howpublished={\url{https://github.com/OpenTrouter/Troviku-1.1}},
  note={Apache License 2.0}
}

Support and Community

Acknowledgments

The Troviku team acknowledges:

  • The open-source community for providing training data
  • BigCode project for The Stack v2 dataset
  • Hugging Face for infrastructure and hosting
  • NVIDIA for compute support
  • All contributors who helped with model evaluation and testing

Version History

v1.1.0 (Current - January 15, 2025)

  • Initial release of the Troviku series
  • Support for 25+ programming languages
  • Optimized inference performance
  • Enhanced code quality and safety features
  • RLHF alignment for improved code generation

Upcoming Features (v1.2.0)

  • Extended context window to 16,384 tokens
  • Improved multi-file code understanding
  • Enhanced support for rare programming languages
  • Better handling of code comments and documentation
  • Integration with popular IDEs