LoggenixMoE315M: A Mid-Sized MoE Language Model (16E2A, 4K Context)
Model Card
LoggenixMoE315M is a 315M parameter Mixture-of-Experts (MoE) Causal Language Model trained from scratch on a multi-task dataset featuring root cause analysis, code generation, instruction-following, and agentic reasoning tasks.
- Architecture: Transformer with Qwen3-style MoE routing and extended 4K context capability.
- Parameter Count: 315M total, with 2 experts active per token (approx. 182M active).
- Experts: 16 in total, selected using token-level top-2 routing.
- Special Tokens: Includes custom `<think>` and `<tool_call>` tokens to support planning and tool-based interactions for agentic AI use cases (see the sketch below).
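As a quick sanity check that these special tokens are available, the snippet below looks up their IDs in the released tokenizer. This is a minimal sketch, assuming the tokens are registered in the tokenizer vocabulary; everything beyond the token names themselves is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
)

# If <think> and <tool_call> are registered as special tokens (assumption),
# each should encode to a single id rather than several sub-word pieces.
for tok in ["<think>", "<tool_call>"]:
    ids = tokenizer(tok, add_special_tokens=False)["input_ids"]
    print(tok, "->", ids)
```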
Training Details
| Attribute | Value | 
|---|---|
| Total Params | 315M | 
| MoE Config | 16 experts, top-2 gating | 
| Dataset Type | RCA, agent reasoning, code, instruction tasks | 
| Training Epochs | 5 | 
| Effective Tokens Seen | ~ 5 Billion | 
| Train Loss (final) | 1.568 | 
| Mean Token Accuracy | ~77.6% | 
| Samples/sec | 10.04 | 
| Steps/sec | 0.314 | 
| Optimizer | AdamW | 
| Scheduler | Linear Warmup + Cosine Decay | 
| Precision | FP16 with GradScaler | 
| Checkpoint Format | HF-compatible | 
| Training Cost | ~$108 on Modal (H200) + ~$20 on a 7× RTX 4090 system on Hyperbolic + ~$10 for synthetic data generation + ~$10 for evaluation; the remainder went to failed attempts | 
| Context Length | 4096 | 
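To make the optimizer, scheduler, and precision rows above concrete, here is a minimal training-step sketch. The learning rate is inferred from the repository name (lr7e5); the warmup fraction, step count, and toy batch are illustrative assumptions, not values from the actual run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup

repo = "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo).cuda()

# AdamW + linear warmup / cosine decay, as listed in the table.
# lr is inferred from the repo name; warmup fraction and total steps are assumptions.
num_training_steps = 10_000
optimizer = torch.optim.AdamW(model.parameters(), lr=7e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * num_training_steps),
    num_training_steps=num_training_steps,
)
scaler = torch.cuda.amp.GradScaler()  # FP16 with GradScaler, per the table

# One illustrative step on a toy batch (labels = input_ids for causal LM loss).
batch = tokenizer(["Root cause: connection pool exhausted."], return_tensors="pt").to("cuda")
with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = model(**batch, labels=batch["input_ids"]).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
```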
Standard Benchmark Results
To provide context, here's how Loggenix-MoE-0.3B compares to a typical open-source model of a similar size (e.g., GPT-2 Small, 124M parameters). While not state-of-the-art on all general tasks, the model shows promising gains with few-shot prompting and strong performance on its specific training domains.
| Task | LoggenixMoE315M (0-shot) | LoggenixMoE315M (5-shot) | GPT-2 Small (124M) | 
|---|---|---|---|
| ARC-Challenge | 24% | 25% | 19% | 
| ARC-Easy | 30% | 40% | 28% | 
| BoolQ | 59% | 40% | 55% | 
| GSM8K | 0% | 0% | 0% | 
| HellaSwag | 27% | 25% | 28% | 
| OpenBookQA | 9% | 0% | 15% | 
| PIQA | 53% | 70% | 65% | 
| Winogrande | 52% | 60% | 50% | 
Note: The 0% score on GSM8K is expected, as the model was not trained on complex mathematical reasoning datasets. Its strengths lie in logical and text-based tasks, not symbolic manipulation.
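The card does not state which harness produced these scores; as a hedged sketch, numbers for these tasks are commonly reproduced with EleutherAI's lm-evaluation-harness, for example via its Python API. The task names, batch size, and few-shot setting below are assumptions.

```python
from lm_eval import evaluator

# Assumed reproduction setup: evaluate the released checkpoint on a subset of the tasks above.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag", "piqa", "winogrande"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```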
Synthetic Task Performance
These tasks simulate real-world infra and agent use cases, with scores reflecting performance on a custom synthetic benchmark dataset (on a scale of 0.0 to 1.0).
- Chain-of-Thought Reasoning: 0.80
- Log Error Detection: 0.80
- LLM Eval Reasoning: 0.80
- Python Function Calling: 0.80
- Think Token Trace Generation: 0.60
- RCA, RAG, Observability, SLI/SLO: 0.40–0.60
 
Intended Use
Suitable for:
- Instruction-following & logic-based Q&A
- Root cause analysis and structured summarization
- Lightweight code generation and debugging
- Long-context reasoning (up to 4K tokens)
- Foundation for edge-deployable AI agents
- Tool-augmented or chain-of-thought agents via `<think>` / `<tool_call>` tokens
Not suitable for:
- Open-domain factual QA at scale
- Tasks requiring multi-modal reasoning
- Safety-critical or real-time systems without human validation
 
Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer.pad_token = tokenizer.eos_token

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Summarize the root cause of a database timeout error."}
]

# Build the chat-formatted prompt (with the assistant header appended) and move it to the GPU.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        use_cache=False,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Expert Routing
This model employs top-2 token-level expert routing using router logits. Routing is guided by an auxiliary loss to balance expert load and improve convergence. Enable `output_router_logits=True` in your inference pipeline for advanced inspection or analysis of routing behavior, as sketched below.
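A minimal inspection sketch follows. It assumes the checkpoint follows the usual Hugging Face MoE output convention, where `router_logits` is returned as one tensor per MoE layer; attribute names and shapes should be verified against the actual model class.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Disk latency spiked at 02:00 UTC.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# Assumed convention: out.router_logits is a tuple with one [num_tokens, num_experts]
# tensor per MoE layer; the top-2 indices show which experts each token was routed to.
first_layer = out.router_logits[0]
top2 = torch.topk(first_layer, k=2, dim=-1).indices
print("Layer 0 expert assignments per token:", top2)
```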
License
Released under the Apache 2.0 License for both commercial and research purposes.
Acknowledgements
Trained using:
- Hugging Face Transformers
- 7× RTX 4090 (24GB VRAM) and H200 (140GB VRAM)
- Custom gradient checkpointing and mixed-precision pipeline
- Logged via Weights & Biases
Citation
```bibtex
@misc{loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned,
  title  = {LoggenixMoE315M: A Mid-Sized Mixture-of-Experts Model with 16E2A Routing},
  author = {Kshitij Thakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5.1-finetuned},
  note   = {Trained from scratch on agentic reasoning + RCA + code datasets.}
}
```