LoggenixMoE315M: A Mid-Sized MoE Language Model (16E2A, 4K Context)
Model Card
LoggenixMoE315M is a 315M parameter Mixture-of-Experts (MoE) Causal Language Model trained from scratch on a multi-task dataset featuring root cause analysis, code generation, instruction-following, and agentic reasoning tasks.
- Architecture: Transformer with Qwen3-style MoE routing and extended 4K context capability.
- Parameter Count: 315M total, with 2 experts active per token (approx. 182M active).
- Experts: 16 in total, selected using token-level top-2 routing.
- Special Tokens: Includes custom `<think>` and `<tool_call>` tokens to support planning and tool-based interactions for agentic AI use cases (see the sketch below).
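As a quick sanity check that these special tokens are available, the snippet below looks up their IDs in the released tokenizer. This is a minimal sketch, assuming the tokens are registered in the tokenizer vocabulary; everything beyond the token names themselves is illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
)

# If <think> and <tool_call> are registered as special tokens (assumption),
# each should encode to a single id rather than several sub-word pieces.
for tok in ["<think>", "<tool_call>"]:
    ids = tokenizer(tok, add_special_tokens=False)["input_ids"]
    print(tok, "->", ids)
```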
Training Details
| Attribute | Value | 
|---|---|
| Total Params | 315M | 
| MoE Config | 16 experts, top-2 gating | 
| Dataset Type | RCA, agent reasoning, code, instruction tasks | 
| Training Epochs | 5 | 
| Effective Tokens Seen | ~ 5 Billion | 
| Train Loss (final) | 1.568 | 
| Mean Token Accuracy | ~77.6% | 
| Samples/sec | 10.04 | 
| Steps/sec | 0.314 | 
| Optimizer | AdamW | 
| Scheduler | Linear Warmup + Cosine Decay | 
| Precision | FP16 with GradScaler | 
| Checkpoint Format | HF-compatible | 
| Training Cost | ~$108 on Modal (H200) + ~$20 on a 7× RTX 4090 system on Hyperbolic + ~$10 for synthetic data generation + ~$10 for evaluation; the remainder went to failed attempts | 
| Context Length | 4096 | 
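To make the optimizer, scheduler, and precision rows above concrete, here is a minimal training-step sketch. The learning rate is inferred from the repository name (lr7e5); the warmup fraction, step count, and toy batch are illustrative assumptions, not values from the actual run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup

repo = "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo).cuda()

# AdamW + linear warmup / cosine decay, as listed in the table.
# lr is inferred from the repo name; warmup fraction and total steps are assumptions.
num_training_steps = 10_000
optimizer = torch.optim.AdamW(model.parameters(), lr=7e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * num_training_steps),
    num_training_steps=num_training_steps,
)
scaler = torch.cuda.amp.GradScaler()  # FP16 with GradScaler, per the table

# One illustrative step on a toy batch (labels = input_ids for causal LM loss).
batch = tokenizer(["Root cause: connection pool exhausted."], return_tensors="pt").to("cuda")
with torch.cuda.amp.autocast(dtype=torch.float16):
    loss = model(**batch, labels=batch["input_ids"]).loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
```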
Standard Benchmark Results
To provide context, here's how Loggenix-MoE-0.3B compares to a typical open-source model of a similar size (e.g., GPT-2 Small, 124M parameters). While not state-of-the-art on all general tasks, the model shows promising gains with few-shot prompting and strong performance on its specific training domains.
| Task | LoggenixMoE315M (0-shot) | LoggenixMoE315M (5-shot) | GPT-2 Small (124M) | 
|---|---|---|---|
| ARC-Challenge | 24% | 25% | 19% | 
| ARC-Easy | 30% | 40% | 28% | 
| BoolQ | 59% | 40% | 55% | 
| GSM8K | 0% | 0% | 0% | 
| HellaSwag | 27% | 25% | 28% | 
| OpenBookQA | 9% | 0% | 15% | 
| PIQA | 53% | 70% | 65% | 
| Winogrande | 52% | 60% | 50% | 
Note: The 0% score on GSM8K is expected, as the model was not trained on complex mathematical reasoning datasets. Its strengths lie in logical and text-based tasks, not symbolic manipulation.
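The card does not state which harness produced these scores; as a hedged sketch, numbers for these tasks are commonly reproduced with EleutherAI's lm-evaluation-harness, for example via its Python API. The task names, batch size, and few-shot setting below are assumptions.

```python
from lm_eval import evaluator

# Assumed reproduction setup: evaluate the released checkpoint on a subset of the tasks above.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned",
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag", "piqa", "winogrande"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```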
Synthetic Task Performance
These tasks simulate real-world infra and agent use cases, with scores reflecting performance on a custom synthetic benchmark dataset (on a scale of 0.0 to 1.0).
- Chain-of-Thought Reasoning: 0.80
- Log Error Detection: 0.80
- LLM Eval Reasoning: 0.80
- Python Function Calling: 0.80
- Think Token Trace Generation: 0.60
- RCA, RAG, Observability, SLI/SLO: 0.40–0.60
 
Intended Use
Suitable for:
- Instruction-following & logic-based Q&A
- Root cause analysis and structured summarization
- Lightweight code generation and debugging
- Long-context reasoning (up to 4K tokens)
- Foundation for edge-deployable AI agents
- Tool-augmented or chain-of-thought agents via `<think>` / `<tool_call>` tokens
Not suitable for:
- Open-domain factual QA at scale
- Tasks requiring multi-modal reasoning
- Safety-critical or real-time systems without human validation
 
Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer.pad_token = tokenizer.eos_token

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Summarize the root cause of a database timeout error."}
]

# Build the chat-formatted prompt (with the assistant header appended) and move it to the GPU.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        use_cache=False,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Expert Routing
This model employs top-2 token-level expert routing using router logits. Routing is guided by an auxiliary loss to balance expert load and improve convergence. Enable `output_router_logits=True` in your inference pipeline for advanced inspection or analysis of routing behavior, as sketched below.
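A minimal inspection sketch follows. It assumes the checkpoint follows the usual Hugging Face MoE output convention, where `router_logits` is returned as one tensor per MoE layer; attribute names and shapes should be verified against the actual model class.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Disk latency spiked at 02:00 UTC.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# Assumed convention: out.router_logits is a tuple with one [num_tokens, num_experts]
# tensor per MoE layer; the top-2 indices show which experts each token was routed to.
first_layer = out.router_logits[0]
top2 = torch.topk(first_layer, k=2, dim=-1).indices
print("Layer 0 expert assignments per token:", top2)
```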
License
Released under the Apache 2.0 License for both commercial and research purposes.
Acknowledgements
Trained using:
- Hugging Face Transformers
- 7× RTX 4090 (24GB VRAM) and H200 (140GB VRAM)
- Custom gradient checkpointing and mixed-precision pipeline
- Logged via Weights & Biases
Citation
```bibtex
@misc{loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5-finetuned,
  title  = {LoggenixMoE315M: A Mid-Sized Mixture-of-Experts Model with 16E2A Routing},
  author = {Kshitij Thakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.3B-A0.1B-e3-lr7e5-b16-4090-v5.1-finetuned},
  note   = {Trained from scratch on agentic reasoning + RCA + code datasets.}
}
```