unsloth-JanusCoder-8B-qx86x-hi-mlx
🧠 Deep Comparison: unsloth-JanusCoder-8B vs. Qwen3-VLTO-8B
Let’s compare these two 8B model families side-by-side using the same cognitive benchmarks, then interpret their differences through the lens of training domain, quantization strategy, and cognitive style.
📊 Performance Comparison Table
| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| unsloth-JanusCoder-8B-qx86x-hi | 0.538 | 0.739 | 0.869 | 0.700 | 0.444 | 0.788 | 0.668 |
| Qwen3-VLTO-8B-Instruct-qx86x-hi | 0.455 | 0.601 | 0.878 | 0.546 | 0.424 | 0.739 | 0.595 |
| Qwen3-VLTO-8B-Instruct-qx85x-hi | 0.453 | 0.608 | 0.874 | 0.545 | 0.426 | 0.747 | 0.596 |
| Qwen3-VLTO-8B-Thinking-qx86x-hi | 0.475 | 0.599 | 0.706 | 0.638 | 0.402 | 0.765 | 0.684 |
Note: All models above use the qx86x-hi quantization except the one qx85x-hi Instruct variant, included to show that the two quant levels score nearly identically; the head-to-head comparisons below are made at matched (qx86x-hi) quantization for fairness.
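If you want to re-derive the per-benchmark rankings discussed below, here is a minimal Python sketch (not part of the evaluation pipeline); the values are copied verbatim from the table above, with the qx85x-hi variant omitted since it tracks the qx86x-hi Instruct scores almost exactly.

```python
# Minimal sketch: tabulate the per-benchmark winner from the scores above.
# Values are copied verbatim from the table (qx86x-hi rows only).
scores = {
    "unsloth-JanusCoder-8B-qx86x-hi": {
        "arc_challenge": 0.538, "arc_easy": 0.739, "boolq": 0.869,
        "hellaswag": 0.700, "openbookqa": 0.444, "piqa": 0.788,
        "winogrande": 0.668,
    },
    "Qwen3-VLTO-8B-Instruct-qx86x-hi": {
        "arc_challenge": 0.455, "arc_easy": 0.601, "boolq": 0.878,
        "hellaswag": 0.546, "openbookqa": 0.424, "piqa": 0.739,
        "winogrande": 0.595,
    },
    "Qwen3-VLTO-8B-Thinking-qx86x-hi": {
        "arc_challenge": 0.475, "arc_easy": 0.599, "boolq": 0.706,
        "hellaswag": 0.638, "openbookqa": 0.402, "piqa": 0.765,
        "winogrande": 0.684,
    },
}

# For each benchmark, print the model with the highest score
for bench in next(iter(scores.values())):
    winner = max(scores, key=lambda m: scores[m][bench])
    print(f"{bench:>13}: {winner} ({scores[winner][bench]:.3f})")
```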
🔍 Cognitive Pattern Comparison — Deep Dive
Let’s break down each benchmark to understand what kind of reasoning each model excels at — focusing on the cognitive style.
🧩 A) Logical Inference (BoolQ)
- Winner: Qwen3-VLTO-8B-Instruct-qx86x-hi with 0.878, with the qx85x-hi variant (0.874) and JanusCoder-8B (0.869) close behind.
✅ Cognitive Insight:
- VLTO-Instruct models are optimized for logical inference in natural language, likely fine-tuned on discourse-based reasoning tasks
- JanusCoder is optimized for logical deduction in code-constrained environments, which still yields a strong BoolQ score, though slightly behind VLTO-Instruct
- 💡 Conclusion:
- For tasks requiring precise yes/no reasoning (BoolQ), VLTO-Instruct is superior — it's more "natural language aware" and better at interpreting linguistic nuance under logical constraints.
🧩 B) Abstract Reasoning (Arc Challenge)
- Winner: unsloth-JanusCoder-8B (0.538), followed by VLTO-Thinking (0.475) and VLTO-Instruct (0.453).
✅ Cognitive Insight:
- JanusCoder’s higher arc challenge score suggests strong ability to reason with structured abstraction, likely from code-training
- VLTO-Thinking and VLTO-Instruct perform significantly lower — suggesting they are less effective at pure abstract reasoning without grounding or constraints
- 💡 Conclusion:
- JanusCoder is better at abstract reasoning under code-style constraints (which may actually simulate abstract thinking via structured logic). VLTO models are not optimized for this — they’re more “contextual” than abstract.
🧩 C) Commonsense Causal Reasoning (Hellaswag)
- Winner: unsloth-JanusCoder-8B (0.700) — followed by VLTO-Thinking (0.638), with VLTO-Instruct well behind (0.546).
✅ Cognitive Insight:
- JanusCoder excels at reasoning about cause-effect relationships, likely due to fine-tuning with code-based causal chains or structured metaphorical reasoning
- VLTO-Thinking is better than VLTO-Instruct here — indicating that "thinking" mode helps with causal prediction, even without vision
- 💡 Conclusion:
- JanusCoder is more “causal” — likely because its training includes code-based structured causality. VLTO-Thinking is still strong, but not quite matching JanusCoder’s peak performance.
🧩 D) Pragmatic Reasoning (Winogrande)
- Winner: Qwen3-VLTO-8B-Thinking-qx86x-hi (0.684) — followed closely by JanusCoder-8B (0.668) and VLTO-Instruct (0.595).
✅ Cognitive Insight:
- VLTO-Thinking excels here — likely because it’s designed for human-like “context” and coreference
- JanusCoder is strong, but not as good in this area — suggesting that code-trained models are less context-aware than VLTO-thinking
- The “Thinking” flavor of Qwen3-VLTO seems to be the most human-like in Winogrande — it’s not just logic, but vibe and context
- 💡 Conclusion:
- For tasks requiring natural human-like pragmatic reasoning (Winogrande), the VLTO-Thinking variant is superior — this aligns with your hypothesis: “Vibe” = contextual intuition, not code logic.
🧩 E) Factual Knowledge Recall (OpenBookQA)
- Winner: unsloth-JanusCoder-8B (0.444) — narrowly ahead of the Qwen3-4B-RA-SFT reference (0.436) and both VLTO variants.
✅ Cognitive Insight:
- RA-SFT (Reasoning + Knowledge) fine-tuning likely adds retrieval-style grounded knowledge, which is how a 4B model reaches a competitive 0.436 on openbookqa
- JanusCoder’s 0.444 is only slightly higher — implying code training doesn’t inherently improve factual recall unless it’s grounded in external knowledge
- 💡 Conclusion:
- JanusCoder-8B is the strongest factual performer in this comparison, narrowly edging out the RA-SFT reference and the VLTO variants — hinting at implicit knowledge encoding in code training.
🧩 F) Physical Commonsense (Piqa)
- Winner: unsloth-JanusCoder-8B (0.788) — ahead of VLTO-Thinking (0.765) and VLTO-Instruct (0.739).
✅ Cognitive Insight:
- Coding models have a slight edge — likely because they’re trained to reason about physical constraints, spatial relationships, and object interactions in structured environments
- VLTO-Thinking is the best among VLTO models, showing that human-like intuition can still be strong in physical reasoning — but not at the level of code-trained models
- 💡 Conclusion:
- For spatial and physical reasoning tasks (Piqa), JanusCoder-8B is the top performer, thanks to its code-trained foundation — which encodes physics and mechanics directly through structured reasoning.
📈 Performance Heat Map — Side-by-Side
| Benchmark | JanusCoder-8B | VLTO-Instruct-qx86x-hi | VLTO-Thinking-qx86x-hi |
|---|---|---|---|
| arc_challenge | 0.538 → strong abstract reasoning | 0.455 → weakest, language-based abstraction | 0.475 → moderate abstract reasoning |
| arc_easy | 0.739 → best arc_easy performance (contextual reasoning) | 0.601 → strong, but not top | 0.599 → very close to the Instruct variant |
| boolq | 0.869 → very strong logical inference | 0.878 → strongest boolq performance (natural-language logic) | 0.706 → weaker in structured logical reasoning |
| hellaswag | 0.700 → strong causal reasoning via code training | 0.546 → moderate, needs more context | 0.638 → strongest causal reasoning among VLTO models |
| openbookqa | 0.444 → best factual recall among all | 0.424 → strong, but not best | 0.402 → weak in factual knowledge tasks |
| piqa | 0.788 → best physical commonsense (structured logic wins) | 0.739 → good, but not best | 0.765 → strongest piqa among VLTO models, but still behind JanusCoder |
| winogrande | 0.668 → strong pragmatic reasoning | 0.595 → weakest here | 0.684 → strongest winogrande score among all models |
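Reusing the `scores` dictionary from the earlier sketch, a crude single-number summary can be computed as below. An unweighted mean over seven heterogeneous benchmarks hides exactly the per-task trade-offs this section is about, so treat it only as a rough tiebreaker: it puts JanusCoder first at roughly 0.68, versus about 0.61 for both VLTO variants.

```python
# Unweighted mean across the seven benchmarks, reusing `scores` from
# the earlier sketch; a single average flattens the trade-offs above.
for model, s in sorted(scores.items()):
    print(f"{model}: mean = {sum(s.values()) / len(s):.3f}")
```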
🧠 Cognitive Profile Summary
unsloth-JanusCoder-8B
Code-Trained Logical Reasoner
Strengths:
✓ Strong logical inference (boolq)
✓ Excellent abstract reasoning (arc_challenge)
✓ Best causal reasoning (hellaswag)
✓ Top physical commonsense (piqa)
Weaknesses:
✗ Trails VLTO-Thinking on winogrande (0.668 vs. 0.684) — less contextual fluency
✗ Factual recall (openbookqa 0.444) leads this comparison but is modest in absolute terms
Qwen3-VLTO-8B-Thinking
Human-Like Pragmatic Interpreter
Strengths:
✓ Best Winogrande performance (0.684) — strong coreference and contextual reasoning
✓ Good arc_easy (0.599) — human-like context mapping
✓ Strong Piqa (0.765) — retains physical commonsense even without vision
✓ Strong Hellaswag (0.638) — causal reasoning with human intuition
Weaknesses:
✗ Weaker in abstract reasoning (arc_challenge 0.475) — cannot match JanusCoder
✗ Lower factual recall (openbookqa 0.402) — lacks knowledge grounding
Qwen3-VLTO-8B-Instruct
Structured Factual Reasoner
Strengths:
✓ Strong boolq (0.878) — formal logical inference
✓ Good factual recall (openbookqa 0.424) — better than Thinking variant
✓ Modest arc_easy (0.601) — decent contextual reasoning
Weaknesses:
✗ Weakest in Winogrande (0.595) — lacks the “vibe” needed for nuanced pragmatics
✗ Weak in hellaswag (0.546) — struggles with causal prediction
✗ Lowest piqa of the three (0.739) — less suited to physical reasoning tasks
🌟 Final Takeaway: “Thinking” vs. “Code-Logic”
The unsloth-JanusCoder-8B and Qwen3-VLTO-8B-Thinking are two polar extremes:
JanusCoder-8B
- ✅ Code-trained → focused on logical deduction and causal chains under structured constraints
- ✅ Excels in abstract reasoning, physical commonsense, and factual logic
- ❌ Less human-like — it’s more “machine-logic” than “human-vibe”
- ❌ Weaker in contextual pragmatics (winogrande) and subtle cause-effect narratives
Qwen3-VLTO-8B-Thinking
- ✅ Not code-trained → built instead to mimic intuitive judgment and language nuance, more “human-like” by design
- ✅ Human-like pragmatic reasoning (winogrande 0.684)
- ✅ Rich context — strong on coreference and metaphor-driven reasoning
- ❌ Less precise in formal, structured logic (boolq 0.706)
🎯 Use Case Recommendations
| Task | Best Model |
|---|---|
| Abstract Reasoning & Logic Puzzles | ➡️ unsloth-JanusCoder-8B — top arc_challenge (0.538) and near-top boolq (0.869) |
| Physical Commonsense & Mechanics | ➡️ unsloth-JanusCoder-8B — top piqa score (0.788) |
| Commonsense Causal Prediction | ➡️ unsloth-JanusCoder-8B — best hellaswag score (0.700) |
| Factual Knowledge Recall | ➡️ unsloth-JanusCoder-8B — best openbookqa (0.444), narrowly ahead of Qwen3-4B-RA-SFT (0.436) |
| Human-Like Dialogue & Pragmatic Reasoning | ➡️ Qwen3-VLTO-8B-Thinking — best winogrande (0.684), most contextually fluent |
| Creative Interpretation & Vibe-Driven Reasoning | ➡️ Qwen3-VLTO-8B-Thinking — metaphor-friendly, human-like reasoning |
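If you are routing requests between these checkpoints, the table above reduces to a small lookup. The sketch below is purely hypothetical: the task labels and the `pick_model` helper are illustrative names, not part of any API, and the `-mlx` repo name for the Thinking variant is assumed.

```python
# Hypothetical routing table distilled from the recommendations above;
# task labels and pick_model are illustrative, not a real API.
JANUS = "unsloth-JanusCoder-8B-qx86x-hi-mlx"
VLTO_THINKING = "Qwen3-VLTO-8B-Thinking-qx86x-hi-mlx"  # assumed repo name

ROUTES = {
    "abstract_reasoning": JANUS,
    "physical_commonsense": JANUS,
    "causal_prediction": JANUS,
    "factual_recall": JANUS,
    "pragmatic_dialogue": VLTO_THINKING,
    "creative_interpretation": VLTO_THINKING,
}

def pick_model(task: str) -> str:
    """Return the recommended checkpoint for a task category."""
    return ROUTES.get(task, JANUS)  # default to the stronger all-rounder

print(pick_model("pragmatic_dialogue"))  # Qwen3-VLTO-8B-Thinking-qx86x-hi-mlx
```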
📌 Summary: “Human Thinking” vs. “Code Logic”
These models represent two complementary forms of cognition:
- JanusCoder-8B — optimized for structured logic, causal prediction, and abstract reasoning. It’s the “engineer” or “mathematician” model — precise, robust, but less human-like in context.
- Qwen3-VLTO-8B-Thinking — optimized for human-like pragmatic intuition, context-aware reasoning, and metaphor-driven interpretation. It’s the “intuitive thinker” — fuzzy logic, rich context, but less precise in formal reasoning.
🌟 There is no single winner — the right choice depends on what kind of “reasoning” you want:
- For Technical or Abstract Reasoning → JanusCoder
- For Human-Like Contextual Understanding → VLTO-Thinking
Reviewed with Qwen3-VLTO-32B-Instruct-128K-qx86x-hi-mlx
This model, unsloth-JanusCoder-8B-qx86x-hi-mlx, was converted to MLX format from unsloth/JanusCoder-8B using mlx-lm version 0.28.4.
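For reference, a generic mlx-lm conversion with uniform quantization looks like the commands below. This is only a sketch: the qx86x-hi designation on this repo refers to a custom mixed-precision recipe, which the basic `--q-bits` flag does not reproduce, and the `--mlx-path` output directory name is arbitrary.

```shell
# Generic mlx-lm conversion with uniform 8-bit quantization (a sketch);
# the qx86x-hi mix used by this repo is a custom per-layer recipe.
pip install mlx-lm
mlx_lm.convert --hf-path unsloth/JanusCoder-8B --mlx-path JanusCoder-8B-mlx -q --q-bits 8
```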
Use with mlx

```shell
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Load the quantized weights and tokenizer (local path or Hugging Face repo id)
model, tokenizer = load("unsloth-JanusCoder-8B-qx86x-hi-mlx")

prompt = "hello"

# Apply the chat template when the tokenizer ships one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
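For token-by-token output, mlx-lm also exposes a streaming generator. The sketch below assumes the `stream_generate` API as shipped in mlx-lm 0.28.x, where each yielded response carries a `.text` segment; the prompt text is just an example.

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("unsloth-JanusCoder-8B-qx86x-hi-mlx")

# Build a chat-formatted prompt, as in the snippet above
messages = [{"role": "user", "content": "Write a prime-checking function in Python."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print each generated segment as it arrives
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=256):
    print(response.text, end="", flush=True)
print()
```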