ManniX PRO
ManniX-ITA/gemma-4-31b-he1-it-NVFP4A16
The v1.1.0 post had a single "one model wins all" PROD line; v1.8 has per-axis baselines from three live cohorts (May 16-17 '26).
✍️ /consultants coder — per-language routing (mlang v1.0.1, suite hash ddef8095, 6 langs × 13 questions).
| Lang | Primary | Fallback |
|---|---|---|
| c | glm-5.1:cloud | deepseek-v4-pro:cloud |
| cpp | deepseek-v4-flash:cloud | kimi-k2.6:cloud |
| csharp | deepseek-v4-pro:cloud | kimi-k2.6:cloud |
| go | kimi-k2.6:cloud | deepseek-v4-pro:cloud |
| python | glm-5.1:cloud | kimi-k2.6:cloud |
| rust | deepseek-v4-flash:cloud | deepseek-v4-pro:cloud |
No single model wins across languages. Out-of-cohort langs (ts/java/ruby/swift/shell) fall to global default glm-5.1:cloud → kimi-k2.6:cloud.
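A minimal sketch of the routing described above, with the primary/fallback pairs copied from the table. The names `ROUTES`, `GLOBAL_DEFAULT`, and `models_for` are hypothetical and are not taken from the claude-hooks codebase.

```python
from typing import Dict, List, Tuple

# (primary, fallback) per language, copied from the table above.
ROUTES: Dict[str, Tuple[str, str]] = {
    "c": ("glm-5.1:cloud", "deepseek-v4-pro:cloud"),
    "cpp": ("deepseek-v4-flash:cloud", "kimi-k2.6:cloud"),
    "csharp": ("deepseek-v4-pro:cloud", "kimi-k2.6:cloud"),
    "go": ("kimi-k2.6:cloud", "deepseek-v4-pro:cloud"),
    "python": ("glm-5.1:cloud", "kimi-k2.6:cloud"),
    "rust": ("deepseek-v4-flash:cloud", "deepseek-v4-pro:cloud"),
}

# Out-of-cohort languages fall through to this pair.
GLOBAL_DEFAULT: Tuple[str, str] = ("glm-5.1:cloud", "kimi-k2.6:cloud")


def models_for(lang: str) -> List[str]:
    """Return [primary, fallback] for a language, or the global default."""
    return list(ROUTES.get(lang.lower(), GLOBAL_DEFAULT))
```

Under this sketch, `models_for("rust")` tries deepseek-v4-flash first, while `models_for("ts")` lands on the global default pair.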
🛠 tool_executor — 48 trials, suite 7921555c:
• gemma4:31b-cloud — 87.5% / Q=5.00 ← winner (tiebreak: quality → wall → call count)
• kimi-k2.6 / deepseek-v4-pro / glm-5.1 — all 87.5% but lost tiebreakers
• gemini-3-flash-preview — 75% (qualifies)
• qwen3-coder-next — 62.5%, DISQUALIFIED. The Python coder winner is NOT a good tool_executor — "best at writing code" ≠ "best at mechanical tool chains for reading code".
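The tiebreak order above (quality → wall → call count) can be read as a multi-key sort. The sketch below assumes that higher quality, lower wall time, and fewer tool calls are better; the `Result` type and the sample values in the usage note are hypothetical, not benchmark data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    model: str
    pass_rate: float    # fraction of trials passed, 0.0 to 1.0
    avg_quality: float  # rubric quality score
    wall_s: float       # wall-clock seconds (assumed: lower is better)
    calls: int          # tool calls made (assumed: fewer is better)


def rank(results: List[Result]) -> List[Result]:
    """Sort by pass rate, then break ties on quality, wall time, call count."""
    return sorted(
        results,
        key=lambda r: (-r.pass_rate, -r.avg_quality, r.wall_s, r.calls),
    )
```

For example, two made-up entries with the same pass rate and quality would be separated by wall time: `rank([Result("m1", 0.875, 5.0, 40.0, 9), Result("m2", 0.875, 5.0, 30.0, 12)])` puts `m2` first.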
⏱ Stall thresholds (M11a, suite c8306c62):
• kimi-k2.6:cloud cold TTFT = 150s → needs stall=390s. The global 300s default would falsely STARTUP_STALL it.
• qwen3-coder-next fastest startup (390ms).
• deepseek-v4-flash slowest p99 wall (255s) → hard_cap raised to 780s.
🧮 Rubric: pass_rate ≥ 70% AND avg_quality ≥ 3.5. Baselines append-only at docs/consultants-skill-eval-baselines.md. Re-bench any new model with claude-consultants skill-eval <suite> --live --accept-cost.
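As a rough illustration of the rubric gate, the check below applies both thresholds from the line above. The function name `qualifies` and the constant names are hypothetical; only the two threshold values come from the post.

```python
MIN_PASS_RATE = 0.70   # pass_rate >= 70%
MIN_AVG_QUALITY = 3.5  # avg_quality >= 3.5


def qualifies(pass_rate: float, avg_quality: float) -> bool:
    """A model qualifies only if it clears BOTH thresholds."""
    return pass_rate >= MIN_PASS_RATE and avg_quality >= MIN_AVG_QUALITY
```

With these thresholds a 62.5% pass rate fails the gate regardless of quality, which matches the disqualification reported for qwen3-coder-next in the tool_executor cohort.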
🔗 Bench dirs: benchmarks/consultants/results/2026-05-17/
🚀 claude-hooks v1.8.3 — highlights since v1.1.0.
🧠 /consultants v2 — agentic council, matured.
🛠 tool_executor — PLAN→REPORT lane runs read_file / grep / glob over the codebase before the researcher speaks; claims grounded in tool output, not vibes.
✍️ coder: sandboxed write_file role with per-language model routing (50KB/file, 1MB/lane caps).
🛡️ CitationLinter: three-layer verifier at the researcher boundary; every path:line claim is checked against an mtime-cached code_graph. Catches fabricated filenames before they launder through critics + synthesizer.
💾 M14 cross-session memory (default on).
LangGraph BaseStore wired across four namespaces: research / tool_results / project / user. Per-namespace TTL: research=30d, tool_results=24h, project+user=forever. An hourly Caliber-style distillation reaper summarizes expiring research into the durable project namespace BEFORE deletion (episodic → semantic, like human consolidation). Originals are only dropped after a successful summary write.
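The summarize-then-delete ordering can be sketched as follows. This is not the LangGraph BaseStore API: the store here is a plain dict, and `reap`, `Store`, and the `distilled:` key prefix are hypothetical names used only for illustration. The TTL values are the ones listed above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DAY = 24 * 3600

# TTLs from the post; None means "keep forever".
TTL_S: Dict[str, Optional[int]] = {
    "research": 30 * DAY,
    "tool_results": DAY,
    "project": None,
    "user": None,
}

# Hypothetical in-memory store: (namespace, key) -> (written_at, text).
Store = Dict[Tuple[str, str], Tuple[float, str]]


def reap(store: Store, summarize: Callable[[str], str],
         now: Optional[float] = None) -> int:
    """Drop expired items, distilling expired research into 'project' first."""
    now = time.time() if now is None else now
    dropped = 0
    for (ns, key), (written_at, text) in list(store.items()):
        ttl = TTL_S.get(ns)
        if ttl is None or now - written_at < ttl:
            continue  # durable namespace, or item still fresh
        if ns == "research":
            try:
                summary = summarize(text)
                store[("project", f"distilled:{key}")] = (now, summary)
            except Exception:
                continue  # summary failed: keep the original for the next pass
        del store[(ns, key)]
        dropped += 1
    return dropped
```

The point of the ordering is that a failed summary never loses data: the original research item survives until a later pass writes its summary successfully.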
🔁 sqlite_vec — full pgvector parity (v1.7).
Hybrid recall via RRF over vector cosine + BM25 (FTS5). KG surface: kg_create_entities / kg_add_observations / kg_create_relations / kg_search_nodes. Bundled sqlite-vec-mcp launcher went 3→8 tools so Cursor / Codex / OpenWebUI / Claude Desktop share the same .db. Lazy schema migration carries v1.6.x dbs in place, non-destructive.
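For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of how a vector ranking and a BM25 ranking can be merged. The function name `rrf` is hypothetical, and `k = 60` is the conventional constant from the RRF literature; the value claude-hooks actually uses is not stated in the post.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]  # ordered by cosine similarity
bm25_hits = ["b", "d", "a"]    # ordered by BM25 score
fused = rrf([vector_hits, bm25_hits])  # ["b", "a", "d", "c"]
```

Because RRF only looks at ranks, the cosine and BM25 scores never need to be normalized onto a common scale, which is the usual reason to prefer it for hybrid recall.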
🧩 llamafile chat + embed (v1.4 + v1.5).
HyDE / reflect / consolidate / get-advice / consultants route to a daemon-supervised local llamafile via the llamafile:// model prefix. Multi-instance LRU, per-label idle reap, sticky CPU fallback. The stack runs offline now.
🐧 Linux / macOS / Windows. PostgreSQL OR SQLite. Local OR cloud LLMs.
🔗 github.com/mann1x/claude-hooks