@ManniX-ITA on Hugging Face: "v1.1.0 was Claude + Ollama chat. Eight releases later the stack is a grounded…"

v1.1.0 was Claude + Ollama chat. Eight releases later the stack is a grounded research pipeline plus a local-first memory layer; the token crunch is operational now, not a quality wall.

🚀 claude-hooks v1.8.3 — highlights since v1.1.0.

🧠 /consultants v2 — agentic council, matured.
🛠 tool_executor — PLAN→REPORT lane runs read_file / grep / glob over the codebase before the researcher speaks; claims grounded in tool output, not vibes.
✍️ coder: sandboxed write_file role with per-language model routing (50 KB/file, 1 MB/lane caps).
🛡️ CitationLinter — three-layer verifier at the researcher boundary; every path:line claim checked against an mtime-cached code_graph. Catches fabricated filenames before they launder through critics + synthesizer.
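
To make the path:line check concrete, here is a minimal sketch of one verifier layer. It is not the project's CitationLinter: the regex, the function names, and the line-count cache standing in for the code_graph are all assumptions made for illustration. It only shows the idea of resolving every cited path:line against the files on disk and re-reading a file only when its mtime changes.

```python
import re
from pathlib import Path

# Hypothetical citation pattern: matches claims such as "src/app.py:42".
CITATION_RE = re.compile(r"(?<![\w/.-])([\w./-]+\.\w+):(\d+)\b")

# Stand-in for the mtime-cached code_graph: path -> (mtime_ns, line_count).
_line_cache: dict[Path, tuple[int, int]] = {}


def line_count(path: Path) -> int:
    """Count lines in a file, reusing the cached value while mtime is unchanged."""
    mtime = path.stat().st_mtime_ns
    cached = _line_cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    with path.open("rb") as fh:
        count = sum(1 for _ in fh)
    _line_cache[path] = (mtime, count)
    return count


def lint_citations(text: str, root: Path) -> list[str]:
    """Return one error string per path:line citation that does not resolve."""
    errors = []
    for rel, line_str in CITATION_RE.findall(text):
        path = root / rel
        line_no = int(line_str)
        if not path.is_file():
            errors.append(f"{rel}:{line_no}: file not found")
        elif not 1 <= line_no <= line_count(path):
            errors.append(f"{rel}:{line_no}: line out of range")
    return errors
```

A researcher report that cites a file that does not exist, or a line past the end of a real file, comes back with a non-empty error list and can be rejected before critics or the synthesizer see it.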

💾 M14 cross-session memory (default on).
LangGraph BaseStore wired across four namespaces: research / tool_results / project / user. Per-namespace TTL: research=30d, tool_results=24h, project+user=forever. An hourly Caliber-style distillation reaper summarizes expiring research into the durable project namespace BEFORE deletion (episodic → semantic, like human consolidation). Originals are dropped only after a successful summary write.
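
The summarize-before-delete ordering can be sketched as follows. This is a plain in-memory stand-in, not the LangGraph BaseStore API: MemoryStore, Item, reap, and the "distilled:" key prefix are hypothetical names. Only the TTL values and the rule that an original survives a failed summary come from the notes above.

```python
from dataclasses import dataclass, field

DAY = 24 * 3600

# TTLs from the release notes; None means "keep forever".
TTL_SECONDS = {"research": 30 * DAY, "tool_results": DAY, "project": None, "user": None}


@dataclass
class Item:
    value: str
    created_at: float


@dataclass
class MemoryStore:
    # Hypothetical in-memory store: (namespace, key) -> Item.
    data: dict = field(default_factory=dict)

    def put(self, ns: str, key: str, value: str, now: float) -> None:
        self.data[(ns, key)] = Item(value, now)


def reap(store: MemoryStore, summarize, now: float) -> None:
    """Drop expired items; expired research is distilled into 'project' first."""
    for (ns, key), item in list(store.data.items()):
        ttl = TTL_SECONDS[ns]
        if ttl is None or now - item.created_at < ttl:
            continue  # durable namespace, or not expired yet
        if ns == "research":
            try:
                summary = summarize(item.value)
                store.put("project", f"distilled:{key}", summary, now)
            except Exception:
                continue  # summary failed: keep the original for the next run
        del store.data[(ns, key)]
```

Because deletion sits after the summary write, a failing summarizer costs only a retry on the next hourly pass, never the research item itself.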

🔁 sqlite_vec — full pgvector parity (v1.7).
Hybrid recall via RRF over vector cosine + BM25 (FTS5). KG surface: kg_create_entities / kg_add_observations / kg_create_relations / kg_search_nodes. Bundled sqlite-vec-mcp launcher went 3→8 tools so Cursor / Codex / OpenWebUI / Claude Desktop share the same .db. Lazy schema migration upgrades v1.6.x dbs in place, non-destructively.
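
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of how two ranked lists are merged. The function name and the constant k=60 are assumptions (60 is the conventional default, not a confirmed project setting); the inputs stand for the ids returned by the vector-cosine query and the FTS5 BM25 query.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]  # hypothetical ids ranked by cosine similarity
bm25_hits = ["b", "d", "a"]    # hypothetical ids ranked by BM25
print(rrf_fuse([vector_hits, bm25_hits]))  # ['b', 'a', 'd', 'c']
```

RRF uses only ranks, so cosine distances and BM25 scores never need to be normalized onto a common scale.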

🧩 llamafile chat + embed (v1.4 + v1.5).
HyDE / reflect / consolidate / get-advice / consultants route to a daemon-supervised local llamafile via the llamafile:// model prefix. Multi-instance LRU, per-label idle reap, sticky CPU fallback. Stack runs offline now.

🐧 Linux / macOS / Windows. PostgreSQL OR SQLite. Local OR cloud LLMs.

🔗 github.com/mann1x/claude-hooks

The v1.1.0 post had a single "one model wins all" PROD line; v1.8 has per-axis baselines from three live cohorts (May 16-17 '26).

✍️ /consultants coder — per-language routing (mlang v1.0.1, suite hash ddef8095, 6 langs × 13 questions).

Lang	Primary	Fallback
c	glm-5.1:cloud	deepseek-v4-pro:cloud
cpp	deepseek-v4-flash:cloud	kimi-k2.6:cloud
csharp	deepseek-v4-pro:cloud	kimi-k2.6:cloud
go	kimi-k2.6:cloud	deepseek-v4-pro:cloud
python	glm-5.1:cloud	kimi-k2.6:cloud
rust	deepseek-v4-flash:cloud	deepseek-v4-pro:cloud

No single model wins across languages. Out-of-cohort langs (ts/java/ruby/swift/shell) fall to global default glm-5.1:cloud → kimi-k2.6:cloud.
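
The routing table above amounts to a lookup with a global default. A minimal sketch, assuming hypothetical names (ROUTES, candidates, pick_model) and a caller-supplied availability check; the model pairs are copied from the table:

```python
# (primary, fallback) per language, as listed in the table above.
ROUTES = {
    "c": ("glm-5.1:cloud", "deepseek-v4-pro:cloud"),
    "cpp": ("deepseek-v4-flash:cloud", "kimi-k2.6:cloud"),
    "csharp": ("deepseek-v4-pro:cloud", "kimi-k2.6:cloud"),
    "go": ("kimi-k2.6:cloud", "deepseek-v4-pro:cloud"),
    "python": ("glm-5.1:cloud", "kimi-k2.6:cloud"),
    "rust": ("deepseek-v4-flash:cloud", "deepseek-v4-pro:cloud"),
}

# Out-of-cohort languages (ts, java, ruby, swift, shell, ...) use the global default.
DEFAULT_ROUTE = ("glm-5.1:cloud", "kimi-k2.6:cloud")


def candidates(lang: str) -> tuple[str, str]:
    """Return (primary, fallback) for a language, or the global default."""
    return ROUTES.get(lang.lower(), DEFAULT_ROUTE)


def pick_model(lang: str, is_available) -> str:
    """Return the first candidate that the availability check accepts."""
    for model in candidates(lang):
        if is_available(model):
            return model
    raise LookupError(f"no coder model available for {lang!r}")
```

For example, pick_model("go", lambda m: m != "kimi-k2.6:cloud") falls through to deepseek-v4-pro:cloud.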

🛠 tool_executor — 48 trials, suite 7921555c:
• gemma4:31b-cloud — 87.5% / Q=5.00 ← winner (tiebreak: quality → wall → call count)
• kimi-k2.6 / deepseek-v4-pro / glm-5.1 — all 87.5% but lost tiebreakers
• gemini-3-flash-preview — 75% (qualifies)
• qwen3-coder-next — 62.5%, DISQUALIFIED. The Python coder winner is NOT a good tool_executor — "best at writing code" ≠ "best at mechanical tool chains for reading code".

⏱ Stall thresholds (M11a, suite c8306c62):
• kimi-k2.6:cloud cold TTFT = 150s → needs stall=390s. The global 300s default would falsely STARTUP_STALL it.
• qwen3-coder-next fastest startup (390ms).
• deepseek-v4-flash slowest p99 wall (255s) → hard_cap raised to 780s.

🧮 Rubric: pass_rate ≥ 70% AND avg_quality ≥ 3.5. Baselines append-only at docs/consultants-skill-eval-baselines.md. Re-bench any new model with claude-consultants skill-eval <suite> --live --accept-cost.
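
The rubric and the tool_executor tiebreak order (quality → wall → call count) can be sketched together. The Result fields and function names are assumptions, and the rows in the usage lines are made-up placeholders, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    pass_rate: float    # fraction of trials passed, 0.0 to 1.0
    avg_quality: float  # mean quality score
    wall_s: float       # wall-clock seconds
    calls: int          # number of tool calls


def qualifies(r: Result) -> bool:
    """Rubric from the post: pass_rate >= 70% AND avg_quality >= 3.5."""
    return r.pass_rate >= 0.70 and r.avg_quality >= 3.5


def rank(results: list[Result]) -> list[Result]:
    """Drop disqualified models, then order by pass rate and the tiebreakers."""
    ok = [r for r in results if qualifies(r)]
    return sorted(ok, key=lambda r: (-r.pass_rate, -r.avg_quality, r.wall_s, r.calls))


# Placeholder rows for illustration only.
rows = [
    Result("model-a", 0.875, 4.8, 90.0, 12),
    Result("model-b", 0.875, 5.0, 120.0, 15),
    Result("model-c", 0.625, 4.9, 60.0, 9),  # fails the pass_rate floor
]
print([r.model for r in rank(rows)])  # ['model-b', 'model-a']
```

Here model-b beats model-a on quality despite the slower wall time, and model-c is dropped before ranking because it misses the 70% floor.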

🔗 Bench dirs: benchmarks/consultants/results/2026-05-17/
