WimBERT Synth v0 — Hugging Face Space Plan

This plan describes a lightweight, reliable Space to demo the dual‑head multi‑label classifier (onderwerp + beleving) defined by wimbert-synth-v0/model.py, with labels from wimbert-synth-v0/label_names.json and licensing in wimbert-synth-v0/LICENSE (Apache‑2.0).

Goals

  • Input: a single Dutch “signaalbericht” (free‑text).
  • Output: per head (onderwerp, beleving), show probabilities for all labels:
    • Visual: color‑coded list/table where color intensity reflects probability.
    • Numeric: exact probability values (0–1) and top‑K summary.
    • “Predicted” set using an adjustable threshold (default 0.5).
  • UX: one‑click Predict button; optional “live” inference (after brief inactivity).
  • Portable, reproducible, and fast enough on CPU; optionally GPU‑ready.

Toolkit Choice

  • Gradio is the best fit for this demo on Spaces:
    • First‑class support on Hugging Face Spaces, minimal boilerplate (app.py).
    • Simple event model (button click, input change) and components for text, tabs, HTML, charts.
    • Easy to serve both a compact top‑K view and a full “all labels” view with custom styling.
    • No Streamlit server/page lifecycle complexities for this small, single‑page inference app.

Model + License

  • Model artifacts live in wimbert-synth-v0/ under the Apache‑2.0 license (redistribution permitted with attribution). Ship the exact LICENSE file in the Space repo.
  • The model is large (~1.2 GB for model.safetensors). To keep the Space repo small and boot times predictable, prefer hosting the model as a separate Model repo on the Hub, then download/cache in the Space at runtime.
    • Recommended: publish a model repo, e.g. UWV/wimbert-synth-v0, containing:
      • model.safetensors, config.json, tokenizer files, dual_head_state.pt, label_names.json, model.py, README.md, LICENSE.
    • The Space loads via DualHeadModel.from_pretrained(<model_repo_or_local_dir>).

UX & Visualization

  • Input: gr.Textbox(label="Signaalbericht", lines=6, placeholder=...).
  • Controls:
    • Predict button (primary path).
    • Auto-run toggle to enable live inference: trigger after the user stops typing for ~600–800 ms (via Gradio's input event with debouncing or a simple timer wrapper; see the wiring sketch after the pseudocode below). If CPU latency is borderline, keep it off by default.
    • Threshold slider (0.0–1.0, default 0.5) to highlight predicted labels.
    • Top‑K slider (1–15, default 5) to size the summary.
  • Output: tabs per head and views:
    • Tab 1: “Samenvatting” → two columns for Onderwerp and Beleving, each listing Top‑K labels with probabilities.
    • Tab 2: “Alle labels” → scrollable, color‑coded tables (or HTML lists) for every label with exact probabilities.
    • Tab 3: “JSON/CSV” → exportable raw probabilities (dict of label → prob) + list of predicted labels at current threshold.
  • Color mapping:
    • Use a light‑to‑dark monochrome scale (e.g., blues or greens) where intensity is proportional to probability; add a subtle border for labels above the threshold.
    • Ensure AA text contrast and always show the numeric value so the display does not rely on color alone (accessibility).
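
A minimal sketch of this color mapping; the blue hue, the lightness range, and the helper names are illustrative choices, not project decisions:

def prob_cell_style(prob: float, threshold: float) -> str:
    """Inline CSS for one label cell: light-to-dark blue, darker = higher prob."""
    # Lightness runs from 95% (p=0) down to 45% (p=1); dark cells switch to
    # white text so both ends keep AA contrast.
    lightness = 95 - int(prob * 50)
    text = "#000" if lightness > 70 else "#fff"
    border = "2px solid #1d4ed8" if prob >= threshold else "1px solid #ddd"
    return (f"background: hsl(215, 70%, {lightness}%); "
            f"color: {text}; border: {border}; padding: 2px 6px;")

def row_html(label: str, prob: float, threshold: float) -> str:
    """One table row: color-coded label cell plus the exact probability."""
    style = prob_cell_style(prob, threshold)
    return f'<tr><td style="{style}">{label}</td><td>{prob:.3f}</td></tr>'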

Space Layout

  • Repo root (Space):
    • app.py — Gradio app with UI + inference.
    • requirements.txt — runtime deps.
    • README.md — usage, model card link, privacy note.
    • LICENSE — Apache‑2.0 (from wimbert-synth-v0/LICENSE).
    • Optional: assets/ (logo), examples/ (preset texts), .gitattributes.
  • The model is not vendored into the Space to avoid 1.2 GB LFS; it’s pulled at startup via huggingface_hub.snapshot_download or from_pretrained on the Hub repo.

Dependencies

  • gradio>=4.0
  • transformers>=4.40
  • torch (CPU is fine; GPU preferred if available)
  • safetensors, huggingface_hub
  • Optional perf: accelerate (device placement), onnxruntime/optimum (future optimization)
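
A requirements.txt consistent with this list (the pins mirror the bullets above; they are suggestions, not tested constraints):

gradio>=4.0
transformers>=4.40
torch
safetensors
huggingface_hub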

Inference Design

  • Load once at Space start (global singleton). Warm up with a short dummy input.
  • Device: choose cuda if available, else CPU. Cast to float16 on GPU; keep float32 on CPU.
  • Tokenization: use max_length from dual_head_state.pt config; allow truncation; optionally expose a compact/fast mode (e.g., cap at 512) if CPU latency needs improvement.
  • Output structures:
    • Per head, a full list of dicts [{label, prob, predicted}, …] with predicted = prob >= threshold (see the builder sketched after this list).
    • Top‑K lists derived from the sorted full list.
  • Visualization adapters render the above into: HTML tables (for color‑coding), and JSON/CSV text.
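
A sketch of the per-head structure builder described above; it assumes probs is a flat sequence of probabilities aligned with the head's label list from label_names.json:

def build_head_view(labels, probs, threshold, topk):
    """Return the full sorted list of {label, prob, predicted} plus the top-K slice."""
    rows = [
        {"label": lab, "prob": float(p), "predicted": float(p) >= threshold}
        for lab, p in zip(labels, probs)
    ]
    rows.sort(key=lambda r: r["prob"], reverse=True)
    return rows, rows[:topk]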

Event Flow

  1. User edits text.
  2. If Auto‑run enabled, debounce and run; else wait for Predict button.
  3. Tokenize → model.predict → probs (two tensors).
  4. Sort, slice to Top‑K summary and prepare full tables.
  5. Render to tabs and compact “Predicted labels” chips (one line per head).

Pseudocode Sketch (app.py)

import importlib.util
import json

import gradio as gr
import torch
from huggingface_hub import snapshot_download

MODEL_REPO = "UWV/wimbert-synth-v0"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the model repo (cached across restarts) and import DualHeadModel
# from the model.py it ships.
model_dir = snapshot_download(MODEL_REPO)
spec = importlib.util.spec_from_file_location("model", f"{model_dir}/model.py")
model_mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model_mod)
DualHeadModel = model_mod.DualHeadModel

model, tokenizer, cfg = DualHeadModel.from_pretrained(model_dir, device=DEVICE)

def encode(text):
    """Tokenize and move both tensors to the model's device."""
    enc = tokenizer(text or "", truncation=True, padding="max_length",
                    max_length=cfg["max_length"], return_tensors="pt")
    return enc["input_ids"].to(DEVICE), enc["attention_mask"].to(DEVICE)

# Warm-up so the first real request does not pay the one-time setup cost
_ = model.predict(*encode("Hoi"))

def predict(text, threshold, topk):
    on_p, be_p = model.predict(*encode(text))  # one probability tensor per head
    # Build the three views (see build_head_view under "Inference Design"):
    # top-K summary, full color-coded HTML table, JSON dump ...
    return topk_view, all_labels_html, json_text

with gr.Blocks(title="WimBERT Synth v0") as demo:
    # Inputs, controls, tabs, outputs ...
    ...

if __name__ == "__main__":
    demo.launch()
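
The Auto-run path from "UX & Visualization" could be wired as follows, inside the gr.Blocks context. This is a sketch: txt, auto_run, the two sliders, and the three output components stand in for the UI elided above, and Gradio's trigger_mode="always_last" coalesces rapid input events; it approximates a timed debounce rather than enforcing an exact 600–800 ms delay.

auto_run = gr.Checkbox(label="Auto-run", value=False)

def live_predict(text, threshold, topk, enabled):
    if not enabled:
        # Live mode off: leave all three outputs unchanged
        return gr.update(), gr.update(), gr.update()
    return predict(text, threshold, topk)

txt.input(
    live_predict,
    inputs=[txt, threshold_slider, topk_slider, auto_run],
    outputs=[topk_out, all_labels_out, json_out],
    trigger_mode="always_last",  # drop stale events while one is still queued
)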

Performance Notes

  • CPU on free Spaces will work but can be slow for long texts (base mmBERT at max_length≈1408). Mitigations:
    • Warm‑up once; cap max length to 512 in a “fast mode” toggle; show spinner while running.
    • Prefer a small GPU (T4 small) if available; cast to fp16 on GPU.
  • Caching: snapshot_download uses the shared cache; subsequent restarts are faster.
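
Both mitigations, sketched under the assumption that DualHeadModel behaves like a regular torch.nn.Module (so .half() is available):

if DEVICE.type == "cuda":
    model = model.half()  # fp16 on GPU: smaller memory footprint, faster inference

def effective_max_length(fast_mode: bool) -> int:
    # "Fast mode" caps sequence length to cut CPU latency on long texts
    return min(512, cfg["max_length"]) if fast_mode else cfg["max_length"]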

Privacy & Safety

  • The Space processes user text in memory only; no logging beyond Gradio defaults. Mention this in the Space README.
  • Include a “Use responsibly” note (analytics/routing aid; no automated decisions) mirroring the model card.

Deliverables

  • app.py with:
    • Robust model loading (Hub), device selection, warm‑up.
    • Predict function returning: top‑K per head, full colored table, JSON dump.
    • UI: textbox, Predict button, Auto‑run toggle (debounced), threshold & Top‑K sliders, tabs per view.
    • Example(s) from the model card (widget example) via gr.Examples.
  • requirements.txt (gradio, transformers, torch, huggingface_hub, safetensors).
  • README.md with screenshots, hardware recommendation, and links to the model card.
  • LICENSE copied from wimbert-synth-v0/LICENSE.
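
The gr.Examples wiring is a single call inside the Blocks context, where txt is the input Textbox from the sketch above; the sample text below is a placeholder, and the real entries should come from the model card:

gr.Examples(
    examples=["Ik wacht al drie weken op een reactie op mijn aanvraag."],
    inputs=[txt],
)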

Step‑By‑Step

  1. Publish/verify model on Hub (UWV/wimbert-synth-v0), including model.py and license.
  2. Create Space repo with SDK=Gradio and pick hardware (CPU → OK; GPU → faster).
  3. Add Space files (app.py, requirements.txt, README.md, LICENSE).
  4. Implement and test inference locally (CPU) with a few sample texts; tune debounce/threshold defaults.
  5. Push Space; verify cold‑start time and inference latency; adjust max_length and hardware if needed.
  6. Polish visuals (colors, fonts, accessibility), add screenshots, and publish.

Nice‑To‑Haves (Later)

  • Per‑class thresholds (if you decide to introduce learned or tuned thresholds).
  • ONNX/Optimum path for CPU acceleration.
  • Session‑level analytics (aggregate latency, not storing user text).
  • Download CSV/JSON of the current result.
  • Translations for UI labels (NL/EN toggle).

Summary: Use Gradio for a single‑page Space that downloads the Apache‑licensed model from the Hub, offers both button‑based and debounced live inference, and presents per‑head probabilities as color‑coded tables with numeric values, plus top‑K and JSON outputs.