tei-annotator / README.md
cmboulanger's picture
Upload folder using huggingface_hub
89b03df verified
metadata
title: TEI Annotator
emoji: 📝
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Annotate plain text with TEI XML tags using an LLM backend

A Python library for annotating plain text with TEI XML tags using a two-stage LLM pipeline.

  1. (Optional) GLiNER pre-detection — fast CPU-based span labelling generates candidates for the LLM to verify and extend.
  2. LLM annotation — a prompted language model identifies entities and returns structured spans (element, verbatim text, surrounding context, attributes).
  3. Deterministic post-processing — spans are resolved to character offsets, validated against the schema, and injected as XML tags. The source text is never modified by any model call.

Pipeline stages

  Input text
       │
       ▼  strip existing XML tags
       ▼  (optional) GLiNER pre-detection  ──→  tei_annotator/detection/
       ▼  chunk text                        ──→  tei_annotator/chunking/
       ▼  build LLM prompt                  ──→  tei_annotator/prompting/
       ▼  LLM inference                     ──→  tei_annotator/inference/
       ▼  parse JSON response               ──→  tei_annotator/postprocessing/
       ▼  resolve spans → char offsets
       ▼  validate against schema
       ▼  inject XML tags
       │
       ▼
  Annotated XML output

Stage documentation: Data models · GLiNER detection · Chunking · Prompt building · Inference configuration · Post-processing · Evaluation


Disclaimer: The code in this repository was generated by Claude (Anthropic) based on prompts and direction provided by @cboulanger.


Installation

Requires Python ≥ 3.12 and uv.

git clone <repo>
cd tei-annotator
uv sync                    # runtime deps: jinja2, lxml, rapidfuzz
uv sync --extra gliner     # also installs gliner for optional pre-detection

API keys for LLM endpoints go in .env (copy from .env.template).


Quick start

from tei_annotator import annotate, TEISchema, TEIElement, TEIAttribute
from tei_annotator import EndpointConfig, EndpointCapability

schema = TEISchema(
    rules=[
        "Emit a 'surname' span within every enclosing 'persName' span.",
    ],
    elements=[
        TEIElement(
            tag="persName",
            description="a person's name",
            attributes=[TEIAttribute(name="ref", description="authority URI")],
        ),
        TEIElement(tag="placeName", description="a geographical place name"),
    ],
)

def my_call_fn(prompt: str) -> str:
    ...  # any LLM: Anthropic, OpenAI, Gemini, Ollama, …

endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=my_call_fn,
)

result = annotate(
    text="Marie Curie was born in Warsaw and later worked in Paris.",
    schema=schema,
    endpoint=endpoint,
    gliner_model=None,   # pass e.g. "numind/NuNER_Zero" to enable pre-detection
)
print(result.xml)
# <persName>Marie Curie</persName> was born in <placeName>Warsaw</placeName>
# and later worked in <placeName>Paris</placeName>.

For provider setup examples (Anthropic, OpenAI, Gemini, Ollama, vLLM) see tei_annotator/inference/README.md.


Built-in providers

Five connectors live in tei_annotator/providers/, enabled by setting the corresponding env var:

Provider Env var ID
HuggingFace Inference Router HF_TOKEN hf
Google Gemini GEMINI_API_KEY gemini
KISSKI academic cloud KISSKI_API_KEY kisski
OpenAI OPENAI_API_KEY openai
Anthropic Claude ANTHROPIC_API_KEY claude

Adding a new provider: create a module in tei_annotator/providers/, subclass Connector, add an instance to _ALL_CONNECTORS in __init__.py. See tei_annotator/providers/README.md.


Built-in schemas

Two annotation schemas are registered in tei_annotator/schemas/registry.py:

Key Task
bibl Tag internal fields of a bibliographic reference (author, title, date, …)
bibl-reference-segmenter Segment a reference list into <bibl> spans with optional <label>

Each schema ships with at least one gold-standard corpus file in data/corpus/<schema>.default.tei.xml used by the evaluator and webservice.

Adding a new schema: register it in SCHEMA_REGISTRY. See tei_annotator/schemas/README.md.


Evaluation and iterative improvement

scripts/evaluate_llm.py runs any available provider against a gold-standard TEI file:

# quick run: 5 records, gemini, bibl-reference-segmenter schema
uv run scripts/evaluate_llm.py \
    --provider gemini --schema bibl-reference-segmenter --max-items 5 --verbose

# all available providers, all records, output to file
uv run scripts/evaluate_llm.py --schema bibl --output-file results.txt

Key flags: --provider, --model, --schema, --gold-file, --max-items, --batch-size, --match-mode, --verbose, --grep, --shuffle.

scripts/collect_hard_examples.py builds a gold fixture of challenging examples by evaluating items in mini-batches and retaining those the model handles poorly:

# collect 30 hard bibl-reference-segmenter examples using KISSKI gemma-4-31b-it
uv run scripts/collect_hard_examples.py \
    --provider kisski --model gemma-4-31b-it \
    --limit 30 --batch-size 10 --f1-threshold 0.95 \
    --output data/hard-bibl-refseg-gemma.tei.xml

Key flags: --schema, --provider, --model, --limit, --batch-size, --f1-threshold, --max-per-batch, --context, --shuffle.

For the iterative schema-improvement workflow see docs/tei-element-descriptions.md. For metrics details see tei_annotator/evaluation/README.md.


Debugging annotation output

scripts/debug_annotation.py runs the annotation pipeline step-by-step on a single text snippet and prints every intermediate result — prompt, raw LLM response, parsed spans, resolver output, validation rejections, and the final XML. Useful for diagnosing why a particular record is annotated incorrectly.

# Annotate a text snippet (default: gemini / gemini-2.5-flash, bibl schema)
uv run scripts/debug_annotation.py --text "Bugnon (A.-L.), Le mobilier céramique, in Méloche 2012, p. 182-196."

# Read text from a file or stdin
uv run scripts/debug_annotation.py --file path/to/snippet.txt
echo "Curie 1911..." | uv run scripts/debug_annotation.py

# Different provider / model / schema
uv run scripts/debug_annotation.py --text "..." \
    --provider kisski --model Qwen3-235B-A22B \
    --schema bibl-reference-segmenter

# Print the full LLM prompt (suppressed by default — ~7 KB)
uv run scripts/debug_annotation.py --text "..." --show-prompt

The output walks through each pipeline stage with counts and rejection reasons:

STEP 1  Strip existing XML tags       (reports any tags stripped from input)
STEP 2  Chunking                      (chunk count, offsets)
STEP 3  Chunk N/N
  Prompt: 7648 chars                  (truncated preview; --show-prompt for full)
  Raw LLM response                    (full JSON as returned)
  Parsed spans: 8                     (SpanDescriptors with text + context)
  Resolved spans: 8/8                 (char offsets; rejected with reason)
  Validated spans: 8/8               (schema rejects with reason)
STEP 4  Deduplication & merge         (overlapping spans from chunking)
STEP 5  inject_xml
FINAL OUTPUT                          (annotated XML)

Demo and webservice


Testing

# Unit tests (fully mocked, < 0.5 s)
uv run pytest

# Integration tests (no model download needed)
uv run pytest --override-ini="addopts=" -m integration \
    tests/integration/test_pipeline_e2e.py -k "not real_gliner"

# Integration tests with real GLiNER model (~400 MB on first run)
uv run pytest --override-ini="addopts=" -m integration \
    tests/integration/test_gliner_detector.py \
    tests/integration/test_pipeline_e2e.py::test_pipeline_with_real_gliner