title: TEI Annotator
emoji: 📝
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Annotate plain text with TEI XML tags using an LLM backend
A Python library for annotating plain text with TEI XML tags using a two-stage LLM pipeline.
- (Optional) GLiNER pre-detection — fast CPU-based span labelling generates candidates for the LLM to verify and extend.
- LLM annotation — a prompted language model identifies entities and returns structured spans (element, verbatim text, surrounding context, attributes).
- Deterministic post-processing — spans are resolved to character offsets, validated against the schema, and injected as XML tags. The source text is never modified by any model call.
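The deterministic post-processing step can be illustrated with a minimal sketch. It is not the library's implementation: `Span`, `resolve`, and `inject` are hypothetical names, matching is exact (no fuzzy matching), and overlapping or nested spans are simply skipped. It shows how a model-reported span (element, verbatim text, context) could be located in the source and wrapped in a tag while the source text itself is only copied and XML-escaped.

```python
from __future__ import annotations

from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Span:
    """Hypothetical stand-in for a span returned by the LLM."""
    element: str  # TEI tag name, e.g. "persName"
    text: str     # verbatim text of the entity
    context: str  # surrounding text, used to disambiguate repeated matches


def resolve(source: str, span: Span) -> tuple[int, int] | None:
    """Return (start, end) offsets of the span in source, or None if not found."""
    ctx_at = source.find(span.context)
    if ctx_at == -1:
        return None
    inner = span.context.find(span.text)
    if inner == -1:
        return None
    start = ctx_at + inner
    return start, start + len(span.text)


def inject(source: str, spans: list[Span], allowed: set[str]) -> str:
    """Wrap every resolvable span whose element is in `allowed` in its tag."""
    located = []
    for s in spans:
        if s.element not in allowed:  # schema validation, reduced to a tag check
            continue
        offsets = resolve(source, s)
        if offsets is not None:
            located.append((offsets[0], offsets[1], s.element))
    located.sort()

    out, cursor = [], 0
    for start, end, tag in located:
        if start < cursor:  # this flat sketch skips overlapping spans
            continue
        out.append(escape(source[cursor:start]))
        out.append(f"<{tag}>{escape(source[start:end])}</{tag}>")
        cursor = end
    out.append(escape(source[cursor:]))
    return "".join(out)
```

Because tags are inserted by offset rather than generated by the model, a span that cannot be located or that names an unknown element is dropped instead of corrupting the text.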
Pipeline stages
Input text
│
▼ strip existing XML tags
▼ (optional) GLiNER pre-detection ──→ tei_annotator/detection/
▼ chunk text ──→ tei_annotator/chunking/
▼ build LLM prompt ──→ tei_annotator/prompting/
▼ LLM inference ──→ tei_annotator/inference/
▼ parse JSON response ──→ tei_annotator/postprocessing/
▼ resolve spans → char offsets
▼ validate against schema
▼ inject XML tags
│
▼
Annotated XML output
Stage documentation: Data models · GLiNER detection · Chunking · Prompt building · Inference configuration · Post-processing · Evaluation
Disclaimer: The code in this repository was generated by Claude (Anthropic) based on prompts and direction provided by @cboulanger.
Installation
Requires Python ≥ 3.12 and uv.
git clone <repo>
cd tei-annotator
uv sync # runtime deps: jinja2, lxml, rapidfuzz
uv sync --extra gliner # also installs gliner for optional pre-detection
API keys for LLM endpoints go in .env (copy from .env.template).
Quick start
from tei_annotator import annotate, TEISchema, TEIElement, TEIAttribute
from tei_annotator import EndpointConfig, EndpointCapability

schema = TEISchema(
    rules=[
        "Emit a 'surname' span within every enclosing 'persName' span.",
    ],
    elements=[
        TEIElement(
            tag="persName",
            description="a person's name",
            attributes=[TEIAttribute(name="ref", description="authority URI")],
        ),
        TEIElement(tag="placeName", description="a geographical place name"),
    ],
)

def my_call_fn(prompt: str) -> str:
    ...  # any LLM: Anthropic, OpenAI, Gemini, Ollama, …

endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=my_call_fn,
)

result = annotate(
    text="Marie Curie was born in Warsaw and later worked in Paris.",
    schema=schema,
    endpoint=endpoint,
    gliner_model=None,  # pass e.g. "numind/NuNER_Zero" to enable pre-detection
)
print(result.xml)
# <persName>Marie Curie</persName> was born in <placeName>Warsaw</placeName>
# and later worked in <placeName>Paris</placeName>.
For provider setup examples (Anthropic, OpenAI, Gemini, Ollama, vLLM) see tei_annotator/inference/README.md.
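Since `call_fn` in the quick start is just a function from a prompt string to a response string, it can be wrapped before it is handed to `EndpointConfig`. The following is a minimal sketch of a retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, not part of tei_annotator, and it retries on every exception, which a real setup would probably narrow to transient network errors.

```python
import time
from typing import Callable


def with_retries(
    call_fn: Callable[[str], str],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> Callable[[str], str]:
    """Return a call_fn that retries failed calls with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapped(prompt: str) -> str:
        for attempt in range(attempts):
            try:
                return call_fn(prompt)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
        raise AssertionError("unreachable")

    return wrapped
```

With this helper, the quick-start endpoint would be built with `call_fn=with_retries(my_call_fn)`.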
Built-in providers
Five connectors live in tei_annotator/providers/, enabled by setting the corresponding env var:
| Provider | Env var | ID |
|---|---|---|
| HuggingFace Inference Router | HF_TOKEN | hf |
| Google Gemini | GEMINI_API_KEY | gemini |
| KISSKI academic cloud | KISSKI_API_KEY | kisski |
| OpenAI | OPENAI_API_KEY | openai |
| Anthropic Claude | ANTHROPIC_API_KEY | claude |
Adding a new provider: create a module in tei_annotator/providers/, subclass Connector, and add an instance to _ALL_CONNECTORS in __init__.py. See tei_annotator/providers/README.md.
Built-in schemas
Two annotation schemas are registered in tei_annotator/schemas/registry.py:
| Key | Task |
|---|---|
| bibl | Tag internal fields of a bibliographic reference (author, title, date, …) |
| bibl-reference-segmenter | Segment a reference list into <bibl> spans with optional <label> |
Each schema ships with at least one gold-standard corpus file at data/corpus/<schema>.default.tei.xml, which is used by the evaluator and the webservice.
Adding a new schema: register it in SCHEMA_REGISTRY. See tei_annotator/schemas/README.md.
Evaluation and iterative improvement
scripts/evaluate_llm.py runs any available provider against a gold-standard TEI file:
# quick run: 5 records, gemini, bibl-reference-segmenter schema
uv run scripts/evaluate_llm.py \
--provider gemini --schema bibl-reference-segmenter --max-items 5 --verbose
# all available providers, all records, output to file
uv run scripts/evaluate_llm.py --schema bibl --output-file results.txt
Key flags: --provider, --model, --schema, --gold-file, --max-items, --batch-size, --match-mode, --verbose, --grep, --shuffle.
scripts/collect_hard_examples.py builds a gold fixture of challenging examples by evaluating items in mini-batches and retaining those the model handles poorly:
# collect 30 hard bibl-reference-segmenter examples using KISSKI gemma-4-31b-it
uv run scripts/collect_hard_examples.py \
--provider kisski --model gemma-4-31b-it \
--limit 30 --batch-size 10 --f1-threshold 0.95 \
--output data/hard-bibl-refseg-gemma.tei.xml
Key flags: --schema, --provider, --model, --limit, --batch-size, --f1-threshold, --max-per-batch, --context, --shuffle.
For the iterative schema-improvement workflow see docs/tei-element-descriptions.md. For metrics details see tei_annotator/evaluation/README.md.
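To make the `--f1-threshold` flag concrete, here is a minimal sketch of span-level precision, recall, and F1 under exact matching. It is an illustration only: `span_f1` is a hypothetical function, spans are modelled as `(tag, start, end)` tuples, and the evaluator's real match modes (see `--match-mode`) may score partial matches differently.

```python
from __future__ import annotations

SpanKey = tuple[str, int, int]  # (tag, start offset, end offset)


def span_f1(gold: set[SpanKey], predicted: set[SpanKey]) -> tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact tag+offset matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("author", 0, 6), ("title", 8, 30), ("date", 32, 36)}
predicted = {("author", 0, 6), ("title", 8, 29), ("date", 32, 36)}  # title ends one char early

precision, recall, f1 = span_f1(gold, predicted)
is_hard_example = f1 < 0.95  # under this sketch, the record would be retained as "hard"
```

In this example two of the three predicted spans match exactly, so precision, recall, and F1 are all 2/3, which is below a threshold of 0.95.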
Debugging annotation output
scripts/debug_annotation.py runs the annotation pipeline step-by-step on a
single text snippet and prints every intermediate result — prompt, raw LLM
response, parsed spans, resolver output, validation rejections, and the final XML.
Useful for diagnosing why a particular record is annotated incorrectly.
# Annotate a text snippet (default: gemini / gemini-2.5-flash, bibl schema)
uv run scripts/debug_annotation.py --text "Bugnon (A.-L.), Le mobilier céramique, in Méloche 2012, p. 182-196."
# Read text from a file or stdin
uv run scripts/debug_annotation.py --file path/to/snippet.txt
echo "Curie 1911..." | uv run scripts/debug_annotation.py
# Different provider / model / schema
uv run scripts/debug_annotation.py --text "..." \
--provider kisski --model Qwen3-235B-A22B \
--schema bibl-reference-segmenter
# Print the full LLM prompt (suppressed by default — ~7 KB)
uv run scripts/debug_annotation.py --text "..." --show-prompt
The output walks through each pipeline stage with counts and rejection reasons:
STEP 1 Strip existing XML tags (reports any tags stripped from input)
STEP 2 Chunking (chunk count, offsets)
STEP 3 Chunk N/N
  Prompt: 7648 chars (truncated preview; --show-prompt for full)
  Raw LLM response (full JSON as returned)
  Parsed spans: 8 (SpanDescriptors with text + context)
  Resolved spans: 8/8 (char offsets; rejected with reason)
  Validated spans: 8/8 (schema rejects with reason)
STEP 4 Deduplication & merge (overlapping spans from chunking)
STEP 5 inject_xml
FINAL OUTPUT (annotated XML)
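STEP 2 and STEP 4 are related: when chunks overlap, the same entity can be reported by two chunks and has to be collapsed afterwards. The sketch below illustrates that idea under simplifying assumptions. `chunk` and `merge` are hypothetical functions, chunks are fixed-size character windows, and spans are `(tag, start, end)` tuples; the library's own chunker and merge logic may work differently.

```python
from __future__ import annotations

SpanKey = tuple[str, int, int]  # (tag, start offset, end offset)


def chunk(text: str, size: int, overlap: int) -> list[tuple[int, str]]:
    """Split text into windows of `size` chars sharing `overlap` chars.

    Returns (global_offset, chunk_text) pairs.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break
        start += step
    return chunks


def merge(per_chunk: list[tuple[int, list[SpanKey]]]) -> list[SpanKey]:
    """Shift chunk-local spans to global offsets and drop exact duplicates."""
    unique: set[SpanKey] = set()
    for offset, spans in per_chunk:
        for tag, start, end in spans:
            unique.add((tag, start + offset, end + offset))
    return sorted(unique, key=lambda s: (s[1], s[2], s[0]))
```

A span found at local offsets 4 to 6 in the first chunk and at 0 to 2 in a second chunk starting at offset 4 maps to the same global offsets and is kept only once.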
Demo and webservice
- HuggingFace demo: https://huggingface.co/spaces/cmboulanger/tei-annotator
- Webservice: webservice/ provides a FastAPI JSON API and a browser UI, with all five providers. See webservice/README.md.
Testing
# Unit tests (fully mocked, < 0.5 s)
uv run pytest
# Integration tests (no model download needed)
uv run pytest --override-ini="addopts=" -m integration \
tests/integration/test_pipeline_e2e.py -k "not real_gliner"
# Integration tests with real GLiNER model (~400 MB on first run)
uv run pytest --override-ini="addopts=" -m integration \
tests/integration/test_gliner_detector.py \
tests/integration/test_pipeline_e2e.py::test_pipeline_with_real_gliner