---
title: TEI Annotator
emoji: 📝
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Annotate plain text with TEI XML tags using an LLM backend
---

A Python library for annotating plain text with [TEI XML](https://tei-c.org/) tags using a two-stage LLM pipeline.

1. **(Optional) GLiNER pre-detection**: fast CPU-based span labelling generates candidates for the LLM to verify and extend.
2. **LLM annotation**: a prompted language model identifies entities and returns structured spans (element, verbatim text, surrounding context, attributes).
3. **Deterministic post-processing**: spans are resolved to character offsets, validated against the schema, and injected as XML tags. The source text is **never modified** by any model call.

---

## Pipeline stages

```text
Input text
  │
  ▼  strip existing XML tags
  ▼  (optional) GLiNER pre-detection  ──→  tei_annotator/detection/
  ▼  chunk text                       ──→  tei_annotator/chunking/
  ▼  build LLM prompt                 ──→  tei_annotator/prompting/
  ▼  LLM inference                    ──→  tei_annotator/inference/
  ▼  parse JSON response              ──→  tei_annotator/postprocessing/
  ▼  resolve spans → char offsets
  ▼  validate against schema
  ▼  inject XML tags
  │
  ▼
Annotated XML output
```

Stage documentation: [Data models](tei_annotator/models/README.md) · [GLiNER detection](tei_annotator/detection/README.md) · [Chunking](tei_annotator/chunking/README.md) · [Prompt building](tei_annotator/prompting/README.md) · [Inference configuration](tei_annotator/inference/README.md) · [Post-processing](tei_annotator/postprocessing/README.md) · [Evaluation](tei_annotator/evaluation/README.md)

---

> **Disclaimer:** The code in this repository was generated by [Claude](https://claude.ai) (Anthropic) based on prompts and direction provided by [@cboulanger](https://github.com/cboulanger).

---

## Installation

Requires Python ≥ 3.12 and [uv](https://docs.astral.sh/uv/).

```bash
git clone <repo-url>     # placeholder: substitute the repository URL
cd tei-annotator
uv sync                  # runtime deps: jinja2, lxml, rapidfuzz
uv sync --extra gliner   # also installs gliner for optional pre-detection
```

API keys for LLM endpoints go in `.env` (copy from `.env.template`).

---

## Quick start

```python
from tei_annotator import annotate, TEISchema, TEIElement, TEIAttribute
from tei_annotator import EndpointConfig, EndpointCapability

schema = TEISchema(
    rules=[
        "Emit a 'surname' span within every enclosing 'persName' span.",
    ],
    elements=[
        TEIElement(
            tag="persName",
            description="a person's name",
            attributes=[TEIAttribute(name="ref", description="authority URI")],
        ),
        TEIElement(tag="placeName", description="a geographical place name"),
    ],
)


def my_call_fn(prompt: str) -> str:
    ...  # any LLM: Anthropic, OpenAI, Gemini, Ollama, …


endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=my_call_fn,
)

result = annotate(
    text="Marie Curie was born in Warsaw and later worked in Paris.",
    schema=schema,
    endpoint=endpoint,
    gliner_model=None,  # pass e.g. "numind/NuNER_Zero" to enable pre-detection
)

print(result.xml)
# Illustrative output; the exact tags depend on the model's response:
# <persName>Marie Curie</persName> was born in <placeName>Warsaw</placeName>
# and later worked in <placeName>Paris</placeName>.
```

For provider setup examples (Anthropic, OpenAI, Gemini, Ollama, vLLM) see [tei_annotator/inference/README.md](tei_annotator/inference/README.md).

---

## Built-in providers

Five connectors live in [`tei_annotator/providers/`](tei_annotator/providers/). Each is enabled by setting the corresponding env var:

| Provider | Env var | ID |
| --- | --- | --- |
| HuggingFace Inference Router | `HF_TOKEN` | `hf` |
| Google Gemini | `GEMINI_API_KEY` | `gemini` |
| KISSKI academic cloud | `KISSKI_API_KEY` | `kisski` |
| OpenAI | `OPENAI_API_KEY` | `openai` |
| Anthropic Claude | `ANTHROPIC_API_KEY` | `claude` |

To add a new provider, create a module in `tei_annotator/providers/`, subclass `Connector`, and add an instance to `_ALL_CONNECTORS` in `__init__.py`. See [tei_annotator/providers/README.md](tei_annotator/providers/README.md).


---

## Built-in schemas

Two annotation schemas are registered in [`tei_annotator/schemas/registry.py`](tei_annotator/schemas/registry.py):

| Key | Task |
| --- | --- |
| `bibl` | Tag internal fields of a bibliographic reference (author, title, date, …) |
| `bibl-reference-segmenter` | Segment a reference list into `` spans with optional `