✂️ OpenProvence: An Open-Source Implementation of Efficient and Robust Context Pruning for Retrieval-Augmented Generation
⚡️ Lightweight Provence-style rerankers that keep the answers and drop the noise for retrieval-augmented generation.
OpenProvence follows the Provence approach to simultaneously prune irrelevant passages and produce a reranking score for question-answering workflows. Modern agents—DeepResearch loops, autonomous search pipelines, context engineering systems—tend to accumulate tangential paragraphs that inflate LLM token budgets. Drop an OpenProvence checkpoint in front of your LLM to extract only the passages that matter.
We provide open weights along with MIT-licensed training, inference, and dataset-construction tooling for reproducible workflows on commodity hardware.
Highlights
- Pruning power – Drop ~99% of off-topic sentences while still compressing 80–90% of relevant text; MLDR evaluations confirm the answers stay intact.
- Ship-ready checkpoints – Four bilingual models (30M–310M parameters) on Hugging Face under MIT; the 30M xsmall runs comfortably on CPU and screams on GPU.
- Reproducible training – Follow the training guide to train every checkpoint on a single ≥16 GB NVIDIA GPU.
- Dataset tooling – Build OpenProvence-format corpora from your own data with the dataset creation guide.
- Evaluation utilities – CLI runners for dataset retention sweeps and MLDR long-document benchmarks keep regression tracking straightforward.
- Documentation-first – End-to-end reports, guides, and configs cover training, evaluation, and dataset creation.
- Teacher model – A multilingual span annotator, query-context-pruner-multilingual-Qwen3-4B, powers custom label pipelines.
Model Line-up
Pick the checkpoint that matches your latency and language targets. All checkpoints are hosted on Hugging Face with permissive licensing.
| Model | Language | Hugging Face ID | Parameters | Notes |
|---|---|---|---|---|
| base | English & Japanese | hotchpotch/open-provence-reranker-v1 | 130M | Balanced accuracy vs. speed for bilingual workloads |
| xsmall | English & Japanese | hotchpotch/open-provence-reranker-xsmall-v1 | 30M | Fastest option; practical even without a GPU |
| large | English & Japanese | hotchpotch/open-provence-reranker-large-v1 | 310M | Highest compression at comparable F2 scores |
| en-gte | English | hotchpotch/open-provence-reranker-v1-gte-modernbert-base | 149M | English-only checkpoint with top reranking fidelity |
Quickstart
Install
uv pip install transformers torch tokenizers sentencepiece protobuf
uv pip install fast-bunkai nltk
uv run python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
For CUDA hosts, you can optionally install flash-attention for faster inference.
Python Example
from transformers import AutoModel

# Swap to the checkpoint you want to run
model_name = "hotchpotch/open-provence-reranker-xsmall-v1"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,  # load the custom model code bundled with the checkpoint
)

question: str = "What's your favorite Japanese food?"
context: str = """
Work deadlines piled up today, and I kept rambling about budget spreadsheets to my roommate.
Next spring I'm planning a trip to Japan so I can wander Kyoto's markets and taste every regional dish I find.
Sushi is honestly my favourite—I want to grab a counter seat and let the chef serve endless nigiri until I'm smiling through soy sauce.
Later I remembered to water the plants and pay the electricity bill before finally getting some sleep.
"""

# Prune the context against the question and score its relevance in one call
result = model.process(
    question=question,
    context=context,
    threshold=0.1,
    show_progress=True,
)

print("Pruned context:\n" + result["pruned_context"])
print("Reranking score:", round(result["reranking_score"], 4))
print("Compression rate:", round(result["compression_rate"], 2))
OpenProvence checkpoints expose a single process method that accepts raw question/context strings, applies sentence-level pruning, and returns the reranking score alongside compression metrics.
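As a rough sketch of how the pruned output might feed an LLM call, the helper below wraps process() and formats a prompt. It relies only on the question, context, and threshold arguments and the pruned_context and reranking_score keys shown above; the build_prompt name, the min_score gate, and the prompt template are hypothetical and not part of OpenProvence.

```python
from typing import Any, Optional


def build_prompt(
    model: Any,
    question: str,
    context: str,
    threshold: float = 0.1,
    min_score: Optional[float] = None,
) -> Optional[str]:
    """Prune `context` for `question` and format a prompt (hypothetical helper).

    `model` is any object exposing the `process` method described above.
    Returns None when nothing useful survives pruning.
    """
    result = model.process(
        question=question,
        context=context,
        threshold=threshold,
    )
    # Hypothetical gate: skip passages the reranker head scores below min_score.
    if min_score is not None and result["reranking_score"] < min_score:
        return None
    pruned = result["pruned_context"].strip()
    if not pruned:
        return None
    # Hypothetical prompt template; adapt it to your LLM.
    return f"Context:\n{pruned}\n\nQuestion: {question}\nAnswer:"
```

With the quickstart model loaded, build_prompt(model, question, context) would return a prompt containing only the sentences that survived pruning.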
process() accepts either a single query/context pair or batched inputs. Use question: str with context: str for one document, question: str with context: List[str] to prune multiple documents for the same query, or question: List[str] and context: List[str] to batch independent pairs. To feed pre-segmented sentences, pass context: List[List[str]]; each inner list is treated as already split and the built-in splitter is skipped.
⚠️ Shape matters: A lone string paired with a list of contexts is interpreted as one query with many documents. Make sure question and context have matching shapes when batching to avoid truncated or duplicated outputs.
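To make the shape rules concrete, here is a small pre-flight check you could run before calling process(). It is a hypothetical helper, not part of OpenProvence, and it covers only the three flat shapes described above; nested lists of pre-split sentences are deliberately left out.

```python
from typing import Sequence, Union


def describe_inputs(
    question: Union[str, Sequence[str]],
    context: Union[str, Sequence[str]],
) -> str:
    """Report how the documented shape rules would read a call (hypothetical helper)."""
    if isinstance(question, str):
        if isinstance(context, str):
            return "one query, one document"
        # A lone string with a list of contexts: one query, many documents.
        return f"one query, {len(context)} documents"
    # A list of questions must be paired item by item with a list of contexts.
    if isinstance(context, str):
        raise ValueError("a list of questions needs a list of contexts")
    if len(question) != len(context):
        raise ValueError(
            f"{len(question)} questions but {len(context)} contexts; "
            "batched pairs must have matching lengths"
        )
    return f"{len(question)} independent pairs"
```

Raising on a length mismatch up front is a design choice for this sketch: it surfaces the problem before any inference runs instead of leaving you to spot truncated or duplicated outputs afterwards.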
Key parameters you may want to tune:
- question: str | Sequence[str] – Query text. Provide a list to batch multiple questions; each item pairs with the corresponding entry in context.
- context: str | Sequence[str] | Sequence[Sequence[str]] – Contexts aligned to the query. Use a list for one document per query, or a list of lists to supply multiple documents (or pre-split sentences) for each query.
- title: str | Sequence[str] | Sequence[Sequence[str]] | None – Optional titles aligned to each context. The default sentinel "first_sentence" marks the opening sentence so you can keep it by pairing with always_select_title=True or first_line_as_title=True; without those flags it is scored like any other sentence. Set None to disable all title handling.
- threshold: float (default 0.1) – Pruning probability cutoff. Larger values discard more sentences; 0.05–0.5 works well across datasets.
- batch_size: int (default 32) – Number of contexts processed per inference batch. Increase for throughput, decrease if you run out of memory.
- language: str | None – Built-in splitter selection ("auto", "ja", "en"). The default behaves like "auto" and detects Japanese vs. English automatically.
- reorder: bool and top_k: int | None – When reorder=True, contexts are sorted by reranker score. Combine with top_k to keep only the top-ranked documents.
- first_line_as_title: bool / always_select_title: bool – Extract the first non-empty line as a title and optionally guarantee the title sentence survives pruning.
- return_sentence_metrics: bool / return_sentence_texts: bool – Include per-sentence probabilities and kept/removed sentence lists in the output for analysis workflows.
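The effect of threshold can be illustrated without loading a model. The sketch below takes per-sentence keep probabilities as plain floats and applies a cutoff. It makes several assumptions: a sentence is kept when its probability is at least the threshold, kept sentences are joined with a space, and the compression rate is the percentage of characters removed. The checkpoint's own aggregation and metric definitions may differ.

```python
from typing import List, Tuple


def prune_by_threshold(
    sentences: List[str],
    keep_probs: List[float],
    threshold: float = 0.1,
) -> Tuple[str, float]:
    """Keep sentences whose probability reaches `threshold` (illustrative only)."""
    if len(sentences) != len(keep_probs):
        raise ValueError("sentences and keep_probs must have the same length")
    # Assumption: inclusive cutoff, so a larger threshold discards more sentences.
    kept = [s for s, p in zip(sentences, keep_probs) if p >= threshold]
    original_len = sum(len(s) for s in sentences)
    kept_len = sum(len(s) for s in kept)
    # Assumption: compression is the percentage of characters removed.
    if original_len == 0:
        compression = 0.0
    else:
        compression = 100.0 * (original_len - kept_len) / original_len
    return " ".join(kept), compression


sentences = ["Budget talk.", "Sushi is my favourite.", "I watered the plants."]
probs = [0.02, 0.91, 0.04]
pruned, rate = prune_by_threshold(sentences, probs, threshold=0.1)
print(pruned)
```

Raising the threshold in this sketch can only remove sentences, never add them, which matches the documented behaviour that larger values discard more text.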
Evaluation Summary
Detailed metrics live in the OpenProvence v1 Evaluation Report. Highlights below show MLDR at each model's best Has Answer threshold, plus cross-dataset means at the standard 0.10 threshold:
MLDR with LLM eval · English (best Has Answer per model)
Threshold (p) lists the pruning probability that produced the highest Has Answer score for each checkpoint.
| Model | Params | Threshold (p) | Has Answer (%) | Compression (pos) | Compression (neg) |
|---|---|---|---|---|---|
| none (original) | - | - | 93.68 | 0.00% | 0.00% |
| en-gte | 149M | 0.10 | 94.25 | 92.33% | 99.91% |
| xsmall | 30M | 0.05 | 93.68 | 82.18% | 99.18% |
| base | 130M | 0.05 | 93.68 | 90.05% | 99.62% |
| large | 310M | 0.10 | 93.10 | 94.38% | 99.90% |
| naver-provence | 305M | 0.10 | 93.10 | 94.00% | 99.50% |
Highlights: en-gte surpasses the original (no compression) baseline in Has Answer score while compressing positive samples by over 92%. The large model (310M) performs comparably to the naver/provence baseline (305M) at a similar parameter count. The smaller models (xsmall and base) match the original baseline's Has Answer score while still compressing substantially, though at slightly lower rates than the larger models.
MLDR with LLM eval · Japanese (best Has Answer per model)
Threshold (p) again shows the probability cutoff that maximised Has Answer for each model.
| Model | Params | Threshold (p) | Has Answer (%) | Compression (pos) | Compression (neg) |
|---|---|---|---|---|---|
| none (original) | - | - | 77.71 | 0.00% | 0.00% |
| xsmall | 30M | 0.05 | 81.93 | 76.46% | 96.11% |
| base | 130M | 0.05 | 83.13 | 80.98% | 97.89% |
| large | 310M | 0.10 | 79.52 | 87.89% | 98.82% |
Highlights: All three models outperform the original baseline on Japanese MLDR, by 1.81 to 5.42 points. base delivers the top Has Answer score (+5.42 points over original) while retaining strong compression. Even large, which prioritizes maximum compression (nearly 88% positive, 99% negative), exceeds the original baseline by 1.81 points.
Dataset Benchmarks · English (threshold = 0.10)
| Model | Mean F2 | Mean Compression | Mean Inference Time (s) |
|---|---|---|---|
| en-gte | 0.734 | 39.9% | 0.55 |
| xsmall | 0.696 | 33.8% | 0.34 |
| base | 0.737 | 39.9% | 0.69 |
| large | 0.749 | 41.7% | 1.04 |
Highlights: en-gte is the top English reranker at this threshold, while large gives the best compression with a modest latency bump. xsmall remains the latency leader.
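For readers unfamiliar with the F2 metric in the Mean F2 columns, the sketch below computes the standard F-beta score with beta = 2, which weights recall more heavily than precision. How precision and recall are counted for pruning is defined in the evaluation report, not here, so treat this as the textbook formula only.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta=2 favours recall over precision."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1.0 + beta_sq) * precision * recall / denominator


# High recall with modest precision scores better than the reverse.
print(round(f_beta(precision=0.5, recall=1.0), 4))
print(round(f_beta(precision=1.0, recall=0.5), 4))
```

A recall-weighted score suits context pruning because dropping a sentence that contains the answer usually costs more than keeping an extra irrelevant one.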
Dataset Benchmarks · Japanese (threshold = 0.10)
| Model | Mean F2 | Mean Compression | Mean Inference Time (s) |
|---|---|---|---|
| xsmall | 0.727 | 53.2% | 0.32 |
| base | 0.768 | 57.4% | 1.06 |
| large | 0.783 | 59.1% | 1.69 |
Highlights: base and large deliver the strongest F2 on Japanese corpora, with large leading on compression. xsmall stays nimble for CPU-centric deployments.
Training Data
OpenProvence v1 checkpoints are distilled from multilingual QA corpora that were re-labeled with the Qwen3-4B teacher. English coverage spans hotchpotch/msmarco-context-relevance, hotchpotch/gooaq-context-relevance-130k, and hotchpotch/natural-questions-context-relevance. Japanese coverage comes from hotchpotch/japanese-context-relevance, which includes MS MARCO JA and native QA sources. All datasets expose sentence-span keep/drop labels plus teacher reranker scores, so you can reproduce or extend the mixture for your own domains.
Training Recipe
This model family was trained with the open-source OpenProvence stack and is reproducible on a single ≥16 GB NVIDIA GPU.
Teacher Label Generation (DeepSeek-V3)
Use DeepSeek-V3 to annotate question/context relevance, producing the multilingual 140k-sample dataset qa-context-relevance-multilingual-140k.
Teacher Context-Relevance SFT (Qwen3-4B)
Fine-tune Qwen3-4B to build the multilingual teacher query-context-pruner-multilingual-Qwen3-4B, enabling fast, consistent span-level annotations.
Context-Relevance Dataset Construction
Generate sentence-span labels and teacher scores from the following corpora:
English:
- hotchpotch/msmarco-context-relevance
- hotchpotch/gooaq-context-relevance-130k
- hotchpotch/natural-questions-context-relevance
Japanese:
- hotchpotch/japanese-context-relevance
Deduplicate near-identical negatives and follow the dataset creation guide for preprocessing tips.
Final Model Training
Distill existing reranker scores into a unified model that combines the cross-encoder reranker head with the context-pruning head. Reference docs/train.md for configuration details and baseline commands.
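This section does not spell out the training objective, and docs/train.md remains the authoritative reference. As a loose illustration of training two heads on one encoder, the sketch below combines a squared-error term on the reranking score with a binary cross-entropy term on keep/drop labels. The loss choices, the pruning_weight parameter, and all names are assumptions, not the OpenProvence implementation.

```python
import math
from typing import Sequence


def joint_loss(
    pred_score: float,
    teacher_score: float,
    keep_probs: Sequence[float],
    keep_labels: Sequence[int],
    pruning_weight: float = 1.0,
) -> float:
    """Weighted sum of a reranking loss and a pruning loss (illustrative only)."""
    if len(keep_probs) != len(keep_labels):
        raise ValueError("keep_probs and keep_labels must have the same length")
    # Reranker head: regress the student's score onto the teacher's score.
    rerank_loss = (pred_score - teacher_score) ** 2
    # Pruning head: mean binary cross-entropy over keep (1) / drop (0) labels.
    eps = 1e-7
    total = 0.0
    for p, y in zip(keep_probs, keep_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    prune_loss = total / len(keep_probs) if keep_probs else 0.0
    return rerank_loss + pruning_weight * prune_loss
```

In a real training loop both terms would be computed on tensors by the training framework; plain Python floats are used here only to keep the arithmetic visible.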
License
MIT License
Acknowledgements
- Provence: efficient and robust context pruning for retrieval-augmented generation inspired the overall approach. Huge thanks to the Naver Labs Europe authors for releasing both the paper and the naver/provence-reranker-debertav3-v1 checkpoint that validated how powerful Provence-style pruning can be.
- Sentence Transformers provided invaluable reference implementations for cross-encoder training that informed our pipelines.
Citation
If you use OpenProvence in your research, please cite it:
@misc{yuichi-tateno-2025-open-provence,
url = {https://github.com/hotchpotch/open_provence},
title = {OpenProvence: An Open-Source Implementation of Efficient and Robust Context Pruning for Retrieval-Augmented Generation},
author = {Yuichi Tateno},
year = {2025}
}
Authors
- Yuichi Tateno (@hotchpotch)
Base model for hotchpotch/open-provence-reranker-xsmall-v1: sbintuitions/modernbert-ja-30m