✂️ OpenProvence: An Open-Source Implementation of Efficient and Robust Context Pruning for Retrieval-Augmented Generation

⚡️ Lightweight Provence-style rerankers that keep the answers and drop the noise for retrieval-augmented generation.

OpenProvence follows the Provence approach to simultaneously prune irrelevant passages and produce a reranking score for question-answering workflows. Modern agents—DeepResearch loops, autonomous search pipelines, context engineering systems—tend to accumulate tangential paragraphs that inflate LLM token budgets. Drop an OpenProvence checkpoint in front of your LLM to extract only the passages that matter.

We provide open weights along with MIT-licensed training, inference, and dataset-construction tooling for reproducible workflows on commodity hardware.

Highlights

Pruning power – Drop roughly 96–99% of off-topic text while still compressing about 76–94% of relevant text; in the MLDR evaluations below, Has Answer scores stay within 0.6 points of the unpruned baseline or exceed it.
Ship-ready checkpoints – Four bilingual models (30M–310M parameters) on Hugging Face under MIT; the 30M xsmall runs comfortably on CPU and screams on GPU.
Reproducible training – Follow the training guide to train every checkpoint on a single ≥16 GB NVIDIA GPU.
Dataset tooling – Build OpenProvence-format corpora from your own data with the dataset creation guide.
Evaluation utilities – CLI runners for dataset retention sweeps and MLDR long-document benchmarks keep regression tracking straightforward.
Documentation-first – End-to-end reports, guides, and configs cover training, evaluation, and dataset creation.
Teacher model – A multilingual span annotator, query-context-pruner-multilingual-Qwen3-4B, powers custom label pipelines.

Model Line-up

Pick the checkpoint that matches your latency and language targets. All checkpoints are hosted on Hugging Face with permissive licensing.

Model	Language	Hugging Face ID	Parameters	Notes
base	English & Japanese	hotchpotch/open-provence-reranker-v1	130M	Balanced accuracy vs. speed for bilingual workloads
xsmall	English & Japanese	hotchpotch/open-provence-reranker-xsmall-v1	30M	Fastest option; practical even without a GPU
large	English & Japanese	hotchpotch/open-provence-reranker-large-v1	310M	Highest compression at comparable F2 scores
en-gte	English	hotchpotch/open-provence-reranker-v1-gte-modernbert-base	149M	English-only checkpoint with top reranking fidelity

Quickstart

Install

uv pip install transformers torch tokenizers sentencepiece protobuf
uv pip install fast-bunkai nltk
uv run python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

For CUDA hosts, you can optionally install flash-attention for faster inference.

Python Example

from transformers import AutoModel

# Swap to the checkpoint you want to run
model_name = "hotchpotch/open-provence-reranker-xsmall-v1"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
)

question: str = "What's your favorite Japanese food?"
context: str = """
Work deadlines piled up today, and I kept rambling about budget spreadsheets to my roommate.
Next spring I'm planning a trip to Japan so I can wander Kyoto's markets and taste every regional dish I find.
Sushi is honestly my favourite—I want to grab a counter seat and let the chef serve endless nigiri until I'm smiling through soy sauce.
Later I remembered to water the plants and pay the electricity bill before finally getting some sleep.
"""

result = model.process(
    question=question,
    context=context,
    threshold=0.1,
    show_progress=True,
)

print("Pruned context:\n" + result["pruned_context"])
print("Reranking score:", round(result["reranking_score"], 4))
print("Compression rate:", round(result["compression_rate"], 2))

OpenProvence checkpoints expose a single process method that accepts raw question/context strings, applies sentence-level pruning, and returns the reranking score alongside compression metrics.

process() accepts either a single query/context pair or batched inputs. Use question: str with context: str for one document, question: str with context: List[str] to prune multiple documents for the same query, or question: List[str] and context: List[str] to batch independent pairs. To feed pre-segmented sentences, pass context: List[List[str]]; each inner list is treated as already split and the built-in splitter is skipped.

⚠️ Shape matters: A lone string paired with a list of contexts is interpreted as one query with many documents. Make sure question and context have matching shapes when batching to avoid truncated or duplicated outputs.
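The pairing rules above can be checked before calling process(). The sketch below is a hypothetical pre-flight helper (describe_call is not part of OpenProvence) that only mirrors the shapes described in this section; it does not reproduce the library's internal dispatch.

```python
from typing import Sequence, Union

Question = Union[str, Sequence[str]]
Context = Union[str, Sequence[str], Sequence[Sequence[str]]]


def describe_call(question: Question, context: Context) -> str:
    """Hypothetical pre-flight check; not part of the OpenProvence API."""
    if isinstance(context, str):
        if not isinstance(question, str):
            raise ValueError("a list of questions needs a list of contexts")
        return "single pair"
    # Inner lists mean the sentences are already split.
    pre_split = any(not isinstance(item, str) for item in context)
    suffix = " (pre-split sentences)" if pre_split else ""
    if isinstance(question, str):
        return "one query, many documents" + suffix
    if len(question) != len(context):
        raise ValueError("question and context must have the same length")
    return "batched pairs" + suffix


print(describe_call("q", "doc"))                  # single pair
print(describe_call("q", ["doc a", "doc b"]))     # one query, many documents
print(describe_call(["q1", "q2"], ["d1", "d2"]))  # batched pairs
```

Running a check like this first turns a silent shape mismatch into an explicit error instead of truncated or duplicated outputs.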

Key parameters you may want to tune:

question: str | Sequence[str] – Query text. Provide a list to batch multiple questions; each item pairs with the corresponding entry in context.
context: str | Sequence[str] | Sequence[Sequence[str]] – Contexts aligned to the query. Use a list for one document per query, or a list of lists to supply multiple documents (or pre-split sentences) for each query.
title: str | Sequence[str] | Sequence[Sequence[str]] | None – Optional titles aligned to each context. The default sentinel "first_sentence" marks the opening sentence so you can keep it by pairing with always_select_title=True or first_line_as_title=True; without those flags it is scored like any other sentence. Set title=None to disable all title handling.
threshold: float (default 0.1) – Pruning probability cutoff. Larger values discard more sentences; 0.05–0.5 works well across datasets.
batch_size: int (default 32) – Number of contexts processed per inference batch. Increase for throughput, decrease if you run out of memory.
language: str | None – Built-in splitter selection ("auto", "ja", "en"). The default behaves like "auto" and detects Japanese vs. English automatically.
reorder: bool and top_k: int | None – When reorder=True, contexts are sorted by reranker score. Combine with top_k to keep only the top-ranked documents.
first_line_as_title: bool / always_select_title: bool – Extract the first non-empty line as a title and optionally guarantee the title sentence survives pruning.
return_sentence_metrics: bool / return_sentence_texts: bool – Include per-sentence probabilities and kept/removed sentence lists in the output for analysis workflows.
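To make threshold, reorder, and top_k concrete, here is a minimal sketch of the selection logic they describe. It is not the OpenProvence implementation: the helper names are hypothetical, the probabilities and scores are made-up inputs, the inclusive cutoff comparison is an assumption, and the character-based compression figure is only one possible definition.

```python
from typing import List, Optional, Tuple


def prune_sentences(sentences: List[str], keep_probs: List[float],
                    threshold: float = 0.1) -> Tuple[str, float]:
    """Keep sentences whose probability reaches the threshold (assumed >=)."""
    kept = [s for s, p in zip(sentences, keep_probs) if p >= threshold]
    pruned = " ".join(kept)
    original_len = sum(len(s) for s in sentences)
    kept_len = sum(len(s) for s in kept)
    # Assumed definition: percentage of characters removed.
    compression = 100.0 * (1 - kept_len / original_len) if original_len else 0.0
    return pruned, compression


def rank_documents(scores: List[float], top_k: Optional[int] = None) -> List[int]:
    """Return document indices sorted by descending score, cut to top_k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order if top_k is None else order[:top_k]


sentences = ["Budget talk.", "Sushi is my favourite.", "Water the plants."]
probs = [0.02, 0.91, 0.04]  # illustrative values, not model output
text, rate = prune_sentences(sentences, probs, threshold=0.1)
print(text)                                       # Sushi is my favourite.
print(rank_documents([0.2, 0.9, 0.5], top_k=2))   # [1, 2]
```

Raising threshold in this sketch removes more sentences and so increases the compression figure, which matches the direction described for the real parameter.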

Evaluation Summary

Detailed metrics live in the OpenProvence v1 Evaluation Report. Highlights below show MLDR at each model's best Has Answer threshold, plus cross-dataset means at the standard 0.10 threshold:

MLDR with LLM eval · English (best Has Answer per model)

Threshold (p) lists the pruning probability that produced the highest Has Answer score for each checkpoint.

Model	Params	Threshold (p)	Has Answer (%)	Compression (pos)	Compression (neg)
none (original)	-	-	93.68	0.00%	0.00%
en-gte	149M	0.10	94.25	92.33%	99.91%
xsmall	30M	0.05	93.68	82.18%	99.18%
base	130M	0.05	93.68	90.05%	99.62%
large	310M	0.10	93.10	94.38%	99.90%
naver-provence	305M	0.10	93.10	94.00%	99.50%

Highlights: en-gte surpasses the original (no compression) baseline in Has Answer score while achieving over 92% compression on positive samples. The large model (310M) achieves performance comparable to the naver/provence baseline (305M) with similar parameter counts. Smaller models (xsmall and base) match the original baseline's accuracy with substantial compression benefits, though with slightly lower compression rates compared to the larger models.

MLDR with LLM eval · Japanese (best Has Answer per model)

Threshold (p) again shows the probability cutoff that maximised Has Answer for each model.

Model	Params	Threshold (p)	Has Answer (%)	Compression (pos)	Compression (neg)
none (original)	-	-	77.71	0.00%	0.00%
xsmall	30M	0.05	81.93	76.46%	96.11%
base	130M	0.05	83.13	80.98%	97.89%
large	310M	0.10	79.52	87.89%	98.82%

Highlights: All three models outperform the original baseline on Japanese MLDR, by 1.81 to 5.42 points. base delivers the top Has Answer score (+5.42 points over original) while retaining strong compression. Even large, which prioritizes maximum compression (nearly 88% positive, 99% negative), exceeds the original baseline by 1.81 points.
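The point gains quoted above follow directly from the table. The snippet below simply redoes that subtraction with the Has Answer values copied from the Japanese MLDR table; it adds no new measurements.

```python
# Has Answer (%) values copied from the Japanese MLDR table above.
baseline = 77.71
has_answer = {"xsmall": 81.93, "base": 83.13, "large": 79.52}

# Gain over the unpruned baseline, in percentage points.
gains = {name: round(score - baseline, 2) for name, score in has_answer.items()}
best = max(has_answer, key=has_answer.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
print("best:", best)
```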

Dataset Benchmarks · English (threshold = 0.10)

Model	Mean F2	Mean Compression	Mean Inference Time (s)
en-gte	0.734	39.9%	0.55
xsmall	0.696	33.8%	0.34
base	0.737	39.9%	0.69
large	0.749	41.7%	1.04

Highlights: large posts the best mean F2 and compression at this threshold with a modest latency bump. en-gte matches base on compression with lower latency, and xsmall remains the latency leader.
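Mean F2 in the table above is an F-beta score with beta = 2, which weights recall more heavily than precision. That bias suits pruning, where dropping an answer-bearing sentence usually costs more than keeping a redundant one. A minimal sketch of the standard formula follows; how the evaluation scripts count true and false positives is not specified here, so the sentence-level counts in the example are an assumption.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Standard F-beta from counts; beta=2 favours recall over precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts (not from the report): 8 relevant sentences kept,
# 4 irrelevant sentences kept, 2 relevant sentences dropped.
print(round(f_beta(tp=8, fp=4, fn=2), 3))  # 0.769
```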

Dataset Benchmarks · Japanese (threshold = 0.10)

Model	Mean F2	Mean Compression	Mean Inference Time (s)
xsmall	0.727	53.2%	0.32
base	0.768	57.4%	1.06
large	0.783	59.1%	1.69

Highlights: base and large deliver the strongest F2 on Japanese corpora, with large leading on compression. xsmall stays nimble for CPU-centric deployments.

Training Data

OpenProvence v1 checkpoints are distilled from multilingual QA corpora that were re-labeled with the Qwen3-4B teacher. English coverage spans hotchpotch/msmarco-context-relevance, hotchpotch/gooaq-context-relevance-130k, and hotchpotch/natural-questions-context-relevance. Japanese coverage comes from hotchpotch/japanese-context-relevance, which includes MS MARCO JA and native QA sources. All datasets expose sentence-span keep/drop labels plus teacher reranker scores, so you can reproduce or extend the mixture for your own domains.

Training Recipe

This model family was trained with the open-source OpenProvence stack and is reproducible on a single ≥16 GB NVIDIA GPU.

1. Teacher Label Generation (DeepSeek-V3) – Use DeepSeek-V3 to annotate question/context relevance, producing the multilingual 140k-sample dataset qa-context-relevance-multilingual-140k.
2. Teacher Context-Relevance SFT (Qwen3-4B) – Fine-tune Qwen3-4B to build the multilingual teacher query-context-pruner-multilingual-Qwen3-4B, enabling fast, consistent span-level annotations.
3. Context-Relevance Dataset Construction – Generate sentence-span labels and teacher scores from the English and Japanese corpora listed under Training Data. Deduplicate near-identical negatives and follow the dataset creation guide for preprocessing tips.
4. Final Model Training – Distill existing reranker scores into a unified model that combines the cross-encoder reranker head with the context-pruning head. See docs/train.md for configuration details and baseline commands.
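As a rough illustration of what training a unified model with two heads can look like, the sketch below adds a regression loss on the teacher's reranking score to a binary cross-entropy loss on keep/drop labels. This is an assumption-laden sketch in plain Python: the choice of losses, the prune_weight factor, and all function names are hypothetical and are not taken from the OpenProvence training code, whose actual configuration is described in docs/train.md.

```python
import math
from typing import List


def rerank_loss(student_score: float, teacher_score: float) -> float:
    """Squared error between student and teacher reranking scores."""
    return (student_score - teacher_score) ** 2


def pruning_loss(keep_probs: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy over keep (1) / drop (0) labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(keep_probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def joint_loss(student_score: float, teacher_score: float,
               keep_probs: List[float], labels: List[int],
               prune_weight: float = 1.0) -> float:
    """Hypothetical combined objective: reranking term plus weighted pruning term."""
    return (rerank_loss(student_score, teacher_score)
            + prune_weight * pruning_loss(keep_probs, labels))


# Illustrative inputs only: the student is close on the score and mostly right on labels.
print(joint_loss(0.7, 0.8, keep_probs=[0.9, 0.2, 0.1], labels=[1, 0, 0]))
```

In this sketch, prune_weight trades off how strongly the pruning labels pull on the shared encoder relative to the reranking score.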

License

MIT License

Acknowledgements

Provence: efficient and robust context pruning for retrieval-augmented generation inspired the overall approach. Huge thanks to the Naver Labs Europe authors for releasing both the paper and the naver/provence-reranker-debertav3-v1 checkpoint that validated how powerful Provence-style pruning can be.
Sentence Transformers provided invaluable reference implementations for cross-encoder training that informed our pipelines.

Citation

If you use OpenProvence in your research, please cite it:

@misc{yuichi-tateno-2025-open-provence,
  url = {https://github.com/hotchpotch/open_provence},
  title = {OpenProvence: An Open-Source Implementation of Efficient and Robust Context Pruning for Retrieval-Augmented Generation},
  author = {Yuichi Tateno},
  year = {2025}
}

Authors

Yuichi Tateno (@hotchpotch)

