Evaluate Your Own RAG: Why Best Practices Failed Us
We're Jimmy, a nuclear engineering company building France's first Small Modular Reactor (SMR). Our engineers need to search through thousands of complex technical documents: nuclear research papers, regulatory materials, multilingual scientific PDFs packed with equations, diagrams, and specialized terminology. Manual search wasn't cutting it—we needed a RAG system that could actually retrieve the right documents.
When we built it, we did what every good engineering team does: we followed the best practices. Context-aware chunking. Hybrid search. Careful chunk size optimization. Models from the MTEB leaderboard.
We benchmarked 156 queries across three languages. Nearly every "best practice" was wrong for our use case.
The results surprised us: naive chunking outperformed context-aware (70.5% vs 63.8%). Chunk size barely mattered. Hybrid retrieval lost to dense-only (69.2% vs 63.5%). The winning embedding model? Not even on the MTEB leaderboard.
This article shares our complete journey: benchmarking methodology, surprising performance findings, and hard-won lessons from deploying RAG in production. From choosing between Mistral OCR and open-source alternatives, to discovering why AWS OpenSearch cost us $70/day for mediocre results, to optimizing our infrastructure—we're sharing what actually worked (and what didn't).
The only way to know what works for you is to measure. Here's how we did it.
TL;DR (Too Long; Didn't Read)
Key Findings:
- AWS Titan V2 embeddings performed best across all metrics (69.2% hit rate vs 57.7% Qwen 8B, 39.1% Mistral embed)
- Chunk size doesn't really matter - no statistically significant difference between 2K and 40K characters for document-level retrieval
- Naive chunking outperformed context-aware strategy - simpler is better (70.5% vs 63.8% avg hit rate)
- Mistral OCR is unmatched for complex scientific PDFs (expensive but worth it)
- Don't use AWS OpenSearch for vector search - severely overpriced ($70/day minimum)
- Qdrant is great for vector DB - easy to use, solid performance, hybrid search, easy to self-host, great managed service
- Dense-only search beat hybrid in our tests - dense gave better results than hybrid (69.2% vs 63.5% hit rate for Titan)
Remember, these results are specific to our use case (scientific documents in English, French and Japanese).
RAG Overview
Retrieval-Augmented Generation (RAG) combines large language models (LLMs) with external knowledge retrieval. Instead of relying solely on the LLM's training data, RAG fetches relevant documents and provides them as context for generation.
Core components:
- PDF Converter - Extract text from documents
- Chunking Strategy - Split documents into retrievable pieces
- Embedding Model - Convert text to vector representations
- Vector Database - Store and search embeddings
- Retrieval Mode - Dense, sparse, or hybrid search
- LLM - Generate responses from retrieved context
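To make the data flow concrete, here is a minimal sketch of how these pieces chain together at query time, written against generic LangChain interfaces rather than our production wiring (the prompt wording and helper name are illustrative):
# Minimal RAG flow: retrieve relevant chunks, then generate from that context.
# Sketch only - assumes an already-populated vector store and any chat model.
from langchain_core.language_models import BaseChatModel
from langchain_core.vectorstores import VectorStore

def answer_question(question: str, store: VectorStore, llm: BaseChatModel, k: int = 10) -> str:
    docs = store.similarity_search(question, k=k)            # retrieval (dense here)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content                        # generation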
Our Benchmark Methodology
Setup
Our benchmark uses real documents from our production database:
- PDFs converted using Mistral OCR - complex scientific documents with equations, tables, and diagrams
- Stored in Qdrant - our vector database of choice
- Production configurations - same setup we use in our actual RAG system
Test Dataset
We created test questions covering different difficulty levels:
- Simple retrieval: Direct extractions from text (e.g., "Uncertainty in the specific heat capacity of helium")
- Complex questions: Conceptual understanding (e.g., "Synchronization in a coupled fluid–structure system?")
Our test dataset consists of 156 unique test queries per embedding model:
- 3 languages: English, French, Japanese
- 2 query forms: Interrogative ("What is...?") and affirmative statements ("Characteristics of...")
Here's an example from our test set:
MetaQuestion(
    source_document="Jaiman et al. - 2023 - Mechanics of Flow-Induced Vibration Physical Mode.pdf",
    related_questions=[
        # Affirmatives
        Question(
            text="Cylinder vibrations induced by fluid flow",
            language=Language.ENGLISH,
            form=Form.affirmative,
        ),
        Question(
            text="Vibrations des cylindres induites par l’écoulement des fluides",
            language=Language.FRENCH,
            form=Form.affirmative,
        ),
        Question(
            text="流体の流れによって誘発される円筒の振動",
            language=Language.JAPANESE,
            form=Form.affirmative,
        ),
        # Interrogatives
        Question(
            text="What can you tell me about cylinder vibrations induced by fluid flow?",
            language=Language.ENGLISH,
            form=Form.interrogative,
        ),
        Question(
            text="Que peux-tu me dire sur les vibrations des cylindres induites par l'écoulement des fluides ?",
            language=Language.FRENCH,
            form=Form.interrogative,
        ),
        Question(
            text="流体の流れによって誘発される円筒の振動について教えてください。",
            language=Language.JAPANESE,
            form=Form.interrogative,
        ),
    ],
)
Each query has a known ground truth - we know which document should be retrieved.
Metrics
Important Note on Our Retrieval Goal:
Our primary objective was to retrieve the right document, not necessarily the exact paragraph or section. Once we have the correct document, our LLM can extract the specific information needed. This is a critical distinction that influenced our benchmark design and conclusions.
We measured retrieval performance using:
- Top-10 Recall (Hit Rate): % of queries where correct document appears in top-10 results
- MRR (Mean Reciprocal Rank): Average of 1/rank of the correct document (higher is better), multiplied by 100 for normalization
- Top-1 Recall: % of queries where correct document is in top-1
Since we only care about document-level retrieval, not chunk-level precision, this has important implications for chunk size optimization (see findings below).
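Concretely, all three metrics can be computed from the ranked list of source documents returned for each query. A minimal sketch (the results structure is illustrative, not our benchmark code):
# Document-level retrieval metrics: top-10 hit rate, MRR (x100), top-1 recall.
# `results` maps each query to (expected_document, ranked_documents).
def compute_metrics(results: dict[str, tuple[str, list[str]]], k: int = 10) -> dict[str, float]:
    hits = top1 = rr_sum = 0.0
    for expected, ranked in results.values():
        if expected in ranked[:k]:
            hits += 1
        if ranked and ranked[0] == expected:
            top1 += 1
        if expected in ranked:
            rr_sum += 1 / (ranked.index(expected) + 1)  # reciprocal rank
    n = len(results)
    return {
        "hit_rate@10": 100 * hits / n,
        "mrr": 100 * rr_sum / n,   # multiplied by 100, as reported in the results below
        "top1_recall": 100 * top1 / n,
    }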
Chunking Strategies
We tested two approaches:
Naive chunking: Simple character-based splitting with overlap using LangChain's RecursiveCharacterTextSplitter. Chunks are created by splitting at natural boundaries (paragraphs, sentences) without understanding document structure.
Context-aware chunking: Parses markdown structure (headings, sections) into a hierarchical tree. Each chunk includes its parent section headings as context. For example, a chunk from "Section 2.3 → Subsection 2.3.1" includes both heading levels, preserving document structure.
Test Configurations
We benchmarked different combinations:
- Chunking strategies: Naive vs context-aware
- Chunk sizes: 2K, 4K, 6K characters for Titan V2; 2K for Mistral Embed; 2K, 10K, 40K characters for Qwen 8B (taking advantage of its larger 32K token context window)
- Embedding models: AWS Titan V2, Qwen 8B, Mistral
- Retrieval modes: Dense-only, hybrid (dense + sparse), sparse-only
This gave us multiple configurations to systematically compare across all dimensions.
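As a rough sketch, the benchmark is just a loop over the configuration grid (names and values below mirror the list above; this is illustrative, not our exact code):
# Build the configuration grid: chunking strategy x chunk size x retrieval mode,
# with chunk sizes depending on the embedding model.
strategies = ["naive", "context_aware"]
chunk_sizes = {"titan_v2": [2_000, 4_000, 6_000], "mistral_embed": [2_000], "qwen_8b": [2_000, 10_000, 40_000]}
retrieval_modes = ["dense", "hybrid", "sparse"]

configs = [
    (strategy, model, size, mode)
    for strategy in strategies
    for model, sizes in chunk_sizes.items()
    for size in sizes
    for mode in retrieval_modes
]
# Each configuration is indexed once, then evaluated against the same 156 queries.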
Results & Key Learnings
1. Embedding Models: AWS Titan Wins (Surprisingly!)
We compared the three embedding models using naive chunking with a chunk size of 2,000 characters and an overlap of 400.
Model Selection Criteria:
We specifically wanted API-accessible embedding models to avoid managing model inference infrastructure. This led us to test:
- AWS Titan V2 (amazon.titan-embed-text-v2:0): Accessed via AWS Bedrock
- Qwen 8B (Alibaba-NLP/gte-Qwen2-7B-instruct): Accessed via Hugging Face Inference API (provider: Nebius AI)
- Mistral Embed (mistral-embed): Accessed via Mistral AI API
All three provide simple REST API access, which was essential for our production requirements.
Results:
- AWS Titan V2: 69.2% hit rate
- Qwen 8B: 57.7% hit rate
- Mistral: 39.1% hit rate
The MTEB leaderboard surprise:
If you're choosing an embedding model, you'd naturally check the MTEB (Massive Text Embedding Benchmark) leaderboard on Hugging Face. It's the standard benchmark for comparing embedding models across various tasks.
Here's the surprising part: AWS Titan V2 isn't even on the MTEB leaderboard. Yet it outperformed both Qwen 8B and Mistral (which ARE on the leaderboard) for our use case.
Why Titan wins for us:
- Cheaper than the alternatives, with higher rate limits
- Better multilingual performance (critical for our EN/FR/JA documents)
- More robust to scientific terminology and technical jargon
The trap of traditional benchmarks:
Most embedding benchmarks test on English-only affirmative queries. Here's what happens when we limit our analysis to just those conditions:
Critical Insight: Under "traditional" benchmark conditions (English affirmative questions only), Mistral performs almost on par with Titan (76.9% vs 80.8% hit rate). This looks promising! However, when we tested across all languages and query forms, Mistral's performance dropped significantly (39.1% overall vs 69.2% for Titan).
The key difference: consistency. Titan maintains strong performance across English, French, Japanese, and both query forms. Mistral excels in narrow conditions but lacks robustness across diverse real-world queries.
This is precisely why we chose Titan over Mistral despite Mistral's competitive performance under ideal conditions. Production RAG systems need models that work consistently across varied query patterns, not just those that excel in controlled benchmarks.
Lesson learned: When evaluating embedding models, test them under diverse conditions that match your production use case. A model that dominates English-only benchmarks may struggle with multilingual content or different query formulations. Look for consistency, not just peak performance.
2. Chunk Size: Don't Overthink It
We tested various chunk sizes: 2K, 4K, 6K characters with Titan V2 and 2K, 10K, 40K characters with Qwen 8B.
Results: Chunk size had minimal impact on performance. The variation between different chunk sizes was negligible—all configurations performed within a few percentage points of each other.
Why chunk size doesn't matter much for us:
Remember, our goal is document-level retrieval, not finding the exact paragraph. As long as any chunk from the correct document ranks highly, we succeed. Larger chunks still contain the relevant content, just with more surrounding context.
Practical recommendation: Don't over-optimize this
Unlike typical RAG advice that emphasizes careful chunk size tuning, our data shows chunk size is simply not a critical parameter. Here's what we found:
- No performance penalty with larger chunks - 2K performed similarly to 40K
- Larger chunks = Lower costs - Fewer chunks means:
- Inference provider: fewer overlap tokens (cheaper, and less chance of hitting rate limits) and faster embedding of the full corpus - see the quick arithmetic below
- Vector DB: less storage, faster indexing, and fewer similarity comparisons at query time
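The overlap overhead is easy to quantify: with our fixed 400-character overlap, the fraction of text embedded twice is roughly overlap / chunk_size.
# Back-of-the-envelope: share of each document re-embedded due to overlap.
overlap = 400
for chunk_size in (2_000, 6_000, 40_000):
    print(f"{chunk_size:>6} chars/chunk -> ~{overlap / chunk_size:.0%} of text embedded twice")
# 2_000 -> ~20%, 6_000 -> ~7%, 40_000 -> ~1%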
What we use:
- Titan V2: 6K characters (we hit errors beyond this, even though the model claims 8K token support - unclear why)
- Qwen 8B: 40K characters (works fine with its 32K token context window)
Bottom line: We switched from 2K to larger chunks for cost efficiency, but honestly, any reasonable size will work. Don't spend days optimizing this—focus on embedding model selection and retrieval mode instead.
3. Chunking Strategy: Naive Beats Context-Aware
We tested naive chunking (simple character-based splitting) vs context-aware chunking (respecting document structure like sections and paragraphs).
Results: Naive chunking outperformed context-aware chunking for Titan embeddings (71.8% vs 67.9% hit rate at best, 70.5% vs 63.8% average across chunk sizes with dense-only).
Interpretation:
- For technical documents, strict structural boundaries (like section breaks) can split related content
- Naive chunking with overlap captured context sufficiently
- Context-aware chunking adds complexity without clear benefit in our case
Recommendation: Start simple with naive chunking. Only use context-aware if you have specific structure-dependent requirements.
4. Dense vs Hybrid Search in Qdrant (Surprising Result)
Understanding Retrieval Modes:
Vector databases like Qdrant support different retrieval approaches:
Dense search: Uses semantic embeddings (like Titan or Qwen) to find documents based on meaning. Converts queries and documents into high-dimensional vectors and finds similar vectors. Great for conceptual matches but can miss exact keyword requirements.
Sparse search: Uses keyword-based matching (similar to BM25 or TF-IDF). Qdrant implements this via FastEmbed's sparse embeddings (we used prithvida/Splade_PP_en_v1). Excellent for exact term matches but misses semantic similarity.
Hybrid search: Combines both approaches—semantic understanding from dense embeddings + keyword precision from sparse embeddings. The results are merged using score fusion. Conventional wisdom says this should always be better than either approach alone.
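For reference, this is roughly how the two modes are selected at query time with langchain-qdrant, assuming a collection that already stores both dense and sparse vectors (collection name, URL and key below are placeholders):
# Query the same Qdrant collection in dense-only vs hybrid mode (sketch).
from langchain_aws import BedrockEmbeddings
from langchain_qdrant import FastEmbedSparse, QdrantVectorStore, RetrievalMode

dense = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
sparse = FastEmbedSparse(model_name="prithvida/Splade_PP_en_v1")

def search(query: str, mode: RetrievalMode) -> list[str]:
    store = QdrantVectorStore.from_existing_collection(
        collection_name="scientific-papers",   # placeholder
        url="https://your-qdrant-host:6333",   # placeholder
        api_key="...",                          # placeholder
        embedding=dense,
        sparse_embedding=sparse,
        retrieval_mode=mode,                    # RetrievalMode.DENSE or RetrievalMode.HYBRID
    )
    return [doc.page_content for doc in store.similarity_search(query, k=10)]

dense_hits = search("cylinder vibrations induced by fluid flow", RetrievalMode.DENSE)
hybrid_hits = search("cylinder vibrations induced by fluid flow", RetrievalMode.HYBRID)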
Our Findings:
Hybrid search is actually why we chose Qdrant in the first place, yet our data shows the opposite of the conventional wisdom for our use case: dense-only achieved a 69.2% hit rate vs 63.5% for hybrid (using Titan embeddings with 2K-character naive chunking).
Why this might happen:
- For scientific documents with technical terminology, dense embeddings alone captured semantic meaning effectively
- Sparse search can introduce noise from keyword matches that lack semantic context
- Your mileage may vary - this is specific to our document types and Qdrant's FastEmbed implementation
Important context: We chose Qdrant specifically for its hybrid search capabilities. While dense-only performed better in our benchmarks, hybrid remains valuable for:
- Exact keyword matching requirements
- Regulatory/compliance searches where specific terms matter
- Fallback when dense embeddings struggle with rare terms
Recommendation: Benchmark both modes on your specific corpus. Don't assume hybrid is always better.
5. Multilingual Performance
Our documents span English, French, and Japanese. We needed robust cross-lingual retrieval.
Results:
- English: 73.1% hit rate
- French: 48.7% hit rate
- Japanese: 44.2% hit rate
English significantly outperformed French and Japanese. This suggests our RAG system works best with English content, though French and Japanese still achieved reasonable retrieval rates. The multilingual support of Titan embeddings was validated, even if performance varied by language.
Model Comparison Across Languages:
The graph below compares all three embedding models (Titan, Qwen, and Mistral) across the three languages we tested:
Titan consistently outperformed other models across all languages, with particularly strong performance in English. The gap between models was most pronounced in French and Japanese, where Titan's multilingual capabilities showed clear advantages.
Model Comparison Across Query Forms:
We also analyzed performance across different query forms (interrogative vs affirmative):
Interestingly, Titan and Mistral perform better with affirmative statements, as theory would predict (since most information in text is presented in affirmative form). Qwen, however, performs better with interrogative queries, which really doesn't make sense to us.
Practical Decisions (Non-Benchmarked)
PDF Conversion: Mistral OCR
Decision: We use Mistral OCR for all document processing.
Evaluation: We manually tested 3 complex PDFs (equations, tables, diagrams, scanned pages) with Mistral OCR and several open-source alternatives.
Findings: Only Mistral OCR correctly parsed complex mathematical notation and tables in scanned documents. The output quality difference was dramatic; it even handled chemical equations.
Why Markdown Conversion is Critical:
Converting all documents (PDFs, Word files, etc.) to markdown isn't just about performance—it's absolutely essential for building a debuggable RAG system:
- Debuggability: When retrieval fails or returns wrong results, you need to inspect what was actually indexed. With raw PDFs, you're blind. With markdown, you can open the file, search for the expected content, and understand exactly what your chunking strategy did to it.
- Performance: Markdown is lightweight and fast to process. Chunking markdown text is orders of magnitude faster than re-parsing PDFs on every embedding update.
- Reproducibility: Markdown gives you a stable, version-controllable representation of your documents. You can track changes, compare versions, and ensure consistency across your pipeline.
- Iteration speed: Testing different chunking strategies on markdown takes seconds, not minutes. This makes experimentation practical.
Bottom line: Without markdown as an intermediate format, you can't effectively debug or iterate on your RAG system. It's not optional—it's foundational.
Trade-off: Mistral OCR is expensive ($1 per 1,000 pages). For scientific documents where accuracy is critical, it's worth it. If you have simpler PDFs, try open-source alternatives first.
Vector Database: Qdrant
Decision: We use Qdrant (managed service).
Context: We evaluated Milvus, Qdrant, AWS OpenSearch, Pinecone, and PostgreSQL with pgvector. Being AWS-native, we initially tried OpenSearch.
Critical finding: Don't use AWS OpenSearch for vector search. The cheapest option is $70/day (~$2,100/month) for a single-node cluster. This is severely overpriced for what it offers.
Why Qdrant:
- Native hybrid search support with FastEmbed (our original motivation)
- Easy Docker setup for self-hosting
- Later migrated to managed Qdrant service to avoid maintenance
- Solid performance and features
- Reasonable pricing on Qdrant Cloud
Why not Qdrant:
- Admin in Qdrant Cloud lacks basic features, like transferring ownership
- Collections sometimes stay in "gray mode" for reasons unclear to us (newbies that we are); manually triggering "start optimization" fixes it, but it's still weird
Why not the other alternatives:
- PostgreSQL with pgvector seemed overkill since we don't already run PostgreSQL (or AWS Aurora), and we were afraid of ending up overpaying for a managed service, as we did with OpenSearch.
- Pinecone is not open source.
- Milvus vs Qdrant: these are the two most-starred open-source vector databases. We had to pick one, so a teammate tried deploying both on a self-hosted server; Milvus took longer to get running, so Qdrant won. We acknowledge this is not the best reason.
Conclusion
Building a production RAG system requires balancing performance, cost, and complexity. Here's our recommended starting point:
Recommended Configuration:
- Embedding Model: AWS Titan V2 (69.2% hit rate - best for multilingual scientific content)
- Chunk Size: Don't overthink it - any reasonable size works (we use 6K for Titan, 40K for Qwen for cost efficiency)
- Chunking Strategy: Naive chunking (70.5% avg hit rate vs 63.8% for context-aware - simpler and better)
- Retrieval Mode: Dense-only for our use case (69.2% vs 63.5% for hybrid with Titan)
- Vector DB: Qdrant or Milvus (avoid AWS OpenSearch due to cost)
- PDF Conversion: Mistral OCR for complex scientific documents (expensive but necessary)
Key Takeaway: Don't blindly follow "best practices" from blog posts. Benchmark on your specific document types and query patterns. Our findings contradicted common wisdom (dense-only beating hybrid, naive beating context-aware), but they were reproducible and significant for our use case. What worked for us may not work for you. The only way to know is to measure.
Note on Data Availability
Our benchmark code and methodology are detailed in this article for reproducibility. However, the scientific documents we used are proprietary and closed-source (nuclear engineering research and regulatory materials). The source code can't be shared either, as it is part of our monorepo.
While we can't share the raw data, we're happy to answer questions about our methodology, testing approach, or specific findings. Feel free to reach out if you're implementing something similar!
Questions or feedback? We'd love to hear about your RAG experiences, especially if you found different results!
Implementation Details
Here are key code snippets from our implementation that you might find useful; they are quite basic:
Chunking Strategies
Naive Chunking - Simple character-based splitting with overlap using LangChain:
# src/pyjimmy/rag/chunk.py:32-34
def split_markdown(s3_markdown_path: Path, max_chunk_size: int, strategy: ChunkingStrategy) -> list[str]:
    markdown_text = s3_markdown_path.read_text()  # markdown produced by the OCR step
    if strategy == ChunkingStrategy.naive:
        return RecursiveCharacterTextSplitter(
            chunk_size=max_chunk_size,
            chunk_overlap=400,
            add_start_index=True,
        ).split_text(markdown_text)
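For instance, one converted document can be split like this (the path is a placeholder):
# Split one OCR-converted markdown file into 2K-character chunks with 400-character overlap.
chunks = split_markdown(Path("converted/paper.md"), max_chunk_size=2000, strategy=ChunkingStrategy.naive)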
Context-Aware Chunking - Preserves document structure by parsing markdown hierarchically:
# src/pyjimmy/rag/chunk.py:82-99
class Section:
    def __init__(self, body: str = "", title: str | None = None, level: int = 0):
        self.body: str = body
        self.title: str | None = title
        self.level: int = level
        self.children: list[Section] = []

    @classmethod
    def from_markdown(cls, markdown_text: str) -> Section:
        """Parse markdown into hierarchical sections based on heading levels."""
        lines = markdown_text.split("\n")
        root = cls()
        stack = [root]
        for line in lines:
            if line.startswith("#"):
                level = len(line) - len(line.lstrip("#"))
                title = line.lstrip("#").strip()
                new_section = cls(title=title, level=level)
                while stack and stack[-1].level >= level:
                    stack.pop()
                stack[-1]._add_child(new_section)
                stack.append(new_section)
            else:
                if stack:
                    stack[-1].body += line + "\n"
        return root

    def to_chunks(self, max_chunk_size: int, context: str = "") -> list[str]:
        # current_context, _add_child and _split_text_into_chunks are small helpers omitted from this excerpt
        context += self.current_context
        chunks = []
        if self.body.replace("\n", "").strip():
            if len(context) >= max_chunk_size:
                logger.warning(
                    f"Context too large ({len(context)} chars) for max_chunk_size {max_chunk_size}. Skipping context."
                )
                context = ""
            chunks += [
                context + splitted_body
                for splitted_body in _split_text_into_chunks(self.body, max_chunk_size - len(context))
            ]
        for child in self.children:
            chunks += child.to_chunks(max_chunk_size, context)
        return chunks
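Usage on a toy document looks roughly like this (since current_context, _add_child and _split_text_into_chunks are omitted above, treat the exact chunk text as a sketch):
# Toy usage: each chunk carries its parent heading path as context.
toy_doc = "# Results\n## Vibration tests\nThe cylinder oscillated at 12 Hz.\n"
chunks = Section.from_markdown(toy_doc).to_chunks(max_chunk_size=2_000)
# Expected: one chunk combining the "# Results / ## Vibration tests" context with the body sentence.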
PDF Processing with Mistral OCR
Converting PDFs to Markdown - Handling complex scientific documents:
# src/pyjimmy/rag/pdf.py:22-45
def convert_pdf_to_markdown(pdf_path: Path, output_path: Path) -> tuple[Path, Path]:
    logger.info(f"Start converting PDF {pdf_path} to markdown...")
    # Split large PDFs to stay under 50MB API limit
    pdfs_under_50mb = _split_pdf_to_under_50mb_improved(pdf_path, output_path)
    final_markdown = ""
    final_pages = []
    for pdf in pdfs_under_50mb:
        ocr_response = _convert_pdf_under_50MB_to_markdown(pdf)
        markdown = _get_combined_markdown(ocr_response=ocr_response)
        pages = ocr_response.model_dump()["pages"]
        final_markdown += markdown
        final_pages.extend(pages)
    markdown_path = output_path / f"{pdf_path.stem}.md"
    markdown_path.write_text(final_markdown)
    # yaml_path (per-page metadata built from final_pages) is written in the full code; omitted from this excerpt
    return markdown_path, yaml_path
Mistral OCR API Call:
# src/pyjimmy/rag/pdf.py:175-192
def _convert_pdf_under_50MB_to_markdown(pdf_path: Path) -> OCRResponse:
    uploaded_pdf = MISTRAL_CLIENT.files.upload(
        file=File(
            file_name=pdf_path.name,
            content=pdf_path.read_bytes(),
            content_type="application/pdf",
        ),
        purpose="ocr",
    )
    signed_url = MISTRAL_CLIENT.files.get_signed_url(file_id=uploaded_pdf.id)
    return MISTRAL_CLIENT.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": signed_url.url},
        include_image_base64=True,
    )
Embedding Models Setup
Configuring Different Embedding Models:
# src/pyjimmy/rag/embedding.py:33-58
def get_dense_embedding_model(model: DenseEmbeddingModel) -> Embeddings:
    if model == DenseEmbeddingModel.titan:
        # AWS Titan V2: 8,192 max tokens, 50K max characters
        return BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
    if model == DenseEmbeddingModel.qwen:
        # Qwen 8B: 32K token context window; `token` is the Hugging Face API key, loaded elsewhere
        return HuggingFaceEndpointEmbeddings(
            client=InferenceClient(provider="nebius", api_key=token),
            model="Qwen/Qwen3-Embedding-8B",
        )
    if model == DenseEmbeddingModel.mistral:
        return MistralAIEmbeddings(
            mistral_api_key=os.environ["PYJIMMY_MISTRAL_API_KEY"],
            model="mistral-embed",
        )

# Sparse embeddings for hybrid search
SPARSE_EMBEDDING_MODEL = FastEmbedSparse(model_name="prithvida/Splade_PP_en_v1")
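Usage then goes through the standard LangChain embeddings interface, for example:
# Embed a query and a batch of chunks with the selected model.
embedder = get_dense_embedding_model(DenseEmbeddingModel.titan)
query_vector = embedder.embed_query("Cylinder vibrations induced by fluid flow")
chunk_vectors = embedder.embed_documents(["first chunk of text...", "second chunk of text..."])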
Qdrant Vector Store Configuration
Creating Collections with Hybrid Search:
# src/pyjimmy/rag/qdrant.py:27-40
def create_vector_store(collection_name: str, model: DenseEmbeddingModel):
    logger.info(f"Creating Qdrant collection {collection_name}...")
    QdrantVectorStore.from_texts(
        texts=[],
        url=QDRANT_URL,
        api_key=QDRANT_API_KEY,
        collection_name=collection_name,
        embedding=get_dense_embedding_model(model),
        sparse_embedding=SPARSE_EMBEDDING_MODEL,  # For hybrid search
        retrieval_mode=RetrievalMode.HYBRID,
    )
Adding Documents with Retry Logic:
# src/pyjimmy/rag/qdrant.py:64-72, 78-104
@retry(wait=wait_fixed(60))
def _add_texts_to_vector_store(
    vector_store: QdrantVectorStore,
    texts: list[str],
    s3_folder: Path,
) -> list[str]:
    """Retry on failure due to embedding API rate limits."""
    return vector_store.add_texts(
        texts,
        metadatas=[{"s3_folder": s3_folder} for _ in texts],
    )

def add_markdown_to_qdrant(
    s3_markdown_path: Path,
    max_chunk_size: int,
    chunking_strategy: ChunkingStrategy,
    model: DenseEmbeddingModel,
):
    # collection_name and s3_folder are derived from s3_markdown_path in the full code (omitted here)
    chunks = split_markdown(s3_markdown_path, max_chunk_size, strategy=chunking_strategy)
    vector_store = get_vector_store(collection_name, model=model)
    vector_ids = []
    max_chunks_one_request = 10
    # Batch processing to avoid timeouts
    for start_index in range(0, len(chunks), max_chunks_one_request):
        vector_ids.extend(
            _add_texts_to_vector_store(
                vector_store,
                chunks[start_index : start_index + max_chunks_one_request],
                s3_folder=s3_folder,
            )
        )
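Once the chunks are indexed, benchmark retrieval is a single search per query. A minimal sketch (collection_name is the same elided configuration as above; the s3_folder metadata maps each chunk back to its source document):
# Retrieve the top-10 chunks for a query and map them back to their source documents.
def retrieve_source_documents(query: str, model: DenseEmbeddingModel, k: int = 10) -> list[str]:
    vector_store = get_vector_store(collection_name, model=model)
    results = vector_store.similarity_search(query, k=k)
    # Document-level hit rate only needs the originating document of each chunk
    return [str(doc.metadata["s3_folder"]) for doc in results]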











