Why Your RAG Pipeline Is Underperforming
The four structural pathologies that cause most production RAG systems to underperform — and how to fix them.
Most RAG systems deployed in production suffer from the same structural pathologies, not because teams lack expertise but because tutorials and quickstarts have normalized shortcuts that become traps at scale. In the production RAG pipeline audits we have conducted, the same four problems recur, though their prevalence varies by team and domain.
Naive Chunking: The Primary Culprit
Fixed-size chunking is the original sin of industrial RAG. Splitting a document into 512-token blocks without semantic consideration produces chunks that break reasoning mid-sentence, mid-table, or mid-argument. The retriever then fetches incoherent fragments, and the model generates accordingly.
# Naive configuration: avoid this
chunker = FixedSizeChunker(
    chunk_size=512,
    overlap=50,
)

# Semantic configuration: recommended approach
chunker = SemanticChunker(
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
    min_chunk_size=200,
    max_chunk_size=1000,
    # Respect natural boundaries: paragraphs, sections
    sentence_split_regex=r"(?<=[.!?])\s+",
)

# Hierarchical chunking for structured documents
chunker = HierarchicalChunker(
    parent_chunk_size=2048,  # large context for generation
    child_chunk_size=256,    # fine granularity for retrieval
    overlap_ratio=0.1,
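)

Semantic chunking detects natural thematic breaks, typically by embedding adjacent sentences and splitting where the distance between them exceeds a threshold (the 95th percentile in the configuration above). Hierarchical chunking, popularized by LlamaIndex, maintains two levels of granularity: small chunks for precise retrieval, large chunks for contextual generation.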
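To make the percentile breakpoint idea concrete, here is a minimal sketch in plain Python. It is not the implementation of any particular library: `semantic_chunks`, `cosine_distance`, and `percentile` are hypothetical helpers, and `embed` stands for whatever sentence-embedding function you already use. It ignores the minimum and maximum chunk sizes for brevity.

```python
import math
from typing import Callable, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],  # assumption: your embedding function
    threshold_pct: float = 95.0,
) -> List[str]:
    """Group consecutive sentences, splitting where the topic shifts most."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        # A distance above the cutoff marks a breakpoint before `sentence`.
        if distance > cutoff:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Because the cutoff is relative to the document's own distance distribution, the same setting adapts to both tightly focused and wide-ranging texts, which is the main argument for a percentile over a fixed distance threshold.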
Non-Specialized Embeddings
Using a generalist embedding model for a medical, legal, or technical corpus without first evaluating it on your domain is a design error. These models optimize for general semantic similarity, not domain-specific precision.
The concrete problem: in an API documentation corpus, "POST method" and "agile method" have high cosine similarity with a generalist embedding. A specialized technical-code embedding correctly distinguishes them.
Alternatives by domain:
- Code: voyage-code-2, text-embedding-3-large with fine-tuning
- Medical/legal: models fine-tuned on PubMed, legal corpora
- Multilingual: multilingual-e5-large, LaBSE
- Long documents: jina-embeddings-v2 (8192 token context)
The Absence of Re-ranking
Vector similarity retrieval is a first filter, not a final answer. Top-K results by cosine similarity invariably include noise. Re-ranking is the step that separates amateur RAG systems from production systems.
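A cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 re-scores candidates by reading the query and each document together in a single forward pass. Because the model can attend across both texts, this is generally more accurate than comparing embeddings that were computed separately, at the cost of one inference per candidate.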
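The re-ranking step itself is small. The sketch below keeps the scorer pluggable so it runs without downloading a model: `rerank` and `token_overlap_scorer` are hypothetical names, and the overlap scorer is only a toy stand-in for a real cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates jointly with the query, keep the best."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)  # one score per (query, document) pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def token_overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Toy scorer for illustration only: fraction of query tokens found in the doc."""
    scores = []
    for query, doc in pairs:
        query_tokens = set(query.lower().split())
        doc_tokens = set(doc.lower().split())
        overlap = len(query_tokens & doc_tokens)
        scores.append(overlap / len(query_tokens) if query_tokens else 0.0)
    return scores


candidates = [
    "agile method for sprint planning",
    "the POST method creates a resource",
    "weather report",
]
top = rerank("POST method resource", candidates, token_overlap_scorer, top_n=2)
```

In a real pipeline, the `predict` method of a sentence-transformers `CrossEncoder` takes the same list of (query, document) pairs and could be passed as `score_pairs`. A common pattern is to retrieve a generous first-stage set, for example the top 50, and re-rank down to the handful of chunks that reach the prompt.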
The gain observed in our internal audits: +12 to +18 nDCG@10 points after adding a re-ranker, on certain tested corpora. This figure depends heavily on the domain, chunk quality, and base model — it should not be read as universal.
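For readers who want to reproduce this kind of measurement, nDCG@k can be computed in a few lines. This is a sketch of one common variant (linear gain, log2 discount), not necessarily the exact formula used in the audits mentioned above. It assumes `relevances` holds graded relevance labels for the judged candidates in the order the system ranked them, and that the ideal ordering is taken over those same candidates.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: rel_i / log2(i + 1) for positions i = 1..k."""
    return sum(
        rel / math.log2(position + 2)  # position is 0-based, hence + 2
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Same three candidates before and after a hypothetical re-ranking step.
before = ndcg_at_k([0, 1, 1])
after = ndcg_at_k([1, 1, 0])
```

The function returns a value between 0 and 1. If "points" is read as the score on a 0 to 100 scale, multiply by 100 before comparing runs.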
Hybrid Retrieval: Dense + Sparse
Purely dense (vector) retrieval fails on precise lexical queries: proper nouns, identifiers, technical acronyms. BM25 remains hard to beat for this type of query because it matches exact tokens. The solution is hybrid retrieval with score fusion.
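To show why the sparse side handles identifiers so well, here is a compact Okapi BM25 sketch in plain Python. It is illustrative only: `bm25_rank` and `tokenize` are hypothetical helpers, the whitespace tokenizer is deliberately naive, and the `k1` and `b` defaults are conventional values rather than tuned ones. In production you would normally rely on your search engine's or library's BM25.

```python
import math
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (assumption: good enough for a demo)."""
    return text.lower().split()


def bm25_rank(query: str, documents: Sequence[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return document indices ordered by descending BM25 score."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms (such as identifiers) receive a high IDF weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


docs = [
    "how to handle request errors in the client",
    "error code ERR_CONN_4217 means the upstream timed out",
    "general troubleshooting guide for network errors",
]
ranking = bm25_rank("ERR_CONN_4217", docs)
```

Only the document containing the exact identifier receives a non-zero score, which is precisely the behavior a dense retriever cannot guarantee. The resulting list of indices is a ranking that can be fused with the dense ranking.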
# Reciprocal Rank Fusion: robust method without critical hyperparameter
def reciprocal_rank_fusion(results_lists, k=60):
    scores = {}
    for results in results_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

Actionable Recommendations
- Audit your chunks: calculate size distributions, visualize samples. If you see truncated sentences, migrate to semantic chunking.
- Benchmark your embeddings on a domain-specific test set (50–100 gold query/document pairs) before choosing your model.
- Add a re-ranker as a priority if your pipeline lacks one — it is the fastest gain available.
- Enable hybrid retrieval: run BM25 in parallel with dense retrieval and merge the rankings with RRF. Several frameworks and vector stores (LangChain, LlamaIndex, Weaviate) support hybrid retrieval natively.
- Measure: set up a RAGAS or TruLens benchmark to track faithfulness, answer relevancy, and context precision at each iteration.
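The embedding benchmark recommended above does not need a framework to get started. The following sketch computes recall@k and MRR@k over a small gold set. `evaluate_retriever` is a hypothetical helper, and `retrieve` stands for your own function returning ranked document IDs for a query.

```python
from typing import Callable, Dict, Sequence, Tuple


def evaluate_retriever(
    gold_pairs: Sequence[Tuple[str, str]],      # (query, expected document ID)
    retrieve: Callable[[str], Sequence[str]],   # assumption: returns ranked IDs
    k: int = 5,
) -> Dict[str, float]:
    """Score a retriever on gold query/document pairs."""
    if not gold_pairs:
        raise ValueError("gold_pairs must not be empty")

    hits = 0
    reciprocal_rank_sum = 0.0
    for query, gold_id in gold_pairs:
        top_ids = list(retrieve(query))[:k]
        if gold_id in top_ids:
            hits += 1
            # Rank is 1-based: a gold document in first place contributes 1.0.
            reciprocal_rank_sum += 1.0 / (top_ids.index(gold_id) + 1)

    total = len(gold_pairs)
    return {"recall@k": hits / total, "mrr@k": reciprocal_rank_sum / total}


# Toy retriever backed by a lookup table, for illustration only.
fake_index = {
    "q1": ["d1", "d9"],
    "q2": ["d8", "d2"],
    "q3": ["d7", "d6"],
}
gold = [("q1", "d1"), ("q2", "d2"), ("q3", "d3")]
metrics = evaluate_retriever(gold, lambda q: fake_index[q], k=2)
```

Running the same gold set against each candidate embedding model gives a like-for-like comparison before you commit to re-indexing the corpus.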
RAG is not a solved problem. It is an engineering system that demands the same rigor as a traditional data pipeline.