In my previous post, How I Built a Local RAG System to Query NTSB Aviation Accident Reports, I built a working local RAG pipeline on top of 1,000+ NTSB aviation accident PDFs. The pipeline works. You ask a question, it retrieves relevant chunks from ChromaDB, and Gemini generates a grounded answer. But after deeper testing, I found five gaps that significantly hurt answer quality. This post documents each one with evidence from real queries, the fix, and a retest.
Quick recap. Your question gets embedded by Gemini, searched against ChromaDB using cosine similarity, and the top 5 chunks go to Gemini Flash for answer generation.
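As a rough sketch of that single retrieval path (the collection name, persist path, and client setup here are assumptions, not the exact code from the previous post):

import chromadb
import google.generativeai as genai

genai.configure(api_key="...")  # Gemini API key
client = chromadb.PersistentClient(path="chroma_db")   # assumed persist path
collection = client.get_collection("ntsb_reports")     # assumed collection name

def baseline_retrieve(query, top_k=5):
    # Embed the question with Gemini, then let ChromaDB run the cosine search
    emb = genai.embed_content(model="models/text-embedding-004", content=query)["embedding"]
    results = collection.query(query_embeddings=[emb], n_results=top_k)
    return results["documents"][0]  # the top_k chunk texts handed to Gemini Flash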
This single retrieval path works well for broad semantic questions. Cosine similarity excels at matching meaning across different vocabulary. "Pilot error" matches "failure to maintain altitude" because the embedding model understands they mean the same thing.
But cosine has blind spots. And those blind spots create three specific failure modes that showed up in real testing.
I queried for a specific NTSB report by its ID: ERA22LA175.
Query: python local_rag/query.py --query "ERA22LA175"
The result? Five wrong reports, all with scores around 0.35 (near random). Gemini answered: "The provided context does not contain any information regarding ERA22LA175."
The correct report was sitting right there in ChromaDB. Cosine could not find it because a report ID like ERA22LA175 has no semantic meaning. Its embedding direction is random noise.
Cosine works on meaning encoded as direction in vector space. It is great at vocabulary bridging ("bad weather" matches "IMC conditions"). It is fundamentally incapable of exact term matching. Report IDs, aircraft registration numbers, and part names are just alphanumeric strings with no semantic direction.
BM25 (Best Match 25) is a classical keyword ranking algorithm. It scores chunks based on term frequency, inverse document frequency, and length normalisation. No embeddings, no API calls. It runs in memory using the rank-bm25 Python library.
For the query ERA22LA175, BM25 finds every chunk containing that exact string and scores it at 13.9. Cosine gave the same chunk a score of 0.331 at rank #12.
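A minimal sketch of that keyword scoring with rank-bm25 (the chunk list and the whitespace tokenisation are simplifying assumptions):

from rank_bm25 import BM25Okapi

# chunk_texts: the same chunk strings stored in ChromaDB
tokenized_chunks = [text.lower().split() for text in chunk_texts]
bm25 = BM25Okapi(tokenized_chunks)

query_tokens = "ERA22LA175".lower().split()
scores = bm25.get_scores(query_tokens)  # one score per chunk; exact-match chunks score highest
best_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)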
The challenge: BM25 scores range 0 to 15, cosine scores range 0 to 1. You cannot add them directly. Reciprocal Rank Fusion (RRF) solves this by using ranks instead of scores:
RRF_K = 60
rrf_score = 1/(bm25_rank + RRF_K) + 1/(vector_rank + RRF_K)

After adding hybrid search, the same query returned ERA22LA175_chunk_0 at position #1 with a BM25 score of 13.941. Gemini now answered with the full accident details: Learjet in Morristown, New Jersey, loss of control on ground, 4 minor injuries.
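Put together, the hybrid merge can look roughly like this (a sketch; the chunk-id lists and their ordering are assumptions about the surrounding code):

RRF_K = 60

def rrf_merge(bm25_ranked_ids, vector_ranked_ids):
    # Both inputs are chunk ids ordered best-first; ranks start at 1
    fused = {}
    for rank, chunk_id in enumerate(bm25_ranked_ids, start=1):
        fused[chunk_id] = fused.get(chunk_id, 0) + 1 / (rank + RRF_K)
    for rank, chunk_id in enumerate(vector_ranked_ids, start=1):
        fused[chunk_id] = fused.get(chunk_id, 0) + 1 / (rank + RRF_K)
    # The highest fused score wins, regardless of the raw BM25 and cosine scales
    return sorted(fused, key=fused.get, reverse=True)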
I queried a specific report for causal analysis.
Query: python local_rag/query.py --query "what mistakes did the pilot make" --ntsb-no CEN22LA359
Cosine ranked Findings (short bullet labels) at #1 and buried Analysis (full causal narrative) at #3. All five scores clustered tightly between 0.372 and 0.526. Cosine could not tell which chunk actually answered the question better.
The answer was shallow: just the label "Personnel issues Monitoring environment" pulled from a bullet point.
Cosine embeds the query and each chunk independently, then compares vectors. It never reads both together. A reranker is a cross-encoder model that reads the query and chunk as a single input, scoring how well the passage actually answers the question.
After RRF merges the top 20 candidates, Cohere's reranker rescores them:
import cohere

RERANK_MODEL = 'rerank-v3.5'
RERANK_TOP_N = 20  # candidates passed to the reranker

def rerank(cohere_client, query, candidates, top_k):
    # Send the query plus every candidate chunk to Cohere's cross-encoder
    documents = [c['document'] for c in candidates]
    response = cohere_client.rerank(
        model=RERANK_MODEL,
        query=query,
        documents=documents,
        top_n=top_k,
    )
    # Results come back ordered by relevance; hit.index points back into candidates
    reranked = []
    for hit in response.results:
        c = candidates[hit.index]
        reranked.append({**c,
                         'rerank_score': hit.relevance_score,
                         'rerank_position': hit.index})
    return reranked

Probable Cause jumped to #1. The Wreckage section got pulled from RRF position #9 to #4 because the reranker recognized crash details as relevant to understanding the mistake. The answer now included the full NTSB determination plus the narrative about the pilot hitting a tree while crop-dusting.
Each NTSB report splits into roughly 10 chunks (History of Flight, Analysis, Probable Cause, Findings, and more). That means 38,232 chunks across about 3,800 reports. For broad pattern queries like "common causes of fatal landing accidents", a single highly relevant report can dominate 3 to 4 of the top 5 slots, crowding out other relevant accidents entirely.
Consider what happens without dedup. If ERA23FA109 is the closest match, you might get its Analysis at #1, Probable Cause at #2, Findings at #3, and History of Flight at #4. Your LLM sees one accident in deep detail but misses the four other crashes that would reveal the actual pattern.
For single-report queries with --ntsb-no, this behaviour is actually useful since you want all sections of that report. For pattern queries across the dataset, it kills breadth.
The reranker already reduces this problem by penalising redundant chunks from the same report. In test runs, all 5 results came from different reports without any dedup logic. A formal dedup filter (keep only the highest-scoring chunk per ntsb_no, auto-disabled in single-report mode) is ready to implement when the dataset grows larger.
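A sketch of what that filter could look like (the ntsb_no metadata key and the single_report flag are assumptions):

def dedup_by_report(chunks, single_report=False):
    # Keep only the highest-scoring chunk per report; skip entirely in single-report mode
    if single_report:
        return chunks
    best = {}
    for chunk in chunks:  # assumed to be sorted best-first by rerank/RRF score
        report_id = chunk['metadata']['ntsb_no']
        if report_id not in best:
            best[report_id] = chunk
    return list(best.values())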
The Failure
The system retrieved top-k chunks regardless of how low the scores were. A cosine score of 0.35 is noise, but it still got sent to the LLM as "relevant context." Gemini had to process 5 irrelevant chunks just to conclude there was nothing useful.
The Fix
Two thresholds added after reranking:
RERANK_SCORE_CUTOFF = 0.10 # when reranker is active
RRF_SCORE_CUTOFF = 0.015 # when reranker is disabled (RRF-only)
After reranking, check the top chunk's score before calling the LLM:
best_score = final_chunks[0]['rerank_score']
if best_score < rerank_cutoff:
    log.info(f"Best rerank score {best_score:.4f} < {rerank_cutoff} cutoff")
    print("No relevant reports found for this query (confidence too low).")
    return  # Gemini call skipped entirely
If the top chunk scores below the threshold, the system returns "No relevant reports found" without calling Gemini at all.
The Result
An out-of-domain query like "what is the capital of France" hit a best rerank score of 0.0667. The cutoff fired, Gemini was never called. Zero API cost, zero wasted tokens. The system printed a clear message asking the user to rephrase or check their filters.
But an aviation query where data was only partially present ("Airbus A380 crash in Mumbai in 2019") scored 0.12 to 0.18 and correctly passed through. The LLM honestly explained that the dataset contained related but non-matching incidents. It found a Boeing in Mumbai and an Airbus 220 in New York, but no A380. The 0.10 threshold cleanly separates non-aviation noise from real aviation queries, even when the specific accident is not in the dataset.
Set your cutoff too high and you will silently drop valid results. Start conservative (0.10 for reranker, 0.015 for RRF) and tune based on your dataset.
The Failure
Users write "pilot confused in clouds lost control." NTSB reports say "spatial disorientation," "inadvertent IMC," "VFR-into-IMC," "LOC-I." The hybrid system partially bridged this, but all 5 results were Findings sections (short bullet labels). The full Analysis narratives explaining how disorientation developed were ranked lower.
The Fix
Before retrieval, one extra LLM call rewrites the query into NTSB terminology:
EXPAND_PROMPT = """Rewrite the user query using precise NTSB terminology.
Output ONLY the rewritten query. Keep it under 30 words.
Preserve identifiers exactly.
Input: pilot confused in clouds lost control
Output: spatial disorientation inadvertent IMC loss of aircraft control VFR-into-IMC
"""

The expanded query feeds into both BM25 and vector embedding. The original query still goes to Gemini for answer generation so the response addresses what the user actually asked.
BM25 scores jumped to 36, up from roughly 10 to 25 without expansion. The reranker returned confidence scores of 0.68 to 0.73, up from 0.49 to 0.52. Analysis sections explaining the full sequence of spatial disorientation events now appeared in the top 5 instead of just Findings bullet labels.
The final answer covered five fatal accidents with full causal narratives. Not just "Personnel issues Spatial disorientation" labels, but the actual story of what happened in each cockpit.
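Under the hood, the expansion step is a single extra Gemini call before retrieval. A rough sketch (the model name and google-generativeai client usage are assumptions, not necessarily what query.py does):

import google.generativeai as genai

def expand_query(user_query):
    # One cheap LLM call: rewrite the query into NTSB vocabulary before retrieval
    model = genai.GenerativeModel('gemini-1.5-flash')  # assumed model
    response = model.generate_content(EXPAND_PROMPT + f"\nInput: {user_query}\nOutput:")
    return response.text.strip()

expanded = expand_query("pilot confused in clouds lost control")
# expanded feeds BM25 and the vector search; the original query still goes to answer generation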
Every layer is independently toggleable via CLI flags: --no-hybrid, --no-rerank, --no-expand, --score-cutoff 0. This lets you A/B test each improvement against the baseline.
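For reference, a sketch of how those toggles might be wired up with argparse (only the flag names come from the pipeline; the help text and defaults are assumed):

import argparse

parser = argparse.ArgumentParser(description="Query the local NTSB RAG pipeline")
parser.add_argument("--query", required=True, help="Natural-language question or report ID")
parser.add_argument("--ntsb-no", help="Restrict retrieval to a single report")
parser.add_argument("--no-hybrid", action="store_true", help="Vector-only retrieval, skip BM25 + RRF")
parser.add_argument("--no-rerank", action="store_true", help="Skip the Cohere reranker")
parser.add_argument("--no-expand", action="store_true", help="Skip LLM query expansion")
parser.add_argument("--score-cutoff", type=float, default=None, help="0 disables the relevance cutoff")
args = parser.parse_args()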
Total added latency: about 6 seconds for BM25 index build (once per session), 300ms for query expansion, 330ms for Cohere reranking. For a system where answer quality matters more than millisecond latency, that is a good trade.
Chunk deduplication is ready to implement when the dataset grows. Beyond that, the next areas to explore are multi-hop retrieval (follow-up queries based on initial results), section-aware chunking (preserving logical boundaries in NTSB reports), and automated evaluation (a test suite of known queries with expected results to measure each pipeline change systematically).
Every improvement here started the same way: running a query, noticing the answer was not as good as it should be, and tracing the failure back to a specific gap in the retrieval pipeline. The system does not need to be perfect. It needs to fail in ways you can observe, diagnose, and fix.
