In my previous post, How I Built a Local RAG System to Query NTSB Aviation Accident Reports, I built a working local RAG pipeline on top of 1,000+ NTSB aviation accident PDFs. The pipeline works. You ask a question, it retrieves relevant chunks from ChromaDB, and Gemini generates a grounded answer. But after deeper testing, I found five gaps that significantly hurt answer quality. This post documents each one with evidence from real queries, the fix, and a retest.
Quick recap. Your question gets embedded by Gemini, searched against ChromaDB using cosine similarity, and the top 5 chunks go to Gemini Flash for answer generation.
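As a rough sketch of that single retrieval path (the collection name, persist path, and client setup here are assumptions, not the exact code from the previous post):

import chromadb
import google.generativeai as genai

genai.configure(api_key="...")  # Gemini API key
client = chromadb.PersistentClient(path="chroma_db")   # assumed persist path
collection = client.get_collection("ntsb_reports")     # assumed collection name

def baseline_retrieve(query, top_k=5):
    # Embed the question with Gemini, then let ChromaDB run the cosine search
    emb = genai.embed_content(model="models/text-embedding-004", content=query)["embedding"]
    results = collection.query(query_embeddings=[emb], n_results=top_k)
    return results["documents"][0]  # the top_k chunk texts handed to Gemini Flash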
This single retrieval path works well for broad semantic questions. Cosine similarity excels at matching meaning across different vocabulary. "Pilot error" matches "failure to maintain altitude" because the embedding model understands they mean the same thing.
But cosine has blind spots. And those blind spots create three specific failure modes that showed up in real testing.
I queried for a specific NTSB report by its ID: ERA22LA175.
Query: python local_rag/query.py --query "ERA22LA175"
The result? Five wrong reports, all with scores around 0.35 (near random). Gemini answered: "The provided context does not contain any information regarding ERA22LA175."
The correct report was sitting right there in ChromaDB. Cosine could not find it because a report ID like ERA22LA175 has no semantic meaning. Its embedding direction is random noise.
Cosine works on meaning encoded as direction in vector space. It is great at vocabulary bridging ("bad weather" matches "IMC conditions"). It is fundamentally incapable of exact term matching. Report IDs, aircraft registration numbers, and part names are just alphanumeric strings with no semantic direction.
BM25 (Best Match 25) is a classical keyword ranking algorithm. It scores chunks based on term frequency, inverse document frequency, and length normalisation. No embeddings, no API calls. It runs in memory using the rank-bm25 Python library.
For the query ERA22LA175, BM25 finds every chunk containing that exact string and scores it at 13.9. Cosine gave the same chunk a score of 0.331 at rank #12.
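A minimal sketch of that keyword scoring with rank-bm25 (the chunk list and the whitespace tokenisation are simplifying assumptions):

from rank_bm25 import BM25Okapi

# chunk_texts: the same chunk strings stored in ChromaDB
tokenized_chunks = [text.lower().split() for text in chunk_texts]
bm25 = BM25Okapi(tokenized_chunks)

query_tokens = "ERA22LA175".lower().split()
scores = bm25.get_scores(query_tokens)  # one score per chunk; exact-match chunks score highest
best_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)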
The challenge: BM25 scores range 0 to 15, cosine scores range 0 to 1. You cannot add them directly. Reciprocal Rank Fusion (RRF) solves this by using ranks instead of scores:
RRF_K = 60
rrf_score = 1/(bm25_rank + RRF_K) + 1/(vector_rank + RRF_K)

After adding hybrid search, the same query returned ERA22LA175_chunk_0 at position #1 with a BM25 score of 13.941. Gemini now answered with the full accident details: Learjet in Morristown, New Jersey, loss of control on ground, 4 minor injuries.
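Put together, the hybrid merge can look roughly like this (a sketch; the chunk-id lists and their ordering are assumptions about the surrounding code):

RRF_K = 60

def rrf_merge(bm25_ranked_ids, vector_ranked_ids):
    # Both inputs are chunk ids ordered best-first; ranks start at 1
    fused = {}
    for rank, chunk_id in enumerate(bm25_ranked_ids, start=1):
        fused[chunk_id] = fused.get(chunk_id, 0) + 1 / (rank + RRF_K)
    for rank, chunk_id in enumerate(vector_ranked_ids, start=1):
        fused[chunk_id] = fused.get(chunk_id, 0) + 1 / (rank + RRF_K)
    # The highest fused score wins, regardless of the raw BM25 and cosine scales
    return sorted(fused, key=fused.get, reverse=True)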
I queried a specific report for causal analysis.
Query: python local_rag/query.py --query "what mistakes did the pilot make" --ntsb-no CEN22LA359
Cosine ranked Findings (short bullet labels) at #1 and buried Analysis (full causal narrative) at #3. All five scores clustered tightly between 0.372 and 0.526. Cosine could not tell which chunk actually answered the question better.
The answer was shallow: just the label "Personnel issues Monitoring environment" pulled from a bullet point.
Cosine embeds the query and each chunk independently, then compares vectors. It never reads both together. A reranker is a cross-encoder model that reads the query and chunk as a single input, scoring how well the passage actually answers the question.
After RRF merges the top 20 candidates, Cohere's reranker rescores them:
import cohere

RERANK_MODEL = 'rerank-v3.5'
RERANK_TOP_N = 20  # candidates passed to the reranker

def rerank(cohere_client, query, candidates, top_k):
    # Send the query plus every candidate chunk to Cohere's cross-encoder
    documents = [c['document'] for c in candidates]
    response = cohere_client.rerank(
        model=RERANK_MODEL,
        query=query,
        documents=documents,
        top_n=top_k,
    )
    # Results come back ordered by relevance; hit.index points back into candidates
    reranked = []
    for hit in response.results:
        c = candidates[hit.index]
        reranked.append({**c,
                         'rerank_score': hit.relevance_score,
                         'rerank_position': hit.index})
    return reranked

Probable Cause jumped to #1. The Wreckage section got pulled from RRF position #9 to #4 because the reranker recognized crash details as relevant to understanding the mistake. The answer now included the full NTSB determination plus the narrative about the pilot hitting a tree while crop-dusting.
Each NTSB report splits into roughly 10 chunks (History of Flight, Analysis, Probable Cause, Findings, and more). That means 38,232 chunks across about 3,800 reports. For broad pattern queries like "common causes of fatal landing accidents", a single highly relevant report can dominate 3 to 4 of the top 5 slots, crowding out other relevant accidents entirely.
Consider what happens without dedup. If ERA23FA109 is the closest match, you might get its Analysis at #1, Probable Cause at #2, Findings at #3, and History of Flight at #4. Your LLM sees one accident in deep detail but misses the four other crashes that would reveal the actual pattern.
For single-report queries with --ntsb-no, this behaviour is actually useful since you want all sections of that report. For pattern queries across the dataset, it kills breadth.
The reranker already reduces this problem by penalising redundant chunks from the same report. In test runs, all 5 results came from different reports without any dedup logic. A formal dedup filter (keep only the highest-scoring chunk per ntsb_no, auto-disabled in single-report mode) is ready to implement when the dataset grows larger.
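A sketch of what that filter could look like (the ntsb_no metadata key and the single_report flag are assumptions):

def dedup_by_report(chunks, single_report=False):
    # Keep only the highest-scoring chunk per report; skip entirely in single-report mode
    if single_report:
        return chunks
    best = {}
    for chunk in chunks:  # assumed to be sorted best-first by rerank/RRF score
        report_id = chunk['metadata']['ntsb_no']
        if report_id not in best:
            best[report_id] = chunk
    return list(best.values())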
The Failure
The system retrieved top-k chunks regardless of how low the scores were. A cosine score of 0.35 is noise, but it still got sent to the LLM as "relevant context." Gemini had to process 5 irrelevant chunks just to conclude there was nothing useful.
The Fix
Two thresholds added after reranking:
RERANK_SCORE_CUTOFF = 0.10 # when reranker is active
RRF_SCORE_CUTOFF = 0.015 # when reranker is disabled (RRF-only)
After reranking, check the top chunk's score before calling the LLM:
best_score = final_chunks[0]['rerank_score']
if best_score < rerank_cutoff:
    log.info(f"Best rerank score {best_score:.4f} < {rerank_cutoff} cutoff")
    print("No relevant reports found for this query (confidence too low).")
    return  # Gemini call skipped entirely
If the top chunk scores below the threshold, the system returns "No relevant reports found" without calling Gemini at all.
The Result
An out-of-domain query like "what is the capital of France" hit a best rerank score of 0.0667. The cutoff fired, Gemini was never called. Zero API cost, zero wasted tokens. The system printed a clear message asking the user to rephrase or check their filters.
But an aviation query where data was only partially present ("Airbus A380 crash in Mumbai in 2019") scored 0.12 to 0.18 and correctly passed through. The LLM honestly explained that the dataset contained related but non-matching incidents. It found a Boeing in Mumbai and an Airbus 220 in New York, but no A380. The 0.10 threshold cleanly separates non-aviation noise from real aviation queries, even when the specific accident is not in the dataset.
Set your cutoff too high and you will silently drop valid results. Start conservative (0.10 for reranker, 0.015 for RRF) and tune based on your dataset.
The Failure
Users write "pilot confused in clouds lost control." NTSB reports say "spatial disorientation," "inadvertent IMC," "VFR-into-IMC," "LOC-I." The hybrid system partially bridged this, but all 5 results were Findings sections (short bullet labels). The full Analysis narratives explaining how disorientation developed were ranked lower.
The Fix
Before retrieval, one extra LLM call rewrites the query into NTSB terminology:
EXPAND_PROMPT = """Rewrite the user query using precise NTSB terminology.
Output ONLY the rewritten query. Keep it under 30 words.
Preserve identifiers exactly.
Input: pilot confused in clouds lost control
Output: spatial disorientation inadvertent IMC loss of aircraft control VFR-into-IMC
"""

The expanded query feeds into both BM25 and vector embedding. The original query still goes to Gemini for answer generation so the response addresses what the user actually asked.
BM25 scores jumped to 36, up from roughly 10 to 25 without expansion. The reranker returned confidence scores of 0.68 to 0.73, up from 0.49 to 0.52. Analysis sections explaining the full sequence of spatial disorientation events now appeared in the top 5 instead of just Findings bullet labels.
The final answer covered five fatal accidents with full causal narratives. Not just "Personnel issues Spatial disorientation" labels, but the actual story of what happened in each cockpit.
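Under the hood, the expansion step is a single extra Gemini call before retrieval. A rough sketch (the model name and google-generativeai client usage are assumptions, not necessarily what query.py does):

import google.generativeai as genai

def expand_query(user_query):
    # One cheap LLM call: rewrite the query into NTSB vocabulary before retrieval
    model = genai.GenerativeModel('gemini-1.5-flash')  # assumed model
    response = model.generate_content(EXPAND_PROMPT + f"\nInput: {user_query}\nOutput:")
    return response.text.strip()

expanded = expand_query("pilot confused in clouds lost control")
# expanded feeds BM25 and the vector search; the original query still goes to answer generation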
Every layer is independently toggleable via CLI flags: --no-hybrid, --no-rerank, --no-expand, --score-cutoff 0. This lets you A/B test each improvement against the baseline.
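For reference, a sketch of how those toggles might be wired up with argparse (only the flag names come from the pipeline; the help text and defaults are assumed):

import argparse

parser = argparse.ArgumentParser(description="Query the local NTSB RAG pipeline")
parser.add_argument("--query", required=True, help="Natural-language question or report ID")
parser.add_argument("--ntsb-no", help="Restrict retrieval to a single report")
parser.add_argument("--no-hybrid", action="store_true", help="Vector-only retrieval, skip BM25 + RRF")
parser.add_argument("--no-rerank", action="store_true", help="Skip the Cohere reranker")
parser.add_argument("--no-expand", action="store_true", help="Skip LLM query expansion")
parser.add_argument("--score-cutoff", type=float, default=None, help="0 disables the relevance cutoff")
args = parser.parse_args()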
Total added latency: about 6 seconds for BM25 index build (once per session), 300ms for query expansion, 330ms for Cohere reranking. For a system where answer quality matters more than millisecond latency, that is a good trade.
Chunk deduplication is ready to implement when the dataset grows. Beyond that, the next areas to explore are multi-hop retrieval (follow-up queries based on initial results), section-aware chunking (preserving logical boundaries in NTSB reports), and automated evaluation (a test suite of known queries with expected results to measure each pipeline change systematically).
Every improvement here started the same way: running a query, noticing the answer was not as good as it should be, and tracing the failure back to a specific gap in the retrieval pipeline. The system does not need to be perfect. It needs to fail in ways you can observe, diagnose, and fix.
