Before reading this, if you're new to RAG, I'd recommend starting with my earlier post RAG: Give Your AI a Better Memory, which covers the concept, the core loop, and why RAG beats fine-tuning for most real-world use cases. This post is the hands-on sequel: here's how I actually built one, end to end, locally, with real data.
The NTSB (National Transportation Safety Board) publishes detailed investigation reports for every civil aviation accident in the US. There's a public database with 50,000+ records going back to 1981. For my work, I extracted around 26,000 records going back to 2010. The data has 36 columns: location, aircraft, injuries, weather conditions, probable cause, and a full narrative PDF for each completed investigation.
I wanted to ask natural questions like:
- "What caused most fatal accidents in Alaska in IMC conditions?"
- "What patterns appear in Cessna crashes during landing?"
- "Show me engine failure cases where the probable cause involved carburetor ice"
No existing tool lets you do this. So I built one.
Everything runs locally. No cloud. No paid vector database. The only API calls are to Gemini (for embeddings — free tier) and optionally Claude/Gemini (for generating answers).
NTSB Website
│
├── scripts/download.py → Downloads ~1,000 PDFs via public API
├── scripts/split_csv.py → Splits 26K-row CSV into 5-year bands
│
└── pipeline/
├── ingest.py → PDF → text → section chunks → JSONL
├── setup_chromadb.py → Creates local vector database
├── embed_and_store.py → JSONL → Gemini embeddings → ChromaDB
└── query.py → Your question → ChromaDB → AI answer
NTSB provides a CSV export from their query page. I downloaded all records from 2010–2026: 26,872 rows, 36 columns. Each row has a `NtsbNo` (the report identifier, e.g. `ERA21FA001`) and an `Mkey` (a numeric ID used by their API).
I split the full CSV into 5-year bands to make processing manageable:
NTSB_2010_2014.csv → 8,539 records
NTSB_2015_2019.csv → 8,191 records
NTSB_2020_2024.csv → 8,106 records
NTSB_2025_2026.csv → 2,036 records
Why split? So I could build and test the pipeline on one band (2020–2024) before scaling to all years.
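For reference, the split is a one-liner kind of job. Here's a rough sketch of what scripts/split_csv.py does, not its actual contents: it assumes pandas, the `EventDate` column from the export, and a hypothetical path for the full CSV.

```python
# Illustrative sketch only; the real logic lives in scripts/split_csv.py.
import pandas as pd

BANDS = [(2010, 2014), (2015, 2019), (2020, 2024), (2025, 2026)]

df = pd.read_csv('data/NTSB_full.csv')                          # hypothetical path to the full export
year = pd.to_datetime(df['EventDate'], errors='coerce').dt.year

for start, end in BANDS:
    band = df[(year >= start) & (year <= end)]
    band.to_csv(f'data/NTSB_{start}_{end}.csv', index=False)
```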
Each completed NTSB investigation has a PDF report. The public API URL is:
https://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/{mkey}/pdf
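The download script itself isn't shown here, but the core loop is simple. A minimal sketch, assuming `requests` and `pandas` and the `Mkey` / `NtsbNo` columns from the CSV; file layout and retry handling are simplified:

```python
# Sketch of the download loop; the real implementation is scripts/download.py.
from pathlib import Path

import pandas as pd
import requests

API = 'https://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/{mkey}/pdf'

def download_reports(csv_path: str, out_dir: str) -> None:
    df = pd.read_csv(csv_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for _, row in df.iterrows():
        target = out / f"{row['NtsbNo']}.pdf"
        if target.exists():                      # skip PDFs already on disk
            continue
        resp = requests.get(API.format(mkey=row['Mkey']), timeout=60)
        if resp.ok and resp.content:             # skip empty responses (not every record has a completed report)
            target.write_bytes(resp.content)
```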
Run it using:
cd ntsb-insight
source .venv/bin/activate
python scripts/download.py --csv data/NTSB_2020_2024.csv

This is where the real thinking happens. A naive approach would be to split each PDF every 512 tokens. I didn't do that.
Why section-aware chunking?
An NTSB report has predictable sections: History of Flight, Pilot Information, Meteorological Information, Analysis, Probable Cause and Findings. If you chunk blindly at fixed sizes, you get a chunk that says "the pilot failed to..." with no context about who the pilot was or which accident this is. A section-aware chunk is semantically complete on its own.
Here's how I detect sections: just regex matching against known headers.
SECTION_PATTERNS = [
    (re.compile(r'Analysis', re.IGNORECASE), 'Analysis'),
    (re.compile(r'Probable Cause and Findings', re.IGNORECASE), 'Probable Cause and Findings'),
    (re.compile(r'History of (the )?Flight', re.IGNORECASE), 'History of Flight'),
    (re.compile(r'Pilot Information', re.IGNORECASE), 'Pilot Information'),
    (re.compile(r'Meteorological Information', re.IGNORECASE), 'Meteorological Information'),
    (re.compile(r'Wreckage and Impact Information', re.IGNORECASE), 'Wreckage and Impact Information'),
    # ... more patterns
]

SKIP_SECTIONS = {'Administrative Information'}

The section splitter walks the PDF line by line. When it hits a header line, it saves the previous section and starts a new one:
def split_into_sections(full_text: str) -> list[tuple[str, str]]:
    lines = full_text.split('\n')
    sections = []
    current_section = 'Header'
    current_lines = []
    for line in lines:
        stripped = line.strip()
        matched_section = None
        for pattern, label in SECTION_PATTERNS:
            if pattern.fullmatch(stripped) or (len(stripped) < 60 and pattern.search(stripped)):
                matched_section = label
                break
        if matched_section:
            if current_lines:
                sections.append((current_section, '\n'.join(current_lines).strip()))
            current_section = matched_section
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append((current_section, '\n'.join(current_lines).strip()))
    return [(name, text) for name, text in sections
            if text and name not in SKIP_SECTIONS and len(text.strip()) > 50]

For long sections (>4,000 characters), I split further with a 400-character overlap, so context carries across sub-chunk boundaries.
The metadata strategy
Every chunk gets 29 fields of metadata from the CSV attached to it. This is the critical design decision:
chunks.append({
'id' : f"{ntsb_no}_chunk_{chunk_index}",
'ntsb_no' : ntsb_no,
'text' : sub_text,
'metadata': {
'ntsb_no' : ntsb_no, # JOIN key back to CSV
'section' : section_name, # which section this chunk is from
'event_date' : str(meta.get('EventDate', '')),
'state' : meta.get('State'),
'make' : meta.get('Make'),
'model' : meta.get('Model'),
'injury_severity': meta.get('HighestInjuryLevel'),
'fatal_count' : meta.get('FatalInjuryCount'),
'weather' : meta.get('WeatherCondition'),
'probable_cause' : meta.get('ProbableCause'),
# ... 20 more fields
}
})

Think of `NtsbNo` as a foreign key in a database. Every chunk links back to its report:
ERA21FA001_chunk_0 ─ntsb_no ─► CSV row (36 columns) ─► ERA21FA001.pdf
ERA21FA001_chunk_1 ─► (same report)
ERA21FA001_chunk_2 ─► (same report)
The metadata is not embedded into the chunk text; that would waste embedding dimensions on ID strings. It's stored separately as filterable fields in the vector database.
Output is a JSONL file, one chunk per line, ready for the next stage:
python pipeline/ingest.py \
--pdf-dir pdfs/NTSB_2020_2024 \
--csv data/NTSB_2020_2024.csv \
--out chunks/chunks_2020_2024.jsonl \
--limit 10   # test with 10 first

Result from my test run: 2 PDFs → 19 chunks, avg ~10 chunks per report, all sections detected correctly.
ChromaDB is a local, open-source vector database. It persists to disk, so embeddings survive between runs. Setup is one script:
client = chromadb.PersistentClient(path='AI/RAG/vectordb')
col = client.get_or_create_collection(
    name='ntsb_reports',
    metadata={'hnsw:space': 'cosine'},   # cosine similarity for semantic search
)

python pipeline/setup_chromadb.py            # create
python pipeline/setup_chromadb.py --info     # inspect
python pipeline/setup_chromadb.py --reset    # wipe and start fresh

This stage reads the JSONL, calls Gemini to convert each chunk's text into a vector, and upserts into ChromaDB.
EMBED_MODEL = 'gemini-embedding-001'   # 3072 dimensions, free tier
BATCH_SIZE = 100                       # Gemini supports up to 100 texts per batch
def embed_batch(client, texts):
    response = client.models.embed_content(
        model=EMBED_MODEL,
        contents=texts,
        config=types.EmbedContentConfig(task_type='RETRIEVAL_DOCUMENT'),   # key detail
    )
    return [e.values for e in response.embeddings]

Notice `task_type='RETRIEVAL_DOCUMENT'`. This matters — more on it in a moment.
Then upsert into ChromaDB with metadata:
col.upsert(
    ids=ids,
    embeddings=embeddings,
    documents=texts,
    metadatas=metadatas,
)
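Putting those pieces together, the main loop in embed_and_store.py is roughly this shape (a sketch; variable names are mine, not the script's):

```python
# Sketch of the embed-and-store loop: one Gemini call per batch of up to 100 chunks.
for i in range(0, len(new_chunks), BATCH_SIZE):
    batch = new_chunks[i:i + BATCH_SIZE]
    texts = [c['text'] for c in batch]
    embeddings = embed_batch(gemini_client, texts)
    col.upsert(
        ids=[c['id'] for c in batch],
        embeddings=embeddings,
        documents=texts,
        metadatas=[c['metadata'] for c in batch],
    )
```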
**One gotcha:** ChromaDB only accepts `str`, `int`, `float`, or `bool` in metadata — no `None`, no `NaN`. I added a sanitization step:
def sanitize_metadata(meta: dict) -> dict:
    clean = {}
    for k, v in meta.items():
        if v is None:
            clean[k] = ''
        elif isinstance(v, float) and v != v:   # NaN check
            clean[k] = ''
        elif isinstance(v, (str, int, float, bool)):
            clean[k] = v
        else:
            clean[k] = str(v)
    return clean
The script also has resume support: it checks which chunk IDs already exist in ChromaDB and skips them:
existing = col.get(include=[])   # fetch only IDs
existing_ids = set(existing['ids'])
new_chunks = [c for c in all_chunks if c['id'] not in existing_ids]
Result: 19 chunks embedded and stored in 1.3 seconds, zero failures.
python pipeline/embed_and_store.py --jsonl chunks/chunks_2020_2024.jsonl

This is where everything comes together. The full query flow:
Your question
│
▼
Gemini embed (RETRIEVAL_QUERY)
│
▼
ChromaDB: metadata filter + vector similarity search
│
▼
Top 5 matching chunks (with similarity scores)
│
▼
Gemini Flash generates answer from context
│
▼
"Based on NTSB report ERA21FA001..."Here's something that trips people up. Queries and documents are linguistically very different:
Query : "What causes engine failure in Cessna?" ← short, question-form
Document : "The carburetor venturi was blocked by ← long, technical narrative ice accumulation at altitude..."
If you embed both with the same "neutral" embedding, they land far apart in vector space even though they're directly related. The query uses everyday language, the document uses NTSB vocabulary.
Gemini's embedding model solves this with a `task_type` parameter: `RETRIEVAL_DOCUMENT` for chunks at ingest time, `RETRIEVAL_QUERY` for questions at query time. These two task types are trained to align. The model maps "pilot made a mistake" to match NTSB language like "pilot's failure to maintain altitude" and "inadequate preflight planning."
# At ingest (embed_and_store.py)
config=types.EmbedContentConfig(task_type='RETRIEVAL_DOCUMENT')

# At query time (query.py)
config=types.EmbedContentConfig(task_type='RETRIEVAL_QUERY')

Metadata Pre-filtering
Before the vector search runs, I can apply filters using ChromaDB's where clause. This is hybrid retrieval: narrow the search space with structured data first, then run semantic search within that subset:
def build_filter(args) -> dict | None:
    conditions = []
    if args.state:
        conditions.append({'state': {'$eq': args.state}})
    if args.make:
        conditions.append({'make': {'$eq': args.make.upper()}})
    if args.weather:
        conditions.append({'weather': {'$eq': args.weather.upper()}})
    if args.injury:
        conditions.append({'injury_severity': {'$eq': args.injury}})
    if not conditions:        # no filters requested: let the vector search run unfiltered
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {'$and': conditions}
So a query like "What caused the Cessna crash in Florida in 2021?" becomes:
1. Metadata filter: `state=Florida AND make=CESSNA`
2. Vector search: finds Analysis + Probable Cause chunks most similar to the question
3. LLM answer: "Based on NTSB report ERA21FA001, the probable cause was..."
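In code, steps 1 and 2 boil down to a query-time embedding plus a filtered ChromaDB query. A sketch of that retrieval step (variable names are mine; query.py's internals may differ):

```python
# Embed the question with the query-side task type, then search with the metadata filter.
query_vec = gemini_client.models.embed_content(
    model=EMBED_MODEL,
    contents=[question],
    config=types.EmbedContentConfig(task_type='RETRIEVAL_QUERY'),
).embeddings[0].values

results = col.query(
    query_embeddings=[query_vec],
    n_results=5,                      # top 5 matching chunks
    where=build_filter(args),         # None means no metadata pre-filter
    include=['documents', 'metadatas', 'distances'],
)
```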
# Simple query
python pipeline/query.py --query "What caused the fatal accident in Alaska?"

# With metadata filters
python pipeline/query.py --query "engine failure" --injury Fatal --state Alaska --make Cessna

# Interactive mode
python pipeline/query.py --interactive
What about questions outside the dataset? The system handles it gracefully. When I asked about a helicopter crash in Mumbai (which doesn't exist in the NTSB database since it only covers US civil aviation), the retriever returned the closest helicopter-related chunks it could find but with low similarity scores (~0.40). Gemini correctly identified that no relevant data was available and declined to fabricate an answer.
1. Section-aware chunking preserves context. Each chunk is a complete, meaningful unit. "The pilot" in chunk 3 refers to someone described in the same chunk, not page 1 of a different document.
2. Metadata is the filter layer, not the embedding layer. `NtsbNo`, state, make, weather — these are stored as structured fields for precise filtering. The embedding captures only semantic meaning. This is the difference between finding *relevant* chunks and finding the *right* chunks.
3. Asymmetric task types close the vocabulary gap. NTSB reports use formal investigation language. Human questions use plain English. Using `RETRIEVAL_QUERY` at query time and `RETRIEVAL_DOCUMENT` at ingest time lets the embedding model bridge that gap automatically.
The full source code is on GitHub: [NTSB-Insight-RAG]
Like this post? Share it with someone building AI tools on real data.
