From Local RAG to Production AWS: Porting My NTSB Pipeline to the AWS cloud

Source Code

python

index_body = {
    "settings": {"index": {"knn": True, "knn.algo_param.ef_search": 100}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 3072,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "innerproduct"
                }
            },
            "text":            {"type": "text"},
            "ntsb_no":         {"type": "keyword"},
            "section":         {"type": "keyword"},
            "state":           {"type": "keyword"},
            "make":            {"type": "keyword"},
            "injury_severity": {"type": "keyword"}
        }
    }
}
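
The mapping above ranks neighbours by `innerproduct`. A minimal sketch of why that choice usually goes together with unit-length embeddings: once vectors are L2-normalized, the inner product equals cosine similarity. The helper names (`l2_normalize`, `dot`, `cosine`) are hypothetical, and whether the pipeline's embeddings are already normalized is an assumption to verify, not something the code above states.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 2-d vectors stand in for the 3072-d embeddings.
a = [3.0, 4.0]
b = [1.0, 0.0]

# After normalization the inner product matches the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

If the embedding model does not return unit-length vectors, normalizing both the indexed chunks and the query vector keeps `innerproduct` ranking consistent with cosine ranking.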

Source Code

python

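import os
from urllib.parse import unquote_plus

# s3_client, load_csv_metadata, get_chunks, embed_chunks and index_chunks
# are defined elsewhere in the ingest module.

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event notifications URL-encode object keys (a space arrives as "+")
        key = unquote_plus(record["s3"]["object"]["key"])

        # Report number = file name without its extension
        ntsb_no = os.path.splitext(os.path.basename(key))[0]
        local_path = f"/tmp/{ntsb_no}.pdf"  # /tmp is Lambda's writable scratch space

        s3_client.download_file(bucket, key, local_path)
        csv_meta = load_csv_metadata()

        chunks  = get_chunks(local_path, ntsb_no, csv_meta)  # parse and split the PDF
        chunks  = embed_chunks(chunks)                       # attach an embedding to each chunk
        indexed = index_chunks(chunks)                       # write the chunks to OpenSearch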

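The record parsing in the handler can be exercised locally without AWS. The sketch below is a hedged illustration: `parse_s3_record`, the bucket name, and the object key are all made up, and it assumes the report number is simply the PDF's file name without its extension, as in the handler above.

```python
import os
from urllib.parse import unquote_plus

def parse_s3_record(record):
    """Return (bucket, key, ntsb_no) for one S3 event record."""
    bucket = record["s3"]["bucket"]["name"]
    # Keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    ntsb_no = os.path.splitext(os.path.basename(key))[0]
    return bucket, key, ntsb_no

# Hypothetical event with the same shape S3 sends to Lambda.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "reports/EXAMPLE+0001.pdf"}}}
    ]
}

parsed = [parse_s3_record(r) for r in sample_event.get("Records", [])]
```

Decoding the key before calling `download_file` matters because the encoded form ("+" for a space) does not match the actual object name in the bucket.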
Source Code

python

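# Local: rank-bm25 library, which holds every tokenized chunk in process memory
from rank_bm25 import BM25Okapi
bm25 = BM25Okapi(tokenized_docs)

# AWS: a match query uses OpenSearch's built-in BM25 scoring, so nothing is loaded into Lambda memory
body = {"query": {"match": {"text": {"query": query_text}}}}

# Vector search uses the OpenSearch k-NN query: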

body = {
    "size": 20,
    "query": {
        "knn": {
            "embedding": {"vector": query_embedding, "k": 20}
        }
    }
}
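
The two queries above return separate ranked lists. The code shown here does not say how the pipeline merges them, so the following is only one possible approach, swapped in for illustration: Reciprocal Rank Fusion (RRF), which scores each chunk by the sum of `1 / (k + rank)` over the lists it appears in. The function name, the chunk ids, and `k=60` are assumptions, not values taken from the pipeline.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of ids into one list using RRF scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]

# Hypothetical chunk ids, as if returned by the BM25 and k-NN queries.
bm25_ids = ["c1", "c2", "c3"]
knn_ids = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([bm25_ids, knn_ids])
```

RRF only needs ranks, not raw scores, which sidesteps the fact that BM25 scores and vector similarity scores are on different scales.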

🏗️ Architecture Overview

About MjShetty

📦 Step 1: S3 Bucket Design

🔎 Step 2: OpenSearch Serverless as the Vector Store

⚙️ Step 3: Ingest Lambda, Event-Driven PDF Processing

🔍 Step 4: Query Lambda, the Full Pipeline on AWS

RAG Testing and Results

What's Built and Tested

What's Next
