Building an Agentic RAG System from Scratch: Architecture, Hybrid Search, and Adaptive Query Routing
A deep dive into RAG Talk — a conversational AI system with hybrid BM25+vector search, Reciprocal Rank Fusion, cross-encoder reranking, and an intelligent agentic routing layer that adapts retrieval strategy to query intent.
Large Language Models are impressive, but they hallucinate. They confidently fabricate quotes, invent citations, and present fiction as fact. For any application where accuracy matters — and that's most of them — you need a way to ground LLM responses in real data.
That's the core premise behind RAG Talk, a conversational AI system I built from scratch that lets users have in-depth conversations with AI personas modelled after historical thinkers — Charlie Munger, Benjamin Franklin, Naval Ravikant, and others. Every response is grounded in the figure's actual writings, speeches, and interviews, with inline citations so users can verify claims.
What started as a straightforward RAG prototype quickly became a study in retrieval engineering: hybrid search, cross-encoder reranking, query decomposition, and ultimately an agentic routing layer that decides how to retrieve based on what the user is asking. This post walks through the architecture, the trade-offs, and the things I'd do differently.
Architecture Overview
┌──────────────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ Chat UI ←── SSE Stream ──→ Pipeline Metadata Display │
└──────────────────────┬───────────────────────────────────────┘
│ POST /api/chat (SSE)
┌──────────────────────▼───────────────────────────────────────┐
│ FastAPI Backend │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Agentic Query Router │ │
│ │ classify_query() → GREETING | SIMPLE | COMPLEX | FOLLOW│ │
│ └────────┬────────────┬──────────┬────────────┬───────────┘ │
│ │ │ │ │ │
│ Direct LLM Single-pass Multi-step Context-aware │
│ (no retrieval) RAG Decompose Coreference │
│ │ │ │ │ │
│ ┌────────▼────────────▼──────────▼────────────▼───────────┐ │
│ │ Retrieval Pipeline │ │
│ │ Query Rewrite → Hybrid Search → RRF Fusion → Reranking │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Tech stack: Next.js 16 frontend, FastAPI backend, ChromaDB for vector storage, OpenRouter for LLM access, BM25 for sparse search, custom scrapers for data ingestion.
Deep Dive: The Retrieval Pipeline
Why Hybrid Search?
Early in development, I tested pure embedding search and immediately noticed a pattern: it was great at capturing semantic similarity but terrible at exact-match lookups. Ask "What did Munger say about Costco?" and embedding search might surface passages about retail economics that are semantically related but don't actually mention Costco.
BM25 (Best Matching 25) solves this because it operates on term frequency — if the user says "Costco", BM25 will surface documents that literally contain "Costco". The problem is that BM25 misses paraphrases and conceptual matches entirely.
The solution: run both in parallel and merge.
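To make that parallel step concrete, here is a toy sketch in which stand-in scorers replace rank-bm25 and the embedding model; each returns a ranked list over the same corpus, and the corpus, function names, and scoring are illustrative only:

```python
import math

# Toy corpus; in RAG Talk these are chunks stored in ChromaDB plus a BM25 index.
DOCS = {
    "d1": "munger praised costco at nearly every shareholder meeting",
    "d2": "retail economics reward scale and relentlessly low prices",
    "d3": "membership models build loyalty and predictable revenue",
}

def sparse_rank(query: str) -> list[str]:
    """Stand-in for BM25: score docs by exact term overlap with the query."""
    terms = set(query.lower().split())
    scores = {d: len(terms & set(text.split())) for d, text in DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True)

def dense_rank(query: str) -> list[str]:
    """Stand-in for embedding search: score docs by bag-of-words cosine."""
    def vec(text: str) -> dict[str, int]:
        counts: dict[str, int] = {}
        for token in text.lower().split():
            counts[token] = counts.get(token, 0) + 1
        return counts

    q = vec(query)
    q_norm = math.sqrt(sum(c * c for c in q.values()))

    def cosine(v: dict[str, int]) -> float:
        dot = sum(q.get(t, 0) * c for t, c in v.items())
        v_norm = math.sqrt(sum(c * c for c in v.values()))
        return dot / (q_norm * v_norm) if q_norm and v_norm else 0.0

    scores = {d: cosine(vec(text)) for d, text in DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In the real pipeline both retrievers run in parallel, and their two ranked lists feed straight into the fusion step.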
Reciprocal Rank Fusion
The merging strategy matters. You can't simply combine raw scores because BM25 scores and cosine similarity scores are on completely different scales. Reciprocal Rank Fusion (RRF) sidesteps this entirely by working on ranks rather than scores:
RRF_score(d) = Σ 1 / (k + rank_i(d))   for each retrieval method i

The constant k (set to 60 in our config) controls how much weight rank position carries. A document ranked #1 by both methods gets a much higher fused score than one ranked #1 by one and #20 by the other.
This is elegantly simple — no score normalisation, no learned weights, and only a single constant, k, which rarely needs tuning. It just works, and it works well. The implementation is ~30 lines of Python.
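Condensed, the fusion logic looks like the sketch below; the function name and signature are illustrative, and the project's version adds logging and metadata handling around the same core:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs via Reciprocal Rank Fusion.

    Each inner list is one retrieval method's results, best first.
    Returns doc IDs ordered by descending fused score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both lists dominates: given `[["a", "b", "c"], ["a", "c", "d"]]`, "a" wins outright and "c" beats "b" because two mid ranks outweigh one good one.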
Cross-Encoder Reranking
After RRF fusion, we have roughly 20 candidate documents. We need to select the top 5. A naive approach would take the top-5 RRF scores, but we can do better.
A cross-encoder jointly processes the (query, document) pair and produces a single relevance score. Unlike bi-encoder embeddings (which encode query and document independently), cross-encoders capture the interaction between query terms and document content.
I implemented the reranker as an LLM call — the same LLM we use for generation also serves as the relevance judge. This avoids deploying a separate cross-encoder model.
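Structurally that reduces to the sketch below, with the LLM call abstracted behind a `judge` callable (in the project, a prompt asking the model to score relevance); the interface is illustrative:

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    judge: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Score each (query, document) pair with `judge` and keep the top_k.

    `judge` stands in for the LLM relevance call; injecting it means the
    selection logic can be exercised with any scorer.
    """
    scored = sorted(candidates, key=lambda doc: judge(query, doc), reverse=True)
    return scored[:top_k]
```

Swapping `judge` for a local cross-encoder model later would require no change to the surrounding pipeline.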
Query Rewriting
Conversational queries are often poor search queries. "What did he think about that?" is a valid conversational utterance but a terrible search query. The query rewriter transforms conversational language into retrieval-optimised form:
- "What did he think about that?" → "Charlie Munger views on technology investing"
- "Tell me more" → "Charlie Munger additional perspectives on mental models"
This is a single LLM call. Cost is minimal and the retrieval quality improvement is significant.
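A plausible shape for that call, with the prompt wording and `persona` parameter as my illustrative guesses rather than the project's actual prompt:

```python
def build_rewrite_prompt(
    query: str,
    history: list[tuple[str, str]],
    persona: str = "Charlie Munger",
) -> str:
    """Build the single rewrite prompt; the LLM's reply is used verbatim
    as the search query. Only the last few turns are included."""
    turns = "\n".join(f"{role}: {text}" for role, text in history[-4:])
    return (
        f"Conversation so far:\n{turns}\n\n"
        f"Rewrite the user's latest message as a standalone search query "
        f"about {persona}. Resolve pronouns and vague references. "
        f"Return only the query.\n"
        f"Latest message: {query}"
    )
```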
Deep Dive: Agentic RAG Router
This is the feature I'm most proud of, and the one that elevates the project from "standard RAG demo" to something architecturally interesting.
The Problem
Standard RAG treats every query identically: retrieve K documents, build context, generate. But not every query needs retrieval:
- "Hello, how are you?" — retrieval is wasteful and can actually hurt the response
- "What's the key takeaway?" — needs the last few turns of conversation context, not fresh retrieval
- "Compare Munger's and Buffett's views on tech" — needs multiple targeted retrievals, not one
The Solution: Adaptive Query Classification
The agentic router analyses each query and routes it to the optimal strategy:
class QueryType(str, Enum):
    GREETING = "greeting"    # No retrieval
    SIMPLE = "simple"        # Standard single-pass RAG
    COMPLEX = "complex"      # Multi-step decomposition
    FOLLOWUP = "followup"    # Context-aware with coreference resolution

Classification uses regex-based heuristics in rule-based mode (for fast, deterministic behaviour in demos and testing) and can be swapped for an LLM-based classifier in production.
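A minimal sketch of the rule-based mode, with the enum repeated so it runs standalone; the regex patterns here are illustrative, not the project's actual rule set:

```python
import re
from enum import Enum

class QueryType(str, Enum):
    GREETING = "greeting"
    SIMPLE = "simple"
    COMPLEX = "complex"
    FOLLOWUP = "followup"

# Illustrative patterns; the real heuristics are more extensive.
GREETING_RE = re.compile(r"^\s*(hi|hello|hey|how are you)\b", re.I)
COMPLEX_RE = re.compile(r"\b(compare|versus|vs\.?|difference between)\b", re.I)
FOLLOWUP_RE = re.compile(r"\b(he|she|they|that|this|it|more)\b", re.I)

def classify_query(query: str, has_history: bool = False) -> QueryType:
    """Route a query to a retrieval strategy using cheap, deterministic rules."""
    if GREETING_RE.search(query):
        return QueryType.GREETING
    if COMPLEX_RE.search(query):
        return QueryType.COMPLEX
    if has_history and FOLLOWUP_RE.search(query):
        return QueryType.FOLLOWUP
    return QueryType.SIMPLE
```

Note that FOLLOWUP only fires when conversation history exists; a pronoun in the very first message has nothing to refer back to.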
Multi-Step Decomposition
When a query is classified as COMPLEX, the router decomposes it into sub-queries:
"Compare Munger and Buffett's views on tech investing"
└→ Sub-query 1: "What are Munger's views on tech investing?"
└→ Sub-query 2: "What are Buffett's views on tech investing?"

Each sub-query runs through the full retrieval pipeline independently. Results are deduplicated and merged before generation. The LLM receives a richer, more targeted context than a single retrieval pass would provide.
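The orchestration is small once the decomposition call and the retrieval pipeline are treated as injected functions; both names below are illustrative stand-ins:

```python
from typing import Callable

def run_complex_query(
    query: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Retrieve for each sub-query, then merge with order-preserving dedup.

    `decompose` stands in for the LLM call that splits the query;
    `retrieve` stands in for the full hybrid pipeline, returning doc IDs.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for sub_query in decompose(query):
        for doc_id in retrieve(sub_query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```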
Coreference Resolution
Follow-up queries often use pronouns: "What else did he say about that?" The router resolves these references by scanning recent conversation history for likely referents:
User: "Tell me about Munger's views on incentives"
Assistant: "Munger believed that incentives are the most powerful..."
User: "What else did he say about that?"
└→ Resolved: "What else did Charlie Munger say about incentives?"

Pipeline Transparency
A key design decision: the agentic pipeline streams its reasoning alongside the response. The client receives SSE events like:
{"type": "agent_step", "step": "routing", "detail": "Query classified as complex"}
{"type": "agent_step", "step": "strategy", "detail": "Multi-step decomposition (2 sub-queries)"}
{"type": "agent_step", "step": "sub_query", "detail": "Retrieving: What are Munger's views on tech?"}

This serves two purposes: UX (users see what the system is doing, which builds trust) and debugging (pipeline behaviour is observable without log-diving).
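On the wire these are ordinary Server-Sent Events `data:` frames; a minimal formatter (the helper name is mine) looks like:

```python
import json

def sse_event(step: str, detail: str) -> str:
    """Serialise one agent_step pipeline event as an SSE data frame.

    SSE frames are UTF-8 text: a "data:" line followed by a blank line.
    """
    payload = {"type": "agent_step", "step": step, "detail": detail}
    return f"data: {json.dumps(payload)}\n\n"
```

Each stage of the pipeline yields one of these frames into the same stream that later carries the generated tokens.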
Data Pipeline
Scraping and Ingestion
The project includes custom scrapers for each data source — fs.blog, Project Gutenberg, and curated quote collections. Each scraper fetches content with rate limiting, cleans and normalises text, performs semantic chunking (splitting by topic, not by character count), and stores chunks with metadata in ChromaDB.
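A stripped-down sketch of the chunking step, in which "topic" boundaries are approximated by paragraph breaks; the real pipeline uses richer topic signals, so treat this as the shape rather than the substance:

```python
def semantic_chunks(text: str, max_words: int = 200) -> list[str]:
    """Greedy paragraph packer: split on blank lines, then pack consecutive
    paragraphs into chunks of at most max_words. A lone oversized paragraph
    still becomes its own chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Chunking on paragraph boundaries keeps each chunk topically coherent, which matters because every chunk becomes one embedded, citable unit.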
Persona Configuration
Each persona is defined as a JSON configuration file containing system prompt, generation parameters, and metadata. The RAG pipeline is persona-agnostic — adding a new thinker requires only data ingestion and a JSON config file.
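An illustrative config, with field names that are my guess at the shape rather than the project's actual schema:

```json
{
  "id": "charlie-munger",
  "name": "Charlie Munger",
  "system_prompt": "You are Charlie Munger. Ground every claim in the provided excerpts and cite them inline.",
  "generation": {
    "temperature": 0.7,
    "max_tokens": 1024
  },
  "metadata": {
    "sources": ["fs.blog"]
  }
}
```

Because the pipeline is persona-agnostic, nothing in the retrieval or routing code needs to change when a file like this is added.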
Technical Decisions and Trade-offs
Why No LangChain?
Three reasons:
- Understanding over abstraction. Building each pipeline stage from scratch forced me to deeply understand what's happening at every level.
- Debuggability. With a from-scratch approach, every failure points directly to my code.
- Minimal dependency surface. The entire backend has four core dependencies: FastAPI, ChromaDB, OpenAI SDK, and rank-bm25.
For a production system with a team and a deadline, I'd absolutely consider LangChain or LlamaIndex. The point here was learning, not shipping fast.
Why ChromaDB?
ChromaDB is the simplest vector database that works. It's a single pip install, stores data locally, and requires zero infrastructure. For production at scale, I'd evaluate Qdrant or Weaviate.
Mock Mode
The entire system can run without an API key. Mock mode provides pre-written responses that exercise the full pipeline path — including the agentic router, SSE streaming, and citation metadata — without making any external calls. The agentic router's rule-based classifier runs for real even in mock mode.
What I'd Do Differently
- Evaluation should come first. I added LLM-as-judge evaluation late. Without evaluation metrics, every change is guesswork.
- Semantic chunking needs more attention. Current chunking is adequate but not great — no overlap between chunks and no handling of cross-paragraph concepts.
- Stream every pipeline stage. The agentic router streams its steps, but the remaining stages report nothing until the LLM starts generating tokens; streaming each stage's status would significantly improve perceived performance.
Conclusion
Building RAG Talk taught me that retrieval quality is the single biggest lever in a RAG system. A good retrieval pipeline with a mediocre LLM outperforms a mediocre pipeline with the best LLM. The agentic routing layer takes this further — by matching the retrieval strategy to the query type, the system handles a much wider range of conversational patterns gracefully.