Feb 2026 · 18 min read

Building an Agentic RAG System from Scratch: Architecture, Hybrid Search, and Adaptive Query Routing

A deep dive into RAG Talk — a conversational AI system with hybrid BM25+vector search, Reciprocal Rank Fusion, cross-encoder reranking, and an intelligent agentic routing layer that adapts retrieval strategy to query intent.

RAG · FastAPI · ChromaDB · LLM · Architecture

Large Language Models are impressive, but they hallucinate. They confidently fabricate quotes, invent citations, and present fiction as fact. For any application where accuracy matters — and that's most of them — you need a way to ground LLM responses in real data.

That's the core premise behind RAG Talk, a conversational AI system I built from scratch that lets users have in-depth conversations with AI personas modelled after historical thinkers — Charlie Munger, Benjamin Franklin, Naval Ravikant, and others. Every response is grounded in the figure's actual writings, speeches, and interviews, with inline citations so users can verify claims.

What started as a straightforward RAG prototype quickly became a study in retrieval engineering: hybrid search, cross-encoder reranking, query decomposition, and ultimately an agentic routing layer that decides how to retrieve based on what the user is asking. This post walks through the architecture, the trade-offs, and the things I'd do differently.


Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                        Frontend (Next.js)                     │
│   Chat UI  ←──  SSE Stream  ──→  Pipeline Metadata Display   │
└──────────────────────┬───────────────────────────────────────┘
                       │ POST /api/chat (SSE)
┌──────────────────────▼───────────────────────────────────────┐
│                    FastAPI Backend                             │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │              Agentic Query Router                        │ │
│  │  classify_query() → GREETING | SIMPLE | COMPLEX | FOLLOW│ │
│  └────────┬────────────┬──────────┬────────────┬───────────┘ │
│           │            │          │            │              │
│     Direct LLM    Single-pass  Multi-step   Context-aware    │
│    (no retrieval)    RAG       Decompose     Coreference     │
│           │            │          │            │              │
│  ┌────────▼────────────▼──────────▼────────────▼───────────┐ │
│  │              Retrieval Pipeline                          │ │
│  │  Query Rewrite → Hybrid Search → RRF Fusion → Reranking │ │
│  └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Tech stack: Next.js 16 frontend, FastAPI backend, ChromaDB for vector storage, OpenRouter for LLM access, BM25 for sparse search, custom scrapers for data ingestion.


Deep Dive: The Retrieval Pipeline

Why Hybrid Search?

Early in development, I tested pure embedding search and immediately noticed a pattern: it was great at capturing semantic similarity but terrible at exact-match lookups. Ask "What did Munger say about Costco?" and embedding search might surface passages about retail economics that are semantically related but don't actually mention Costco.

BM25 (Best Matching 25) solves this because it operates on term frequency — if the user says "Costco", BM25 will surface documents that literally contain "Costco". The problem is that BM25 misses paraphrases and conceptual matches entirely.
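To make the exact-match behaviour concrete, here is a compact Okapi BM25 scorer in pure Python. This is an illustrative sketch (RAG Talk uses the rank-bm25 package rather than a hand-rolled scorer):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenised doc against the query with classic Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequency
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "munger praised costco for its low cost structure",
    "discount retail economics reward scale and loyalty",
    "berkshire annual meeting remarks on moats",
]
scores = bm25_scores("costco".split(), [d.split() for d in docs])
```

Only the document that literally contains "costco" gets a non-zero score — exactly the behaviour that pure embedding search lacks.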

The solution: run both in parallel and merge.

Reciprocal Rank Fusion

The merging strategy matters. You can't simply combine raw scores because BM25 scores and cosine similarity scores are on completely different scales. Reciprocal Rank Fusion (RRF) sidesteps this entirely by working on ranks rather than scores:

RRF_score(d) = Σ 1 / (k + rank_i(d))    for each retrieval method i

The constant k (set to 60 in our config) controls how much weight rank position carries. A document ranked #1 by both methods gets a much higher fused score than one ranked #1 by one method and #20 by the other.

This is elegantly simple — no score normalisation, no learned weights, no hyperparameter tuning. It just works, and it works well. The implementation is ~30 lines of Python.
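A minimal sketch of the fusion step (names here are illustrative, not RAG Talk's exact code):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion.

    rankings: a list of ranked doc-id lists, one per retrieval method
    (e.g. [bm25_ranking, dense_ranking]). Returns doc ids sorted by
    fused score; only ranks matter, raw scores never enter the formula.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d5", "d3", "d9"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Note that d1 and d3, the only documents ranked highly by both methods, float to the top of the fused list.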

Cross-Encoder Reranking

After RRF fusion, we have roughly 20 candidate documents. We need to select the top 5. A naive approach would take the top-5 RRF scores, but we can do better.

A cross-encoder jointly processes the (query, document) pair and produces a single relevance score. Unlike bi-encoder embeddings (which encode query and document independently), cross-encoders capture the interaction between query terms and document content.

I implemented the reranker as an LLM call — the same LLM we use for generation also serves as the relevance judge. This avoids deploying a separate cross-encoder model.
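The relevance-judge call might look like the following sketch, with the LLM injected as a plain `prompt -> str` callable (the function name and prompt wording are illustrative, not the project's actual code; in RAG Talk the callable would wrap an OpenRouter chat request):

```python
def rerank(query, candidates, llm, top_n=5):
    """Ask the LLM to score each (query, passage) pair, keep the best top_n."""
    scored = []
    for doc in candidates:
        prompt = (
            "Rate 0-10 how well this passage answers the query.\n"
            f"Query: {query}\nPassage: {doc}\n"
            "Answer with a number only."
        )
        try:
            score = float(llm(prompt).strip())
        except ValueError:
            score = 0.0  # unparseable judgement -> rank last
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

Because the query and passage appear in the same prompt, the judge sees their interaction directly — the property that distinguishes cross-encoding from independent bi-encoder embeddings.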

Query Rewriting

Conversational queries are often poor search queries. "What did he think about that?" is a valid conversational utterance but a terrible search query. The query rewriter transforms conversational language into retrieval-optimised form:

  • "What did he think about that?" → "Charlie Munger views on technology investing"
  • "Tell me more" → "Charlie Munger additional perspectives on mental models"

This is a single LLM call. Cost is minimal and the retrieval quality improvement is significant.
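The call itself is small; a sketch under assumed names (the prompt and signature are illustrative), again with the LLM injected as a callable:

```python
def rewrite_query(query, history, persona, llm):
    """One LLM call turning a conversational message into a standalone
    retrieval query. `llm` is any prompt -> str callable."""
    prompt = (
        f"Rewrite the user's message as a standalone search query about {persona}.\n"
        f"Recent conversation: {history[-3:]}\n"
        f"Message: {query}\n"
        "Search query:"
    )
    return llm(prompt).strip()
```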


Deep Dive: Agentic RAG Router

This is the feature I'm most proud of, and the one that elevates the project from "standard RAG demo" to something architecturally interesting.

The Problem

Standard RAG treats every query identically: retrieve K documents, build context, generate. But not every query needs retrieval:

  • "Hello, how are you?" — retrieval is wasteful and can actually hurt the response
  • "What's the key takeaway?" — needs the last few turns of conversation context, not fresh retrieval
  • "Compare Munger's and Buffett's views on tech" — needs multiple targeted retrievals, not one

The Solution: Adaptive Query Classification

The agentic router analyses each query and routes it to the optimal strategy:

```python
from enum import Enum

class QueryType(str, Enum):
    GREETING = "greeting"      # No retrieval
    SIMPLE = "simple"          # Standard single-pass RAG
    COMPLEX = "complex"        # Multi-step decomposition
    FOLLOWUP = "followup"      # Context-aware with coreference resolution
```

Classification uses regex-based heuristics in rule-based mode (for fast, deterministic behaviour in demos and testing) and can be swapped for an LLM-based classifier in production.
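A rule-based classifier along these lines can be surprisingly effective. The patterns below are illustrative rather than the project's exact regexes, and they return the QueryType values as plain strings for brevity:

```python
import re

def classify_query(query):
    """Heuristic query classification: greeting > followup > complex > simple.
    Order matters — pronoun cues are checked before comparison cues so that
    follow-ups aren't misrouted into decomposition."""
    q = query.strip().lower()
    if re.match(r"^(hi|hello|hey|how are you)\b", q):
        return "greeting"
    if re.search(r"\b(he|she|it|that|more)\b", q):
        return "followup"
    if re.search(r"\b(compare|versus|vs\.?|difference between)\b", q):
        return "complex"
    return "simple"
```

The appeal of this mode is determinism: the same query always takes the same path, which makes demos and tests reproducible in a way an LLM classifier is not.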

Multi-Step Decomposition

When a query is classified as COMPLEX, the router decomposes it into sub-queries:

"Compare Munger and Buffett's views on tech investing"
  └→ Sub-query 1: "What are Munger's views on tech investing?"
  └→ Sub-query 2: "What are Buffett's views on tech investing?"

Each sub-query runs through the full retrieval pipeline independently. Results are deduplicated and merged before generation. The LLM receives a richer, more targeted context than a single retrieval pass would provide.
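The merge step is the interesting part; a sketch with the retrieval pipeline injected as a callable (names are assumptions, not RAG Talk's actual interfaces):

```python
def decompose_and_retrieve(sub_queries, retrieve, per_query_k=5):
    """Run each sub-query through the retrieval pipeline (`retrieve` is a
    callable returning [(doc_id, text), ...]) and merge the results,
    deduplicating by doc id while preserving first-seen order."""
    seen, merged = set(), []
    for sq in sub_queries:
        for doc_id, text in retrieve(sq)[:per_query_k]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, text))
    return merged
```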

Coreference Resolution

Follow-up queries often use pronouns: "What else did he say about that?" The router resolves these references by scanning recent conversation history for likely referents:

User: "Tell me about Munger's views on incentives"
Assistant: "Munger believed that incentives are the most powerful..."
User: "What else did he say about that?"
  └→ Resolved: "What else did Charlie Munger say about incentives?"
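A deliberately naive sketch of this resolution step (the heuristics and names are illustrative; the real router scans history more carefully):

```python
import re

def resolve_followup(query, history, persona):
    """Cheap coreference resolution: pronouns become the persona's name,
    and 'that' is expanded with the most recent user topic."""
    resolved = re.sub(r"\b(he|she|they)\b", persona, query, flags=re.IGNORECASE)
    if re.search(r"\bthat\b", resolved, flags=re.IGNORECASE) and history:
        # Naive referent: last user turn minus its question scaffolding.
        topic = re.sub(
            r"^(tell me about|what about)\s+", "",
            history[-1].rstrip("?"), flags=re.IGNORECASE,
        )
        resolved = re.sub(r"\bthat\b", topic, resolved, flags=re.IGNORECASE)
    return resolved
```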

Pipeline Transparency

A key design decision: the agentic pipeline streams its reasoning alongside the response. The client receives SSE events like:

```json
{"type": "agent_step", "step": "routing", "detail": "Query classified as complex"}
{"type": "agent_step", "step": "strategy", "detail": "Multi-step decomposition (2 sub-queries)"}
{"type": "agent_step", "step": "sub_query", "detail": "Retrieving: What are Munger's views on tech?"}
```

This serves two purposes: UX (users see what the system is doing, which builds trust) and debugging (pipeline behaviour is observable without log-diving).
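The frames themselves are trivial to produce. A sketch of the formatting helper (assumed names; in the FastAPI backend these strings would be yielded from an async generator wrapped in a `StreamingResponse` with `media_type="text/event-stream"`):

```python
import json

def sse_event(payload):
    """Format one Server-Sent Events frame: a 'data:' line plus a blank
    line terminating the event."""
    return f"data: {json.dumps(payload)}\n\n"

def agent_step(step, detail):
    """Emit a pipeline-transparency event like the ones shown above."""
    return sse_event({"type": "agent_step", "step": step, "detail": detail})
```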


Data Pipeline

Scraping and Ingestion

The project includes custom scrapers for each data source — fs.blog, Project Gutenberg, and curated quote collections. Each scraper fetches content with rate limiting, cleans and normalises text, performs semantic chunking (splitting by topic, not by character count), and stores chunks with metadata in ChromaDB.

Persona Configuration

Each persona is defined as a JSON configuration file containing system prompt, generation parameters, and metadata. The RAG pipeline is persona-agnostic — adding a new thinker requires only data ingestion and a JSON config file.
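A persona config might look like the following. The field names here are illustrative — the actual schema isn't shown in this post:

```json
{
  "name": "Charlie Munger",
  "system_prompt": "You are Charlie Munger. Ground every claim in the provided excerpts and cite them inline.",
  "generation": {"temperature": 0.7, "max_tokens": 1024},
  "collection": "munger"
}
```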


Technical Decisions and Trade-offs

Why No LangChain?

Three reasons:

  1. Understanding over abstraction. Building each pipeline stage from scratch forced me to deeply understand what's happening at every level.
  2. Debuggability. With a from-scratch approach, every failure points directly to my code.
  3. Minimal dependency surface. The entire backend has four core dependencies: FastAPI, ChromaDB, OpenAI SDK, and rank-bm25.

For a production system with a team and a deadline, I'd absolutely consider LangChain or LlamaIndex. The point here was learning, not shipping fast.

Why ChromaDB?

ChromaDB is the simplest vector database that works. It's a single pip install, stores data locally, and requires zero infrastructure. For production at scale, I'd evaluate Qdrant or Weaviate.

Mock Mode

The entire system can run without an API key. Mock mode provides pre-written responses that exercise the full pipeline path — including the agentic router, SSE streaming, and citation metadata — without making any external calls. The agentic router's rule-based classifier runs for real even in mock mode.


What I'd Do Differently

  1. Evaluation should come first. I added LLM-as-judge evaluation late. Without evaluation metrics, every change is guesswork.
  2. Semantic chunking needs more attention. Current chunking is adequate but not great — no overlap between chunks and no handling of cross-paragraph concepts.
  3. Stream pipeline stages. Currently, the user sees nothing until the LLM starts generating tokens. Streaming each stage's status would significantly improve perceived performance.

Conclusion

Building RAG Talk taught me that retrieval quality is the single biggest lever in a RAG system. A good retrieval pipeline with a mediocre LLM outperforms a mediocre pipeline with the best LLM. The agentic routing layer takes this further — by matching the retrieval strategy to the query type, the system handles a much wider range of conversational patterns gracefully.

Source code on GitHub