AI Engineering

PM Prep AI: A RAG System That Actually Adapts

A retrieval-augmented system for PM interview prep with adaptive query routing, hybrid retrieval with Reciprocal Rank Fusion, and semantic caching.


The Problem

Most RAG tutorials teach you the same three steps: embed your documents, retrieve the top-K most similar chunks for any query, stuff them into a prompt. It works for demos. It falls apart the moment you ask it real questions.

PMs ask wildly different kinds of questions. "What is the RICE framework?" is nothing like "Compare RICE vs ICE", which is nothing like "How would you prioritize features for a two-sided marketplace?" A single retrieval strategy cannot serve all three well. The first needs a tight, precise answer. The second needs balanced coverage of two entities. The third needs both frameworks and applied examples.

So I built PM Prep AI, a RAG system that classifies each query and routes it through a different retrieval strategy. That is the difference between a system that does retrieval and one that thinks about how to retrieve.

Architecture at a Glance

The system is a LangGraph state machine sitting behind a FastAPI backend. The pipeline has four meaningful stages: cache check, query classification, adaptive retrieval, and generation. Qdrant handles vector storage, Redis handles semantic caching, and Gemini Flash powers both the classifier and the generator.
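To make the four stages concrete, here is a minimal sketch of the control flow in plain Python. It is not the project's LangGraph code: the function name run_pipeline, the PipelineResult type, and the injected callables are all hypothetical, chosen only so that the order of the stages and the cache short-circuit are easy to see.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    answer: str
    query_type: Optional[str]  # None when the answer came from the cache
    from_cache: bool


def run_pipeline(
    query: str,
    cache_lookup: Callable[[str], Optional[str]],
    classify: Callable[[str], str],
    retrieve: Callable[[str, str], list[str]],
    generate: Callable[[str, list[str]], str],
    cache_store: Callable[[str, str], None],
) -> PipelineResult:
    # Stage 1: cache check. A hit skips the remaining three stages.
    cached = cache_lookup(query)
    if cached is not None:
        return PipelineResult(cached, None, True)

    # Stage 2: query classification decides the retrieval strategy.
    query_type = classify(query)

    # Stage 3: adaptive retrieval, parameterized by the query type.
    chunks = retrieve(query, query_type)

    # Stage 4: generation, then store the answer for future queries.
    answer = generate(query, chunks)
    cache_store(query, answer)
    return PipelineResult(answer, query_type, False)
```

Passing the stages in as callables keeps the sketch independent of any particular vector store, cache, or model client.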

What Makes This Interesting

1. Query Routing - Six Strategies, One Classifier

Before retrieval even starts, the query goes through a classifier (Gemini Flash, temperature=0 for determinism) that assigns it to one of six types: framework_lookup, opinion_query, comparison, interview_scenario, general, or followup. Each type triggers a completely different retrieval strategy.

A framework_lookup query gets semantic search filtered to concept-type chunks, returning 3-5 results. Precise, tight, cheap.
A comparison query parses the entities being compared and runs separate retrieval passes for each, then merges results - 10-15 chunks total so both sides are represented.
An interview_scenario query runs a two-step chain: first retrieve relevant frameworks, then use those to search for concrete examples - 15-20 chunks.

The point is adaptive context sizing. Precision queries get tight context windows. Synthesis queries get wide ones. A fixed top-K either over-fetches for simple questions or under-fetches for complex ones.
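One way to picture the routing table is as a lookup from query type to a retrieval plan. The sketch below is illustrative only: the strategy labels, the RetrievalPlan type, and the DEFAULT_PLAN values are assumptions (the three chunk ranges are the ones quoted above), and the round-robin merge for comparison queries is one plausible way to keep both entities represented, not necessarily how the project merges results.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class RetrievalPlan:
    strategy: str    # hypothetical label for the retrieval routine to run
    min_chunks: int
    max_chunks: int


# Chunk ranges for the three strategies described above.
PLANS = {
    "framework_lookup": RetrievalPlan("filtered_semantic", 3, 5),
    "comparison": RetrievalPlan("per_entity_merge", 10, 15),
    "interview_scenario": RetrievalPlan("two_step_chain", 15, 20),
}

# Assumed fallback for query types whose settings are not given here.
DEFAULT_PLAN = RetrievalPlan("hybrid", 5, 10)


def plan_for(query_type: str) -> RetrievalPlan:
    return PLANS.get(query_type, DEFAULT_PLAN)


def merge_per_entity(results_by_entity: dict[str, list[str]], limit: int) -> list[str]:
    """Interleave per-entity result lists round-robin, dropping duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    # zip_longest walks rank 1 of every entity, then rank 2, and so on.
    for tier in zip_longest(*results_by_entity.values()):
        for chunk_id in tier:
            if chunk_id is not None and chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged[:limit]
```

Interleaving by rank means that truncating to the limit cuts the weakest results from each entity rather than dropping one entity entirely.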

2. Hybrid Retrieval with Reciprocal Rank Fusion

Vector search alone misses exact terms. BM25 alone misses semantic matches. Neither is sufficient on its own.

Every query runs through both: a vector search via Qdrant (top 20 by cosine similarity) and a BM25 keyword search via rank_bm25 (top 20 by BM25 score). The two ranked lists get merged using Reciprocal Rank Fusion with k=60.

RRF only cares about rank position, not raw scores, so the cosine and BM25 scores never need to be normalized against each other. Each document scores the sum of 1/(k + rank) over the lists it appears in, which gives a strong boost to results that appear in both lists. The constant k=60 is the commonly used default and rarely needs tuning. Then an optional reranking pass with a Cohere cross-encoder provides a more accurate relevance score as a second-stage filter. If Cohere is not configured, the system falls back to RRF ordering gracefully.
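The fusion step is small enough to show in full. This is a minimal sketch of standard Reciprocal Rank Fusion, not the project's code: the function name and the tie-break on document ID are my own choices, and the inputs are assumed to be lists of chunk IDs already sorted best-first.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one list using RRF."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # e.g. from the vector search
bm25_hits = ["c", "a", "d"]    # e.g. from the keyword search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

In this example "a" and "c" appear in both lists, so they outrank "b" and "d", which appear in only one.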

3. Semantic Caching

Should "Compare RICE vs ICE" and "How do RICE and ICE differ?" be cache hits for each other? Obviously yes. But as cache keys the two strings are different, so any exact-match or hash-based cache treats them as completely different queries.

The solution: embed the query, store the embedding alongside the answer in Redis, and on every new query compare its embedding to cached ones using cosine similarity. If similarity exceeds 0.95, return the cached answer. Entries expire after 24 hours, and the cache is bounded to the 500 most recent entries.
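The lookup logic can be sketched without Redis. The version below is an in-memory stand-in under stated assumptions: the SemanticCache class, its method names, and the injectable clock are hypothetical, and a linear scan over entries is used because the cache is capped at 500. Only the 0.95 threshold, the 24-hour TTL, and the 500-entry bound come from the description above.

```python
import math
import time
from typing import Callable, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 24 * 3600,
                 max_entries: int = 500, clock: Callable[[], float] = time.time):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._clock = clock
        # Oldest first: (embedding, answer, stored_at).
        self._entries: list[tuple[tuple[float, ...], str, float]] = []

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        now = self._clock()
        # Drop entries older than the TTL before searching.
        self._entries = [e for e in self._entries if now - e[2] <= self.ttl_seconds]
        best_answer, best_sim = None, self.threshold
        for cached_embedding, answer, _ in self._entries:
            sim = cosine_similarity(embedding, cached_embedding)
            if sim > best_sim:  # must exceed the threshold, and beat earlier hits
                best_answer, best_sim = answer, sim
        return best_answer

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self._entries.append((tuple(embedding), answer, self._clock()))
        # Keep only the most recent max_entries items.
        while len(self._entries) > self.max_entries:
            self._entries.pop(0)
```

The embedding itself is passed in by the caller, so the sketch does not depend on any particular embedding model.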

Evaluation - The Part Most Portfolio Projects Skip

There is a 50-question evaluation dataset spanning all six query types, run through RAGAS. It scores four metrics: faithfulness (does the answer hallucinate?), answer relevancy (does it address the question?), context precision (are retrieved chunks relevant?), and context recall (do they cover the reference answer?).

I also track routing accuracy separately. Routing errors cascade: a misclassified query gets the wrong retrieval strategy, which corrupts everything downstream. It is the single most important failure point to monitor, and it is invisible in standard RAG metrics. The eval suite runs in ~15 minutes and caught multiple regressions during iteration.
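Routing accuracy is simple to compute once each evaluation question carries an expected query type. The sketch below is an assumption about how such a check could look, not the project's eval code; the function name and the example labels are made up for illustration.

```python
from collections import Counter


def routing_accuracy(expected: list[str], predicted: list[str]) -> tuple[float, dict[str, float]]:
    """Return overall accuracy and per-type accuracy of the classifier."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    totals = Counter(expected)
    correct = Counter(e for e, p in zip(expected, predicted) if e == p)
    # Per-type accuracy shows which query types the classifier confuses.
    per_type = {query_type: correct[query_type] / n for query_type, n in totals.items()}
    overall = sum(correct.values()) / len(expected) if expected else 0.0
    return overall, per_type


expected_types = ["comparison", "comparison", "general", "followup"]
predicted_types = ["comparison", "general", "general", "followup"]
overall, per_type = routing_accuracy(expected_types, predicted_types)
```

Here one of the two comparison questions was routed as general, so overall accuracy is 0.75 and comparison accuracy is 0.5. The per-type breakdown matters because an overall number can hide a type that is misrouted most of the time.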

Decisions I Would Defend

Why not GraphRAG? PM knowledge is genuinely relational. A knowledge graph would handle relational queries far better than my multi-retrieval workaround. But building quality graphs requires manual curation or expensive LLM-based entity extraction. The multi-retrieval approach is explicitly an approximation of what graph traversal would handle natively. GraphRAG is the number one item on my future improvements list.
Why Gemini? A single API key covers both embeddings and generation, the free tier is generous, and the Flash model's speed and cost are hard to beat for high-volume classification calls.
Why BM25 in-process instead of Elasticsearch? Elasticsearch would be another infrastructure dependency for marginal gains on a small corpus. rank_bm25 handles the 10-source corpus instantly in Python memory.

The Key Lesson

The real lesson was not about any single technique - it was that the gap between "RAG works" and "RAG works well" is almost entirely about adapting retrieval to the question being asked. Most RAG failures are retrieval failures masquerading as model failures. You retrieve the wrong chunks, generate the wrong answer, and blame the LLM.

Building this convinced me that the routing layer - deciding how to retrieve before retrieving - is the highest-leverage place to spend engineering effort in any RAG system.
