
RAG Strategies


This is part of the AI Agents series. All code is at github.com/achintmehta/langchain.

What is RAG?

Retrieval-Augmented Generation (RAG) is the technique of giving an LLM access to relevant documents at the time it generates an answer, rather than relying on knowledge baked into its weights. You retrieve relevant chunks from your vector database, include them in the prompt, and ask the model to answer based on them.

The advantages are significant: the model can answer questions about private data it was never trained on, it can cite sources, and you can update the knowledge base without retraining anything. The disadvantages are that answer quality is only as good as retrieval quality, and the retrieved chunks consume context window space.

Naive RAG, the baseline

The simplest possible pipeline:

1. Embed the user's question.
2. Do a similarity search against your vector database to get the top-k chunks.
3. Put those chunks + the question into a prompt.
4. Send to the LLM and return the answer.

In LangChain:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

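# Assumes `db` (a vector store) and `llm` (a chat model) are already defined.
retriever = db.as_retriever(search_kwargs={"k": 4})  # top 4 chunks per query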

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant. Answer the question using only
the context below. If the answer is not in the context, say so.

Context:
{context}"""),
    ("human", "{question}")
])

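def format_docs(docs):
    # Join the retrieved chunks into one context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieve context and pass the question through, then prompt, model, parse.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}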
    | rag_prompt
    | llm
    | StrOutputParser()
)

answer = chain.invoke("What is the capital of France?")
print(answer)

RunnablePassthrough() is a LangChain component that passes its input through unchanged. Here it routes the raw question to the prompt's {question} slot while the retriever | format_docs branch fills the {context} slot. LangChain coerces the dict into a RunnableParallel, so both branches receive the same input and run in parallel before the prompt is assembled.

This naive baseline works well for simple, precise queries against focused document sets. Its weaknesses show when the query is ambiguous, when the answer spans multiple chunks, or when vocabulary in the query doesn't match the phrasing in the documents.

RAG Fusion, multiple queries, merged results

One of the most impactful improvements over naive RAG is to generate multiple versions of the query before retrieval. Different phrasings retrieve different relevant chunks; combining them gives better coverage.

Reciprocal Rank Fusion (RRF) is the standard way to merge ranked lists from multiple queries. A chunk that appears in the top-5 results for three different queries gets a higher fused score than one that only appears in one.
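
For reference, the usual formulation of the fused score is given here, where Q is the set of generated queries, rank_q(d) is the 1-based position of chunk d in the results for query q, and k is a smoothing constant (60 is a common default):

```latex
\mathrm{RRF}(d) = \sum_{q \in Q} \frac{1}{k + \operatorname{rank}_q(d)}
```

A chunk that does not appear in a query's results contributes nothing for that query. With k = 60, a chunk ranked 5th for three queries scores 3/65 ≈ 0.046, while a chunk ranked 1st for only one query scores 1/61 ≈ 0.016.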

from langchain_core.prompts import ChatPromptTemplate

multi_query_prompt = ChatPromptTemplate.from_template(
    """Generate five different versions of the following question.
Each version should approach the same information need from a different angle.
Output only the five questions, one per line.

Original question: {question}"""
)

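# Fuse the ranked result lists (one per generated query) into a single ranking.
# Only the fusion step is shown; query generation and retrieval are up to the caller.
def reciprocal_rank_fusion(results_list, k=60):
    fused_scores = {}
    for results in results_list:
        # Ranks start at 1, so the top hit in a list contributes 1 / (k + 1).
        for rank, doc in enumerate(results, start=1):
            key = doc.page_content  # deduplicate on chunk text
            if key not in fused_scores:
                fused_scores[key] = {"doc": doc, "score": 0}
            fused_scores[key]["score"] += 1 / (rank + k)
    # Highest fused score first.
    return sorted(fused_scores.values(), key=lambda x: x["score"], reverse=True)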
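
A minimal sketch of how the whole flow might be wired together, kept offline so it runs without a vector store or API key. `Doc`, `fake_generate_queries`, and `fake_retrieve` are hypothetical stand-ins: in a real pipeline they would roughly correspond to LangChain's Document, the output of multi_query_prompt | llm | StrOutputParser() split on newlines, and retriever.invoke. The block carries its own compact copy of the fusion step (ranks counted from 1) so that it runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def rrf(results_list, k=60):
    """Compact reciprocal rank fusion, with ranks counted from 1."""
    fused = {}
    for results in results_list:
        for rank, doc in enumerate(results, start=1):
            entry = fused.setdefault(doc.page_content, {"doc": doc, "score": 0.0})
            entry["score"] += 1 / (k + rank)
    return sorted(fused.values(), key=lambda x: x["score"], reverse=True)


def fuse_for_question(question, generate_queries, retrieve, top_n=3):
    """Generate query variants, retrieve for each, and fuse the ranked lists."""
    queries = [q.strip() for q in generate_queries(question) if q.strip()]
    ranked_lists = [retrieve(q) for q in queries]
    return [item["doc"] for item in rrf(ranked_lists)[:top_n]]


# Hypothetical data and stubs so the sketch runs offline.
CORPUS = {
    "reset": Doc("To reset your password, open Settings > Security."),
    "lockout": Doc("Accounts lock after five failed sign-in attempts."),
    "billing": Doc("Invoices are emailed on the first of each month."),
}


def fake_generate_queries(question):
    # A real version would ask the LLM for rephrasings of `question`.
    return ["how do I reset my password", "forgot password", "locked out of account"]


def fake_retrieve(query):
    # Canned rankings; a real retriever would run a similarity search.
    rankings = {
        "how do I reset my password": ["reset", "lockout"],
        "forgot password": ["reset", "billing"],
        "locked out of account": ["lockout", "reset"],
    }
    return [CORPUS[name] for name in rankings[query]]


top_docs = fuse_for_question("I can't log in", fake_generate_queries, fake_retrieve)
for doc in top_docs:
    print(doc.page_content)
```

The password-reset chunk appears in all three result lists, so it ends up first even though the lockout chunk tops one of the lists.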

RAG Fusion is particularly useful for questions that have multiple valid phrasings. This is common in customer support and knowledge bases, where users ask the same question in many different ways.

RAPTOR, recursive summaries for long documents

RAPTOR addresses a structural problem: with naive RAG you can retrieve individual chunks, but you lose the big picture. A user asking "summarise the main conclusions of this report" will get back detail-level chunks rather than a high-level overview.

RAPTOR builds a tree of summaries over your corpus. The leaves are your original chunks. Similar chunks are clustered by their embeddings, and an LLM summarises each cluster; those summaries become new nodes one level up. The process repeats until you have a single root summary.

At retrieval time, you can search at any level of the tree. A high-level question retrieves from the upper tree; a specific factual question retrieves from the leaves. Retrieval can also start at a high level and descend.
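
The tree-building loop can be sketched in a few lines. This is an illustration under assumptions, not the RAPTOR reference implementation: `cluster_in_pairs` and `fake_summarize` are hypothetical stand-ins for embedding-based clustering and an LLM summarisation call, and the sample chunks are made up.

```python
def build_summary_tree(chunks, cluster, summarize):
    """Return a list of levels: levels[0] holds the leaf chunks, levels[-1] the root."""
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        clusters = cluster(levels[-1])
        parents = [summarize(group) for group in clusters]
        if len(parents) >= len(levels[-1]):
            # Guard: stop if clustering made no progress, to avoid looping forever.
            break
        levels.append(parents)
    return levels


def cluster_in_pairs(nodes):
    """Hypothetical stand-in: groups neighbours two at a time.

    A real pipeline would cluster on embedding similarity instead.
    """
    return [nodes[i:i + 2] for i in range(0, len(nodes), 2)]


def fake_summarize(group):
    """Hypothetical stand-in for an LLM summarisation call."""
    return "Summary of: " + " | ".join(text[:30] for text in group)


chunks = [
    "Q1 revenue grew 12 percent.",
    "Q1 costs rose 3 percent.",
    "Hiring slowed in engineering.",
    "Attrition fell in sales.",
    "The board approved a buyback.",
]
levels = build_summary_tree(chunks, cluster_in_pairs, fake_summarize)
for depth, nodes in enumerate(levels):
    print(f"level {depth}: {len(nodes)} node(s)")

# One retrieval option: index every node from every level together,
# so a single search can return either summaries or leaves.
all_nodes = [node for level in levels for node in level]
```

With five leaves and pairwise grouping, the levels shrink from 5 to 3 to 2 to 1 nodes. Searching the flattened `all_nodes` list is one way to let broad questions match summaries and narrow questions match leaves.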

This works best for long, structured documents (technical manuals, research papers, legal agreements) where both high-level and detailed queries need to be supported.

GraphRAG, knowledge graphs for multi-hop questions

Standard vector similarity search finds chunks that are locally similar to the query. It struggles when the answer requires combining information from multiple places in the corpus that are not individually similar to the query. This is the "multi-hop" problem.

GraphRAG adds a knowledge graph layer. During indexing, an LLM extracts entities (people, products, concepts) and relationships (works_for, depends_on, contradicts) from chunks, building a graph. At query time, vector search finds a starting set of nodes, and graph traversal expands outward along edges to collect related context.

This makes GraphRAG much stronger for questions like "Which engineers worked on the project that uses Library X?", a question that requires hopping through multiple document relationships.
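
A toy sketch of the traversal step for that example question. The triples, names, and the `works_on` and `uses` relation labels are hypothetical; in a real system an LLM would extract them during indexing and vector search would pick the seed nodes.

```python
from collections import defaultdict, deque

# Hypothetical triples an LLM might extract during indexing.
TRIPLES = [
    ("Project Atlas", "uses", "Library X"),
    ("Project Borealis", "uses", "Library Y"),
    ("Priya", "works_on", "Project Atlas"),
    ("Marco", "works_on", "Project Atlas"),
    ("Dana", "works_on", "Project Borealis"),
]


def build_graph(triples):
    """Undirected adjacency list: node -> list of (relation, neighbour)."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
        graph[obj].append((relation, subject))
    return graph


def expand(graph, start_nodes, max_hops=2):
    """Breadth-first expansion from the seed nodes, up to max_hops edges away.

    Returns a dict mapping each reached node to its hop distance.
    """
    seen = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for _relation, neighbour in graph[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return seen


graph = build_graph(TRIPLES)
# Assume vector search matched the query to the "Library X" node.
reached = expand(graph, ["Library X"], max_hops=2)
engineers = sorted(
    node for node, hops in reached.items()
    if hops == 2 and any(rel == "works_on" for rel, _ in graph[node])
)
print(engineers)
```

Neither engineer's chunk mentions Library X directly; they are reached only by hopping through the project node, which is the case plain similarity search tends to miss.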

Agentic RAG, retrieval inside an agent loop

Rather than running retrieval as a fixed pipeline step, agentic RAG puts retrieval inside an agent loop, which in this series is built with LangGraph. The agent can:

- Try a retrieval, decide the results are insufficient, and retry with a rephrased query.
- Decompose a complex question into sub-questions, retrieve for each, and synthesise.
- Fall back to a web search if the document store doesn't have the answer.
- Combine document retrieval with SQL queries or API calls when the question spans both.

This is covered in more depth in the LangGraph and agents part of this series.

Self-RAG and CRAG, evaluating retrieval quality

Even with good chunking and embedding, retrieval sometimes returns chunks that are not actually relevant to the question. Self-RAG and CRAG (Corrective RAG) add an evaluation step that scores retrieved chunks for relevance before they are sent to the LLM.

If the scores are low, the pipeline can re-retrieve with a different query, fetch from a different source, or fall back to a parametric answer (the model's own knowledge). This self-correction loop can noticeably improve answer quality on questions where naive retrieval would return poor context.

In LangGraph this is straightforward to implement as a conditional edge: after the retrieval node, route to a grader node; if the grade is below a threshold, loop back to re-query.
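
A minimal sketch of the grading and routing decision, written as plain Python so it runs on its own. The keyword-overlap grader is a hypothetical stand-in for an LLM relevance grader, and the threshold of 0.5 and the step names `generate` and `rewrite_query` are assumptions chosen for illustration.

```python
def keyword_overlap_grade(question, doc):
    """Hypothetical stand-in for an LLM grader.

    Returns the share of question words that also appear in the chunk.
    """
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def grade_and_route(question, docs, grade, threshold=0.5):
    """Keep chunks whose grade clears the threshold and pick the next step."""
    kept = [doc for doc in docs if grade(question, doc) >= threshold]
    next_step = "generate" if kept else "rewrite_query"
    return kept, next_step


question = "refund policy for annual plans"
docs = [
    "annual plans have a 30 day refund policy",
    "our office is closed on public holidays",
]

kept, next_step = grade_and_route(question, docs, keyword_overlap_grade)
print(kept, next_step)

# With only the irrelevant chunk, nothing survives and the query is rewritten.
kept_none, retry_step = grade_and_route(question, docs[1:], keyword_overlap_grade)
print(kept_none, retry_step)
```

In a LangGraph graph, the returned step name is the kind of value a routing function passed to add_conditional_edges could return to choose between the generation node and a re-query node.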

HyDE, hypothetical document embeddings

An interesting observation: embedding a question and embedding an answer to that question produce vectors that are not always close together, because the question is phrased as a question and the answer is phrased as a statement. HyDE works around this gap.

Instead of embedding the raw question, you ask the LLM to write a hypothetical answer: a plausible document that would answer the question, even if the facts in it might be wrong. You embed that hypothetical answer and use it for retrieval. A plausible answer tends to be closer in embedding space to the real answer chunks than the bare question is.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Assumes `llm`, `retriever`, and `user_question` are already defined.
hyde_prompt = ChatPromptTemplate.from_template(
    """Write a short paragraph that would answer the following question.
It does not have to be factually correct, just plausible and relevant.
Question: {question}"""
)

# Generate a hypothetical answer as plain text
hyde_chain = hyde_prompt | llm | StrOutputParser()
hypothetical_doc = hyde_chain.invoke({"question": user_question})

# Use hypothetical_doc as the retrieval query instead of the raw question;
# the retriever embeds it and searches for nearby chunks.
results = retriever.invoke(hypothetical_doc)

HyDE works best when the vocabulary of the query and the documents differs sharply, for example when searching a formal technical corpus with informal user language.

What's next

Better retrieval starts with better queries. The next part covers query transformation techniques (rewriting, multi-query, logical routing, and more) that you can apply before retrieval to improve what comes back.