May 8, 2025
Level Up Your GenAI Apps: Overview of Advanced RAG Techniques
Maria Khalusova
RAG
This is the second post in our series on advanced Retrieval-Augmented Generation (RAG) techniques. In the first post of this series, we explored how naive RAG pipelines, relying only on basic chunking and semantic search, can fall short when it comes to complex queries, ambiguous terms, or messy real-world data. We also argued that data preprocessing isn't just a cleanup step, but the foundation of any successful RAG implementation.
In this next part, we turn our attention to retrieval itself. When the right information isn't retrieved, or is buried among irrelevant noise, no amount of clever prompting will save your RAG. From hybrid retrieval to semantic reranking and beyond, we'll go over the techniques that help you retrieve not just similar content, but relevant content.
Finding the Needle: Smarter Retrieval & Ranking
Even with a well-formed query, initial retrieval, typically via vector similarity, can return irrelevant results or miss the most useful documents. The following techniques aim to improve both recall and precision, ensuring the right context reaches the generator LLM.
Re-ranking
Re-ranking introduces a second pass after the initial retrieval. First, the similarity search retrieves a large candidate set, say the top 100 chunks. Then a cross-encoder model or an LLM re-evaluates and re-scores those candidates based on their relevance to the query. Cross-encoders and LLMs outperform embedding-based similarity search because they evaluate relevance by jointly processing both the query and the candidate documents, rather than comparing precomputed embeddings. You can learn more about the differences between embedding models and cross-encoders in this earlier blog post.
This two-step process can significantly improve the accuracy of the top few results passed to the LLM. It helps mitigate some of the limitations of pure vector similarity, which can miss subtle semantic signals.
Fortunately, re-ranking is easy to incorporate into your RAG. There are plenty of plug-and-play options: from hosted APIs like Cohere or Together AI to lightweight, open-source models like mxbai-rerank (on Hugging Face) or local options via Ollama. Your choice will depend on tradeoffs like latency, cost, privacy, and infrastructure preferences.
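Here's a minimal sketch of how a second-pass re-ranker might slot in, using the open-source sentence-transformers library; the model name, query, and candidate chunks are illustrative placeholders rather than recommendations:

```python
# Sketch: re-rank vector-search candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

# The cross-encoder scores each (query, chunk) pair jointly instead of
# comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

query = "What are the main risk factors in the 2023 filing?"
candidates = [  # e.g., the top-100 chunks returned by your vector search
    "Risk factors include supply chain disruptions and currency exposure.",
    "The company opened 12 new retail locations during the quarter.",
    "Management identified cybersecurity threats as a material risk.",
]

# Score every (query, candidate) pair, then sort best-first.
scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

for chunk, score in reranked[:2]:  # pass only the strongest chunks to the LLM
    print(f"{score:.3f}  {chunk}")
```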
While re-ranking can dramatically improve the relevance of the final set of retrieved chunks, it still relies on the quality of the initial candidate set from the vector search. But what happens when that first retrieval step doesn't return enough relevant candidates? That's where hybrid search can help.
Hybrid Search
Even with re-ranking, pure vector search doesn't always cut it, especially when queries contain out-of-vocabulary terms, acronyms, or rare domain-specific words. While embeddings are great at capturing semantic meaning, they can overlook exact matches when the key terms fall outside the embedding model's vocabulary.
That’s where BM25, a tried-and-true keyword-based search, excels. BM25 scores how relevant a document is to a search query based on a few key principles:
Term Frequency (TF): How often each search term appears in a document.
Inverse Document Frequency (IDF): Rare terms carry more weight, so documents with uncommon matches rank higher.
Document Length Normalization: Longer documents don’t automatically get higher scores just because they contain more words.
Query Term Saturation: Repeating the same keyword a dozen times won’t unfairly boost relevance.
BM25 is great at surfacing document chunks where exact matches act as high-precision signals. However, since BM25 doesn't capture semantic meaning the way vector similarity search does, it can miss relevant documents due to vocabulary mismatch.
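If you want to see BM25 scoring in action on its own, here's a toy example that assumes the open-source rank_bm25 package and a made-up mini corpus:

```python
# Tiny BM25 illustration with the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

corpus = [
    "Walmart reported strong earnings growth in fiscal 2023.",
    "Costco's membership renewal rate remained above 90 percent.",
    "Risk factors include inflation and supply chain constraints.",
]
# BM25 works on tokens, so lowercase and split each document.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "walmart earnings".split()
print(bm25.get_scores(query))             # one relevance score per document
print(bm25.get_top_n(query, corpus, n=2)) # the highest-scoring documents
```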
Hybrid search combines both. You run keyword and vector searches in parallel and merge or re-rank their results, often using Reciprocal Rank Fusion (RRF). This technique helps cover a wider range of query types, improving robustness and recall without sacrificing semantic relevance.
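RRF itself is simple enough to sketch in a few lines; the document ids and the k constant below are arbitrary and only illustrate the idea:

```python
# Reciprocal Rank Fusion: merge ranked lists from keyword and vector search
# without needing their raw scores to be on comparable scales.
def reciprocal_rank_fusion(result_lists, k=60):
    """result_lists: iterable of ranked lists of document ids, best first."""
    fused = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7"]   # keyword search ranking
vector_hits = ["doc1", "doc4", "doc3"]   # semantic search ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# doc1 and doc3 rise to the top because both searches agree on them
```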
Many vector databases and search engines now support hybrid search natively, including Weaviate, Elastic, Milvus, Astra DB, and others. Enabling it is usually as simple as toggling a hybrid mode and adjusting how you want to weigh the results of keyword and vector search in the final ranking.
Hybrid search helps you cast a wider net by combining the strengths of keyword and semantic search, but sometimes the smartest move isn't casting a wider net at all. It's narrowing the search space before you search.
Metadata Pre-Filtering
Sometimes the best way to boost retrieval performance is by searching smarter. Before you even get to semantic similarity, narrowing the search space with metadata filters can dramatically improve both precision and relevance.
Consider financial filings like SEC 10-Ks or 10-Qs. These documents follow consistent templates, and the language used across different companies or reporting periods is often nearly identical. Without constraints, a semantic search looking for "risk factors in a 2023 filing for Company A" could easily return a section from a 2020 filing, or worse, from an entirely different company. Cosine similarity picks up on the similar phrasing and returns those documents as matches. Technically, yes, they are similar, but practically, they just add noise.
Metadata filters let you avoid that. By pre-filtering based on structured attributes like company name, document type, author, or filing year, you can narrow down the candidate pool before running vector search. This ensures that semantic similarity operates within the right context, improving both accuracy and trust in the results.
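As a concrete sketch, here's what pre-filtering might look like with Chroma standing in as an example vector store; the collection name and the company/filing_year metadata fields are assumptions about your schema, not a required setup:

```python
# Metadata pre-filtering sketch with Chroma as an example vector store.
import chromadb

client = chromadb.Client()
filings = client.create_collection("sec_filings")

# Each chunk is stored with structured metadata extracted during preprocessing.
filings.add(
    ids=["wmt-2023-risk", "wmt-2020-risk", "cost-2023-risk"],
    documents=[
        "Walmart 2023 risk factors: supply chain and currency exposure.",
        "Walmart 2020 risk factors: pandemic-related store closures.",
        "Costco 2023 risk factors: membership concentration.",
    ],
    metadatas=[
        {"company": "Walmart", "filing_year": 2023},
        {"company": "Walmart", "filing_year": 2020},
        {"company": "Costco", "filing_year": 2023},
    ],
)

# Restrict the candidate pool with structured filters first, then rank the
# remaining chunks by semantic similarity.
results = filings.query(
    query_texts=["What are the key risk factors?"],
    n_results=1,
    where={"$and": [{"company": "Walmart"}, {"filing_year": 2023}]},
)
print(results["documents"])
```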
Of course, this all relies on having rich, reliable metadata in the first place. Unstructured automatically extracts and attaches key metadata, like filename, source, date, and more, during the document partitioning step. It’s captured right up front, so you can use it later to filter precisely. And if your use case calls for additional metadata, you can add it as a step in your workflow using custom plugins in your own Unstructured deployment, giving you as much flexibility as you need.
When used well, metadata filtering is one of the most efficient ways to boost retrieval quality, and it often pays off more quickly than other methods.
Parent-Document Retrieval
One challenge in RAG design is balancing chunk size. Small chunks tend to improve retrieval precision but can strip out essential context. Larger chunks preserve that context but often dilute relevance with extraneous content.
A practical middle ground is to index small, semantically meaningful chunks while enriching each with metadata linking back to their broader context—such as the page or parent document they originated from. At query time, once a relevant chunk is found via vector similarity, you can expand the context by retrieving all other chunks from the same page. For short-form documents, it may even be optimal to retrieve the entire file.
Conceptually, this method requires a two-step retrieval pattern: vector search followed by metadata-based filtering, which databases like Elasticsearch and MongoDB Atlas support with optimizations to make it efficient. These systems allow you to retrieve a chunk via similarity search, then use structured metadata fields (e.g., filename, page_number) to retrieve all related chunks. You can learn more about Parent-Document retrieval for RAG in this blog post by MongoDB.
When implementing this type of retrieval, you'll need to decide on expansion granularity (e.g., the entire document, a single page, or a multi-page window) based on the typical structure and length of your documents. It's also a good idea to double-check that your database deduplicates the expanded results, because multiple retrieved chunks may come from the same page or document.
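In its simplest form, the expansion step can be expressed as a small helper like the one below, a self-contained sketch that assumes each chunk carries the filename and page_number metadata attached during preprocessing:

```python
# Parent-document expansion sketch: after vector search returns the best
# chunks, pull in every chunk from the same page, de-duplicated.
def expand_to_pages(hits, all_chunks):
    """hits: chunks returned by similarity search; all_chunks: the full index."""
    pages = {(h["filename"], h["page_number"]) for h in hits}
    expanded, seen = [], set()
    for chunk in all_chunks:
        key = (chunk["filename"], chunk["page_number"], chunk["text"])
        if (chunk["filename"], chunk["page_number"]) in pages and key not in seen:
            expanded.append(chunk)   # keep original order, skip duplicates
            seen.add(key)
    return expanded

all_chunks = [
    {"text": "Revenue grew 6% year over year.", "filename": "10k.pdf", "page_number": 12},
    {"text": "Growth was driven by e-commerce.", "filename": "10k.pdf", "page_number": 12},
    {"text": "Board of directors biographies.",  "filename": "10k.pdf", "page_number": 88},
]
hits = [all_chunks[0]]                     # what the vector search returned
print(expand_to_pages(hits, all_chunks))   # both page-12 chunks come back
```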
Unstructured enables this retrieval strategy out of the box. During preprocessing, it enriches every chunk with metadata such as the original filename, page number, and more. This allows you to flexibly expand retrieval from a single chunk to a page or full document without additional processing.
Asking Better Questions: Query Transformation Tricks
Many retrieval issues stem from the query itself. It might be ambiguous, too short, or simply phrased in a way that doesn't align with how the source content is written. Query transformation techniques fix this by using an LLM before retrieval to reframe the query for better results.
Hypothetical Document Embeddings (HyDE)
Instead of embedding the original query, HyDE first asks an LLM to generate a hypothetical answer to the question. That answer is then embedded and used for vector search.
Why does this help? Because queries are often abstract, while documents are concrete. By turning a vague question into a plausible answer, HyDE moves the query closer in form to what's actually in your data. It's especially effective in niche or underspecified domains.
As with most retrieval techniques, HyDE performs best when your data preprocessing yields semantically clean, coherent chunks. If the underlying embeddings are poor, generating a better query vector won't be enough to recover quality.
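A bare-bones HyDE flow might look like the following, using the OpenAI SDK purely as an example backend; the model names, prompt wording, and the final search call are placeholders you'd adapt to your own stack:

```python
# HyDE sketch: embed a hypothetical answer instead of the raw question.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "How did rising interest rates affect regional bank deposits?"

# 1. Ask the LLM for a plausible (hypothetical) answer passage.
hypothetical = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{
        "role": "user",
        "content": f"Write a short passage that could answer: {question}",
    }],
).choices[0].message.content

# 2. Embed the hypothetical answer rather than the original question.
query_vector = client.embeddings.create(
    model="text-embedding-3-small",  # example embedding model
    input=hypothetical,
).data[0].embedding

# 3. Use query_vector for similarity search in your vector store, e.g.
#    vector_store.search(query_vector, top_k=10)  (placeholder call).
```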
Query Rephrasing
Sometimes all it takes to improve retrieval is a little rewording. An LLM can rephrase a question into alternate forms that may better match the terminology or phrasing used in your documents.
Think of it like this: "How is inflation affecting consumer sentiment?" might not match well with documents that say "impact of rising prices on shopper behavior." Rephrasing helps bridge that gap.
It's lightweight, easy to implement, and particularly useful in domains with lots of jargon or domain-specific language.
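One lightweight way to do this, again using an LLM call as an illustration (the prompt and the number of variants are arbitrary choices), is to generate a handful of rewrites and retrieve with all of them:

```python
# Query rephrasing sketch: generate alternate phrasings, then query with each.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Rewrite the following question three different ways, using alternative "
    "terminology a financial report might use. Return one rewrite per line.\n\n"
    "How is inflation affecting consumer sentiment?"
)
variants = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content.splitlines()

# Retrieve with the original query plus each variant, then merge the result
# lists (e.g., with the reciprocal_rank_fusion helper shown earlier).
print(variants)
```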
Query Decomposition
Some questions are inherently complex. They require multiple steps of reasoning or pull from different parts of the knowledge base. Query decomposition tackles this by using an LLM to break a single, high-level query into a sequence of simpler sub-questions. Each sub-query is executed in turn, and intermediate results can inform subsequent retrieval steps.
For example, consider financial documents again, and a query like “Compare the 2023 earnings of Walmart and Costco.” SEC filings are structured such that each company has its own standalone 10-K report—there’s no single document that answers this question directly. With query decomposition, the system can first generate two sub-queries: “What were Walmart’s earnings in 2023?” and “What were Costco’s earnings in 2023?” Each sub-query is handled independently, retrieving the relevant sections from their respective filings. The answers are then synthesized to provide a clear comparison.
This method helps manage multi-hop reasoning, where the system needs to gather pieces of evidence from different places and synthesize them into a final answer.
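A simplified decomposition loop might look like this; the model name is an example, and the retrieval and per-sub-question answering steps are left as placeholders for your own pipeline:

```python
# Query decomposition sketch: split a comparison question into sub-questions,
# answer each against the index, then synthesize a final response.
from openai import OpenAI

client = OpenAI()
question = "Compare the 2023 earnings of Walmart and Costco."

# 1. Ask the LLM to break the question down.
sub_questions = client.chat.completions.create(
    model="gpt-4o-mini",  # example model
    messages=[{
        "role": "user",
        "content": f"Break this question into simple sub-questions, one per line:\n{question}",
    }],
).choices[0].message.content.splitlines()

# 2. Answer each sub-question independently.
answers = []
for sub_q in sub_questions:
    # In a real pipeline you would retrieve context here and answer with it:
    #   context = vector_store.search(sub_q, top_k=5)   (placeholder call)
    #   answer  = answer_with_context(sub_q, context)   (placeholder helper)
    answers.append(f"(answer to: {sub_q})")  # stand-in so the sketch runs

# 3. Synthesize the final comparison from the intermediate answers.
final = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Using these findings:\n{answers}\nAnswer: {question}",
    }],
).choices[0].message.content
print(final)
```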
By generating hypothetical documents, query variants, or decomposed sub-queries, you can create better inputs for the retrieval engine and optimize the search before retrieval.
RAG Gets Smart: Adaptive and Agentic Approaches
At this point, we’ve already moved beyond basic one-shot retrieval. Techniques like re-ranking, hybrid search, parent-document expansion, and query decomposition all introduce additional logic, whether that's re-evaluating results, expanding context, or breaking complex queries into simpler pieces. These approaches enhance retrieval quality, but they still operate within a largely fixed pipeline.
To go further, you can explore GraphRAG and Agentic RAG. Both push past traditional retrieval patterns, but in different ways. Let’s explore them.
Graph RAG
Most RAG pipelines operate over a flat index of semantically embedded text chunks. GraphRAG, in contrast, uses a Knowledge Graph (KG) as the backbone of its retrieval system, adding structure, meaning, and relational context to the search process. In a knowledge graph, information is modeled as entities (nodes) and relationships (edges), enabling the system to retrieve not just isolated facts but entire webs of connected information.
When a query is issued, GraphRAG identifies relevant entities and traverses their neighboring nodes to extract a subgraph of related context. This yields results that are not only semantically relevant but also structurally coherent. A key advantage of this approach is traceability: GraphRAG can potentially surface the exact reasoning path through the graph, making its answers more explainable and trustworthy.
Building a high-quality knowledge graph from unstructured data is challenging. It typically involves entity recognition, relationship extraction, and schema definition. Unstructured’s Named Entity Recognition (NER) enrichment helps kick-start this process by extracting structured components such as entities and relationships directly from the documents' raw text. Once constructed, a knowledge graph can be queried directly or combined with traditional vector search in a hybrid retrieval setup.
Unstructured natively integrates with traditional graph databases like Neo4j, as well as systems like AstraDB, which support lightweight, dynamic knowledge graphs, making it easy to incorporate graph-based retrieval into your RAG pipeline.
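As a rough illustration of the retrieval side, the snippet below pulls the immediate neighborhood of an entity from Neo4j; it assumes a locally running instance whose nodes carry an Entity label with a name property, which is an example schema rather than a requirement:

```python
# GraphRAG retrieval sketch: fetch an entity's neighborhood from Neo4j.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def fetch_subgraph(entity_name):
    """Return the entity's direct relationships as a list of dicts."""
    query = (
        "MATCH (e:Entity {name: $name})-[r]-(neighbor) "
        "RETURN e.name AS entity, type(r) AS relation, neighbor.name AS neighbor "
        "LIMIT 25"
    )
    with driver.session() as session:
        result = session.run(query, name=entity_name)
        return [record.data() for record in result]

# The returned triples can be serialized into the LLM prompt as structured
# context, alongside (or instead of) semantically retrieved chunks.
print(fetch_subgraph("Walmart"))
```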
Agentic RAG
Where GraphRAG adds structured data, Agentic RAG introduces structured control.
Instead of running a fixed query → retrieve → generate pipeline, Agentic RAG deploys LLM-based agents that dynamically decide how to proceed. These agents can adapt their strategy, orchestrating retrieval and reasoning steps tailored to the task at hand.
Agentic RAG systems can:
Reformulate ambiguous or underspecified queries
Choose among tools (e.g., retriever, external APIs like web search)
Chain multiple retrievals and generations to build composite answers
Critique intermediate outputs and revise the plan mid-execution
Skip retrieval entirely when the answer is trivial or cached
In essence, the system starts to reason about the process itself. This enables capabilities far beyond traditional RAG, from multi-hop queries to open-ended exploration. Instead of just retrieving facts, Agentic RAG systems can navigate information landscapes, applying different tools at each step to achieve better outcomes.
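To make this concrete, here is a deliberately minimal control loop in which an LLM chooses between searching and answering; the model name, prompt format, and the stubbed search() helper are all assumptions for illustration, not a prescribed Agentic RAG design:

```python
# Minimal agent loop sketch: the LLM decides whether to retrieve more context
# or answer, up to a fixed step budget.
from openai import OpenAI

client = OpenAI()

def search(query: str) -> str:
    # Placeholder retriever; swap in your real hybrid/vector search.
    return "Walmart's fiscal 2023 revenue was roughly $611 billion."

question = "What was Walmart's revenue in fiscal 2023?"
notes = []

for _ in range(3):  # cap the number of steps to bound cost and latency
    decision = client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nNotes so far: {notes}\n"
                "Reply with either SEARCH: <query> to retrieve more context, "
                "or ANSWER: <final answer> if you have enough information."
            ),
        }],
    ).choices[0].message.content.strip()

    if decision.startswith("ANSWER:"):
        print(decision.removeprefix("ANSWER:").strip())
        break
    # Otherwise treat the reply as a search request and record the result.
    notes.append(search(decision.removeprefix("SEARCH:").strip()))
```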
However, this flexibility comes with tradeoffs. Agentic RAG tends to be more resource-intensive. To operate effectively, agents typically require:
Larger context windows, to maintain memory across multiple steps
Advanced reasoning or coding capabilities, if they're expected to use tools like calculators, APIs, or code execution environments
Stronger error handling mechanisms, since mistakes can propagate and compound across the workflow
You’ll also see higher token usage and latency, as multi-step processes accumulate more API calls and intermediate generation steps. That means higher cost and longer response times. And regardless of how clever the agent is, no reasoning strategy can recover missing or poorly indexed data. If relevant content isn’t available—or isn’t chunked, filtered, or tagged with metadata correctly—then even the best Agentic RAG system will fall short.
That’s why high-quality preprocessing matters more than ever. Unstructured simplifies the critical data preparation phase, enabling fine-grained partitioning, smart chunking, automatic metadata enrichment, and custom workflows. With well-structured data, your agents have a clear map to follow. Without it, they’re navigating blind.
Conclusion
As we've explored throughout this post, you have a whole arsenal of advanced retrieval techniques at your disposal to enhance the performance of your RAG systems. From re-ranking and hybrid search to parent-document retrieval and query transformation, each approach addresses specific challenges that simple vector similarity alone cannot solve. The more complex your use case becomes, the more these techniques prove their worth.
While we've focused on enhancing retrieval in this post, even the most sophisticated retrieval techniques rely on a solid foundation of well-processed data. That's why in the next part of this series we’ll talk about data preprocessing. Stay tuned for our next blog post, which will look into unstructured data partitioning, chunking, metadata, enrichments, and more.