Jun 5, 2025
Level Up Your GenAI Apps: What’s Next for RAG
Maria Khalusova
RAG
Every few weeks a new wave of discourse hits: “RAG is dead!” some claim, pointing to larger context windows, AI agents, or “agentic search.” But the reality? RAG is more indispensable than ever.
At Unstructured, we see firsthand that search and foundational data work remain crucial. It’s tempting to believe that bigger models or smarter agents will handle everything, but in practice, success almost always comes down to clean, well-structured data and a reliable retrieval pipeline. That’s not going away.
In this final blog post of the series (Part 1, Part 2, Part 3, Part 4, Part 5), let’s talk about where RAG is headed: what’s trending, what’s changing, and which foundations you still need to get right.
Ever Increasing Context Windows are Not a Cure-all
LLMs today support massive context windows, e.g. GPT-4.5 takes ~128K tokens, Claude 4 up to 200K. That’s impressive. But does that kill RAG? Not quite. As we laid out in one of our previous posts, larger context windows allow you to fit more data into a single prompt, but they introduce noise, complexity, and cost. And crucially, they still fall short of accommodating the full breadth of data that real enterprise systems deal with.
Recent experiments show that combining just a few large documents in a prompt can actually harm performance. For example, Snowflake’s research found that stuffing ~14,400-character chunks into a 200K-token model (Claude Sonnet) degraded retrieval accuracy by bundling irrelevant details. Even with a 200K-token window, Claude Sonnet suffered “context confusion” if chunks were too big and unfocused. In practice, smaller, thematic chunks (≈1,800 characters) plus retrieving more of them (e.g. top-50) worked best. The lesson? More isn’t always better. You still need a thoughtful chunking strategy and a retrieval layer tuned for precision and relevance. At Unstructured, we’ve long advocated for intelligent preprocessing: parse your documents well, chunk them semantically, preserve structure, and tag them with useful metadata.
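As a rough sketch of that approach using the open-source unstructured library (the file name is hypothetical, and the chunking parameters are illustrative rather than tuned recommendations):

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Parse the document into structured elements (titles, paragraphs, tables, ...)
elements = partition(filename="report.pdf")  # hypothetical file

# Chunk by section boundaries, keeping each chunk small and thematic
chunks = chunk_by_title(
    elements,
    max_characters=1800,             # cap chunk size instead of stuffing huge blocks
    combine_text_under_n_chars=200,  # merge tiny fragments into their neighbors
)

for chunk in chunks:
    print(chunk.metadata.page_number, chunk.text[:80])
```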
And let’s not forget the cost. Premai’s analysis shows RAG is far more efficient when the task doesn’t require the full long context: you retrieve only what’s needed, keep the prompts tight, and save on inference cost.
Bottom line: bigger contexts help, but they shift the tradeoffs, not remove them. You still need clean, intelligently processed data. You still need retrieval.
The Agentic Revolution
Increasingly, “RAG” isn’t just a simple fetch step but part of an agentic pipeline.
Agentic applications are on the rise: instead of a static pipeline, you have LLMs acting as planners that can break down a query, call tools like a retriever, a calculator, or a database, and then synthesize a response. For instance, LangChain’s retrieval agent example shows how an LLM can dynamically decide when and how to use search.
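Stripped of any particular framework, the pattern looks roughly like this. It’s a minimal sketch: `call_llm` and `vector_search` are hypothetical stand-ins for your model client and your retriever.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to your LLM provider."""
    raise NotImplementedError

def vector_search(query: str, top_k: int = 5) -> list[str]:
    """Placeholder for a similarity search against your index."""
    raise NotImplementedError

def answer(question: str) -> str:
    # Step 1: the model acts as a planner -- does it need external knowledge?
    plan = call_llm(
        "Decide whether answering the question below requires searching the "
        "document index. Reply with SEARCH: <query> or ANSWER.\n\n" + question
    )
    # Step 2: if the planner asked for a search, call the retriever tool.
    context = ""
    if plan.startswith("SEARCH:"):
        chunks = vector_search(plan.removeprefix("SEARCH:").strip())
        context = "\n\n".join(chunks)
    # Step 3: synthesize a grounded response from whatever was retrieved.
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```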
But here’s the catch: an agent is only as good as the tools and data available to it. Even the smartest planner fails if it’s retrieving from a poor-quality index or working with bad document chunks. That’s why RAG, especially in agentic pipelines, still depends on the fundamentals:
Meaningful chunks
Rich metadata
High-quality indexing
Optimized retrieval
We expect agent-based RAG flows to keep growing, especially for complex multi-hop tasks. But this doesn’t make RAG obsolete; on the contrary, it makes better RAG even more important.
Multimodal RAG: Beyond Just Text
Enterprises don’t deal in text only. Knowledge lives in images, charts, tables, scans, and even videos or audio. Multimodal RAG extends traditional pipelines to retrieve and ground generation on non-text content, then synthesize it using multimodal models like GPT-4o or Claude 4.
To make this work, you need infrastructure that can:
Parse and chunk multimodal content (e.g. slides, scanned docs, audio, video)
Embed and index across formats
Retrieve meaningfully regardless of modality
Route to models that can interpret what’s returned
Unstructured already supports many of the core capabilities required for effective multimodal RAG. Our pipelines handle layout-rich documents like PDFs, slide decks, and web pages with structure and context intact, rather than flattening them into plain text. We also support audio and video partitioning for select customers, with outputs integrated into the same processing workflows used for text. While multimodal RAG is still gaining traction across the industry, these capabilities are already in active use today within Unstructured pipelines.
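As a rough illustration of the first two steps with the open-source unstructured library (the file name is hypothetical, and exact parameter names can vary between versions):

```python
from unstructured.partition.pdf import partition_pdf

# The hi_res strategy runs a layout model, so tables, images, and titles
# keep their structure instead of being flattened into plain text.
elements = partition_pdf(
    filename="quarterly_deck.pdf",                 # hypothetical file
    strategy="hi_res",
    infer_table_structure=True,                    # tables come back with HTML structure
    extract_image_block_types=["Image", "Table"],  # keep crops a vision model can interpret
)

for el in elements:
    if el.category == "Table":
        # Structured HTML can be indexed as-is or summarized before embedding
        print(el.metadata.text_as_html[:200])
```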
Memory: The Next Layer of Context
Most RAG systems today are stateless: they answer one query at a time. But real enterprise workflows span sessions, decisions, and evolving user needs. Increasingly, we’re seeing the demand for memory: systems that can retain and recall relevant context over time.
In cognitive science, memory is often described in layers: short-term, working, and long-term memory. Each layer has a different role in learning and reasoning. A similar framing is emerging in LLM-based systems: short-term memory lives in the context window, working memory involves chaining or agents, and long-term memory encompasses persistent knowledge of past interactions.
But real memory isn’t just saved chat logs. It’s about continuity and the ability to re-engage with a user’s goals over time. Someone should be able to ask, “Remind me what we decided last week,” or “Show me the updated version of that report from Q3 instead of Q2,” and get a grounded, scoped response.
Designing memory systems raises hard questions:
What information is worth remembering?
When should memory be recalled versus ignored?
How do we keep memory secure, scoped, and up to date, especially as documents change or access controls shift?
RAG and memory solve different, but complementary, problems. RAG ensures your system retrieves what’s currently true based on the latest knowledge. Memory adds the temporal and personal dimension, helping users pick up where they left off. The most capable systems will do both: retrieve what’s true now, and remember what was relevant before.
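To make that concrete, here is a toy sketch of scoped long-term memory: per-user session summaries stored with metadata and recalled only when they belong to the asking user and are still fresh. The naive keyword relevance check is purely illustrative; a real system would use embeddings and stricter policies.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Memory:
    user_id: str
    created: date
    summary: str

MEMORY_STORE: list[Memory] = []

def remember(user_id: str, summary: str) -> None:
    MEMORY_STORE.append(Memory(user_id, date.today(), summary))

def recall(user_id: str, query: str, max_age_days: int = 90) -> list[str]:
    return [
        m.summary
        for m in MEMORY_STORE
        if m.user_id == user_id                                    # scoped to the asking user
        and (date.today() - m.created).days <= max_age_days        # drop stale memories
        and any(w in m.summary.lower() for w in query.lower().split())  # naive relevance check
    ]

remember("alice", "Decided to ship the Q3 report with the revised revenue table.")
print(recall("alice", "what did we decide about the Q3 report?"))
```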
Enterprise RAG Adoption (Data, Controls, Ops, Evals)
Even as RAG gets smarter, its adoption hinges on fundamentals. In enterprise deployments, RAG doesn’t just need to work; it needs to be trustworthy, explainable, and secure. That starts with better data work, but goes far beyond it. In practice, your mantra should be: garbage in, garbage out. The best models and the cleverest prompts can’t compensate for messy or missing data.
Data Preprocessing & Chunking
Clean, well-chunked data is non-negotiable. Documents (PDFs, HTML, tables, etc.) must be parsed and split into semantically meaningful chunks. Preserving structure in your chunks gives LLMs a clear context, reduces noise, and tends to greatly improve downstream performance.
Metadata
Indexes should carry rich metadata (document type, source, date, security label, etc.), and RAG queries should filter on it. For instance, an employee might only retrieve from the docs for their own projects. This isn’t optional in enterprise: it’s common to filter retrieval candidates by user role or other attributes before any similarity search runs. Hybrid indexes help here too: full-text search engines can enforce filters efficiently. Proper tagging and filters keep LLMs from pulling irrelevant or unauthorized content.
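Here is roughly what metadata-filtered retrieval looks like, using Chroma as a stand-in vector store. The documents, labels, and filter values are invented for the example.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("enterprise_docs")

# Index chunks alongside the metadata you want to filter on later.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "Project Atlas kickoff notes: milestones and owners.",
        "Master services agreement, confidential terms.",
    ],
    metadatas=[
        {"doc_type": "notes", "project": "atlas", "security_label": "internal"},
        {"doc_type": "contract", "project": "atlas", "security_label": "restricted"},
    ],
)

# The filter runs before results ever reach the LLM, so a user cleared only
# for internal material never sees restricted chunks.
results = collection.query(
    query_texts=["What are the Atlas milestones?"],
    n_results=5,
    where={"security_label": "internal"},
)
print(results["documents"])
```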
The Future is Identity-Aware Retrieval
Enterprises need retrieval that respects access boundaries: an engineer shouldn’t retrieve sales contracts. But most RAG pipelines today treat the user as invisible. That’s a problem.
We expect to see rapid development in identity-aware RAG. Retrieval should integrate with IAM systems. Chunks should carry access control tags. Queries should be filtered not just by content, but by who’s asking.
No one has fully solved this yet, but we expect that to change quickly. Identity-aware RAG is often what separates a prototype from a production application.
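A hypothetical sketch of the idea, building on the filter shape from the previous example: map the caller’s role, as reported by your IAM system, to the access labels they may read, and pass that as the retrieval filter. The role names, labels, and mapping are made up.

```python
ROLE_TO_LABELS = {
    "engineer": ["public", "engineering"],
    "sales": ["public", "sales", "contracts"],
}

def access_filter(role: str) -> dict:
    allowed = ROLE_TO_LABELS.get(role, ["public"])    # default to least privilege
    return {"security_label": {"$in": allowed}}       # same filter shape as the query above

# e.g. collection.query(query_texts=[question], where=access_filter(user_role))
```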
Structured + Unstructured Fusion
Real enterprise knowledge lives in both structured databases and unstructured content. Future RAG pipelines will increasingly blend the two.
At Unstructured, we’re working to make this seamless. Our platform can preprocess both structured and unstructured data in the same workflow, standardizing the outputs. That gives your downstream system a consistent and high-quality view of enterprise knowledge.
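One hedged sketch of what that fusion can look like in application code: exact figures come from SQL, narrative context comes from the retriever, and both land in a single prompt. The schema, database file, and `vector_search` helper are hypothetical.

```python
import sqlite3

def vector_search(query: str, top_k: int = 3) -> list[str]:
    """Placeholder for the retriever from the earlier sections."""
    raise NotImplementedError

def build_prompt(question: str, db_path: str = "finance.db") -> str:
    # Structured side: pull exact numbers from a relational table.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT region, revenue FROM quarterly_revenue WHERE quarter = 'Q3'"
    ).fetchall()
    conn.close()
    facts = "\n".join(f"{region}: {revenue}" for region, revenue in rows)

    # Unstructured side: retrieve supporting narrative chunks.
    context = "\n\n".join(vector_search(question))

    return f"Structured data:\n{facts}\n\nDocuments:\n{context}\n\nQuestion: {question}"
```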
Observability & Evaluation
You can't improve what you don't measure. The most successful RAG deployments establish robust evaluation systems that create a flywheel for rapid iteration and improvement. Track at least three classes of metrics:
System metrics: Latency, throughput, retrieval time, prompt size
Quality metrics: Precision, recall, chunk relevance scores, context utilization
Cost metrics: Token usage breakdown, retrieval vs. generation costs
Look at your data and examine traces of your pipelines to understand what causes failures. Build out your own evaluations specific to your use case. The truth is, evaluating RAG systems is hard, and manual “vibe checking” doesn’t scale. A common solution is model-based evaluation (LLM-as-a-judge); however, it needs to be carefully aligned with the human judgment and preferences that matter for your use case.
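A minimal LLM-as-a-judge sketch, here grading answer faithfulness with the OpenAI client. The rubric, model choice, and single-digit scoring are illustrative and should be calibrated against human labels before you trust the numbers.

```python
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    # Ask a strong model whether the answer is actually supported by the
    # retrieved context, and return its 1-5 score.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rate from 1 (unsupported) to 5 (fully supported) how well the "
                "answer is grounded in the context. Reply with a single digit.\n\n"
                f"Question: {question}\n\nContext: {context}\n\nAnswer: {answer}"
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())
```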
Latency and Cost Trade-offs
Finally, keep an eye on efficiency. More retrieved chunks improve answer quality up to a point, but they increase prompt size and latency. Sometimes a smaller LLM with a well-pruned RAG pipeline will outperform a brute-force approach with a huge LLM. Evaluate end-to-end latency against your SLOs, and be prepared to cache embeddings or warm the index if needed.
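As a small illustration of one of those levers, here is a sketch of an embedding cache so repeated or unchanged chunks are never re-embedded. `embed_batch` is a stand-in for your embedding provider; the fake implementation exists only so the sketch runs.

```python
import hashlib

_EMBEDDING_CACHE: dict[str, list[float]] = {}

def embed_batch(texts: list[str]) -> list[list[float]]:
    # Stand-in for a real embedding API call; replace with your provider.
    return [[float(len(t))] for t in texts]

def embed_with_cache(texts: list[str]) -> list[list[float]]:
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    # Only send texts we haven't embedded before.
    missing = [t for t, k in zip(texts, keys) if k not in _EMBEDDING_CACHE]
    if missing:
        for t, vec in zip(missing, embed_batch(missing)):
            _EMBEDDING_CACHE[hashlib.sha256(t.encode()).hexdigest()] = vec
    return [_EMBEDDING_CACHE[k] for k in keys]
```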
Final Takeaways
RAG is not dead; it is entering its next chapter. Yes, context windows are expanding. Yes, agents are getting smarter. But the foundations still matter:
Clean, well-chunked data
Smart metadata and filtering
Observability and evaluations
Cost-effective architecture
At Unstructured, we believe strong foundations are what make GenAI systems truly production-ready. The future of GenAI will be defined by those who get this layer right, and we’re enabling that future by continuously building a data foundation that’s easy to use, scalable, and enterprise-grade.
If you want to build GenAI that works securely, efficiently, and at scale, start with your data.