

RAG Systems Best Practices: Unstructured Data Pipeline
This article breaks down how to build and run a production RAG pipeline, from offline ingestion and indexing through chunking, retrieval, reranking, evaluation, and security controls that keep answers grounded and auditable. If you want a more standard ingestion layer, Unstructured helps you turn messy enterprise documents into clean, permission-aware JSON chunks and metadata that load reliably into your vector index and downstream LLM workflows.
RAG system workflow and modules
A RAG system is a retrieval augmented generation system that retrieves relevant source content and then asks a large language model (LLM) to generate an answer grounded in that content. This means you treat the system as an information pipeline with clear interfaces, not as a single prompt.
The most reliable implementation pattern separates two workflows: an offline indexing workflow that prepares data for search, and an online retrieval workflow that answers queries. This separation matters in production because you can change ingestion and data quality without changing the runtime path that serves users.
A practical mental model is that the offline workflow builds a searchable memory, and the online workflow assembles context for a single response. When either workflow is ambiguous, you lose debuggability because you cannot tell whether failures come from data preparation, retrieval, or generation.
Most RAG requirements map to a small set of modules, and each module should produce outputs that are easy to log, test, and replay.
- Connectors: Connectors pull content from systems of record and preserve identifiers and access rules. This reduces data drift because you can trace every chunk back to a source object and permission set.
- Parsers: Parsers convert files into structured elements like titles, paragraphs, tables, and images. This preserves layout signals that later help chunking and retrieval.
- Chunkers: Chunkers split parsed content into retrievable units. This improves relevance because the retriever can return the smallest unit that still contains the answer.
- Embedders: Embedders turn chunks into vectors, which are numeric representations of meaning used for similarity search. This enables semantic retrieval even when users do not type the same words as the document.
- Vector index: The vector index stores vectors and metadata and supports filtered similarity search. This enables you to restrict retrieval by source, time range, tenant, or user permissions.
- Reranker: A reranker re-scores retrieved candidates with a stronger relevance model. This improves precision when the first-stage retrieval returns plausible but not correct matches.
- Prompt assembler: The prompt assembler formats instructions, question, and retrieved context into a stable template. This reduces variance because the LLM sees the same structure across requests.
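The module boundaries above can be sketched as plain functions with loggable inputs and outputs. This is a minimal illustration, not a production implementation: the retriever is a toy lexical scorer standing in for vector search, the reranker is a no-op placeholder for a stronger model, and all names (`Chunk`, `retrieve`, `rerank`, `assemble_prompt`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str  # stable ID for logging and replay
    doc_id: str    # traceable back to the source document
    text: str

def retrieve(query: str, index: list[Chunk], k: int = 3) -> list[Chunk]:
    # Toy lexical retriever standing in for embedding-based vector search.
    def overlap(c: Chunk) -> float:
        q, t = set(query.lower().split()), set(c.text.lower().split())
        return len(q & t) / max(len(q), 1)
    return sorted(index, key=overlap, reverse=True)[:k]

def rerank(query: str, candidates: list[Chunk]) -> list[Chunk]:
    # Placeholder for a stronger cross-encoder relevance model; a no-op here.
    return candidates

def assemble_prompt(query: str, context: list[Chunk]) -> str:
    # Stable template: the LLM sees the same structure on every request.
    blocks = [f"[source:{c.chunk_id}]\n{c.text}" for c in context]
    return ("Answer using only the context below.\n\n"
            + "\n\n".join(blocks)
            + f"\n\nQuestion: {query}")

index = [
    Chunk("c1", "d1", "Refunds are processed within 5 business days."),
    Chunk("c2", "d1", "Shipping is free for orders over 50 dollars."),
]
query = "how long do refunds take"
prompt = assemble_prompt(query, rerank(query, retrieve(query, index)))
```

Because every hand-off is plain data, each stage can be logged, unit-tested, and replayed in isolation.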
Data ingestion and indexing for RAG pipelines
Data ingestion is the process of moving source content into an indexing workflow that produces clean chunks plus metadata for retrieval.
Start by defining what "one document" means for your system, because enterprise sources often mix attachments, comments, and embedded objects.
Parsing is the step that converts raw files into structured elements, and you should treat parsing quality as a first-class constraint. If a parser misorders text, drops table cells, or merges headings into body text, your chunker and embedder will faithfully encode the wrong meaning.
A minimum ingestion contract keeps the retrieval layer governable.
- Stable IDs: Store a stable document ID and a stable chunk ID so you can update or delete content deterministically.
- Source pointers: Store a source URI or path so you can audit answers and support human review.
- Timestamps: Store created and modified times so you can manage freshness and avoid reprocessing unchanged content.
- Access attributes: Store permission attributes that allow retrieval-time filtering by user and tenant.
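The contract above can be captured in a small record type. This is a hedged sketch: the field names, the `stable_chunk_id` helper, and the ID scheme are illustrative assumptions, not a standard schema. The key property is determinism, so re-ingesting unchanged content yields the same IDs.

```python
import hashlib
from dataclasses import dataclass

def stable_chunk_id(doc_id: str, ordinal: int, text: str) -> str:
    # Deterministic: re-ingesting unchanged content yields the same ID,
    # so updates and deletes can target exactly the right records.
    digest = hashlib.sha256(f"{doc_id}:{ordinal}:{text}".encode()).hexdigest()
    return f"{doc_id}-{digest[:12]}"

@dataclass(frozen=True)
class ChunkRecord:
    doc_id: str                      # stable document ID
    chunk_id: str                    # stable chunk ID
    source_uri: str                  # pointer back to the system of record
    created_at: str                  # ISO 8601 timestamps for freshness checks
    modified_at: str
    allowed_groups: tuple[str, ...]  # access attributes for retrieval-time filters
    text: str = ""
```

Freezing the dataclass keeps records hashable and prevents accidental mutation between ingestion and indexing.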
Index refresh is the mechanism that keeps your RAG application aligned with changing enterprise data. If you only run full rebuilds, you will accept long windows of staleness or long maintenance periods, so most systems adopt incremental sync where only changed objects are reprocessed.
Incremental sync requires a change detector, which is a deterministic method for deciding whether a source object changed. In practice, you use a last-modified marker when it is trustworthy, and you fall back to content hashing when it is not.
Embedding is the step that makes retrieval possible, but it also sets long-term coupling between your index and your embedder. When you change embedding models, you usually need to re-embed the corpus, so treat embedder changes as schema migrations that you plan, track, and roll back.
Vector storage is where you pay for retrieval latency and operational complexity. If you need fast filtered search, multi-tenant isolation, and predictable scaling, you will favor a vector database that supports metadata filters efficiently and exposes clear operational controls.
Chunk size and chunk strategy for accurate retrieval
Chunking is the process of segmenting parsed content into units that can be retrieved as context. This means chunking is a retrieval design problem, because the retriever can only return what the chunker produced.
Start with a simple rule: a chunk should contain one topic and one local context boundary. If a chunk mixes sections, the LLM sees unrelated material, which increases hallucination risk because the model must guess which part matters.
Chunk size is best defined relative to your LLM context budget and your retrieval goal. If you optimize for short factual answers, smaller chunks improve precision, and if you optimize for explanations, larger chunks improve completeness, so you select a size that aligns with the response shape you want.
Structure-aware chunking is usually the best first choice for enterprise documents because it preserves headers, lists, and table boundaries.
A small set of chunking patterns covers most RAG techniques.
- Fixed window: Split by length with optional overlap. This is easy to implement but can break semantic units and degrade answer grounding.
- Title-based: Split by headings and keep each section intact. This preserves author intent and improves attribution when users ask about a specific policy or procedure.
- Page-based: Split by page boundaries for scanned PDFs or page-structured content. This improves traceability because you can map answers back to a page range.
- Similarity-based: Split by semantic shifts detected with embeddings. This can reduce topic mixing but adds cost and can be unstable across model changes.
- Contextual enrichment: Attach a short document-level summary to each chunk. This improves retrieval for ambiguous queries because the chunk carries global context.
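The title-based pattern is simple enough to sketch directly. This toy splitter assumes markdown-style `#` headings and keeps each section intact under its heading; the function name and the default title for preamble text are illustrative choices, not a fixed convention.

```python
def title_based_chunks(markdown: str) -> list[dict]:
    # Split on markdown headings so each section stays intact with its title.
    # Text before the first heading falls under a default title.
    chunks: list[dict] = []
    title, lines = "(untitled)", []

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()  # close out the previous section
            title, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Keeping the heading with each chunk also gives you a natural attribution label at answer time.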
Tables deserve special handling because they contain dense facts with strong row and column semantics. If you flatten tables into plain text without preserving relationships, you will retrieve the right table but fail to answer correctly because the LLM cannot reliably reconstruct structure.
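One common way to preserve those relationships is to serialize each row with its column headers attached. This is a sketch of that idea under simple assumptions (clean headers, one fact per row); the `table_rows_to_chunks` name and the bracketed label format are hypothetical.

```python
def table_rows_to_chunks(headers: list[str], rows: list[list[str]],
                         table_id: str) -> list[str]:
    # Serialize each row as "header: value" pairs so row and column
    # relationships survive as plain retrievable text.
    chunks = []
    for i, row in enumerate(rows):
        pairs = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append(f"[{table_id} row {i}] {pairs}")
    return chunks
```

Each row becomes an independently retrievable unit that still names its columns, so the LLM does not have to reconstruct the grid.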
Retrieval and ranking techniques that improve recall
Retrieval is the runtime process that selects candidate chunks for a query. This means retrieval is where you decide what the model is allowed to know for that specific request.
The first best practice is to make retrieval observable by logging the query, the retrieved chunk IDs, and the filters applied. When an answer is wrong, you should be able to prove whether the right chunk was retrieved, because this separates retrieval failure from generation failure.
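A minimal shape for that log record, assuming a hypothetical `log_retrieval` helper that emits one JSON line per request:

```python
import json
import time

def log_retrieval(query: str, filters: dict, chunk_ids: list[str]) -> str:
    # One JSON line per request: enough to replay retrieval and decide
    # whether a bad answer was a retrieval or a generation failure.
    record = {
        "ts": time.time(),
        "query": query,
        "filters": filters,
        "chunk_ids": chunk_ids,  # in rank order
    }
    return json.dumps(record)
```

JSON lines keep the trail grep-friendly and easy to load into whatever analysis tooling you already run.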
Dense retrieval is semantic search using embeddings, and it is the default for most RAG system design because it tolerates paraphrases. This works poorly when users depend on exact identifiers, so you often add keyword search to cover codes, part numbers, and proper nouns.
Hybrid search is the combination of dense and sparse retrieval in a single retrieval plan. This improves recall because sparse retrieval captures lexical matches while dense retrieval captures semantic matches, and you can blend results before reranking.
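One common blending method is reciprocal rank fusion (RRF), which merges two ranked lists using only ranks, so you never have to make dense and sparse scores comparable. A sketch, with the conventional smoothing constant `k = 60` as an assumption rather than a tuned value:

```python
def rrf_merge(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list contributes 1 / (k + rank + 1)
    # per chunk ID, so IDs ranked highly by both retrievers win.
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list then feeds the reranker, which makes the finer-grained relevance call.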
Query expansion is the technique of rewriting one query into multiple related queries. This reduces miss rate when the user question is underspecified, but it increases cost and can amplify ambiguity if you do not constrain the rewrites.
Reranking is the step that takes a short candidate list and orders it by relevance using a stronger model. This improves answer quality because the LLM context window is limited, so every slot you allocate to low relevance content reduces the chance the answer is grounded.
Context packing is the step that assembles the final context blocks for the LLM. You should pack chunks with stable formatting, consistent delimiters, and explicit source labels, because inconsistent formatting creates hidden prompt variance that looks like random model behavior.
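A sketch of stable packing, assuming a hypothetical `pack_context` helper; the tag-style delimiters and label format are one reasonable convention, not a requirement.

```python
def pack_context(chunks: list[dict], question: str) -> str:
    # Stable delimiters and explicit source labels; retrieved text is
    # wrapped so the model can tell it apart from the instructions.
    blocks = [
        f'<context source="{c["source"]}">\n{c["text"]}\n</context>'
        for c in chunks
    ]
    return (
        "Use only the context blocks below and cite sources by label.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )
```

Because the template never changes shape, any change in answer quality points at the content of the chunks rather than the packing.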
Compression is the method for fitting more meaning into the same context window. If you summarize too aggressively, you lose citations and fine detail; if you do not summarize at all, you overflow context and force truncation. So you apply compression only when a chunk is relevant but too long.
RAG evaluation frameworks and metrics to trust
Evaluation is the discipline of measuring whether your RAG implementation meets its requirements under realistic queries. This means you do not trust a demo query, you trust repeatable tests that isolate retrieval quality from generation quality.
Start with a small golden set, which is a curated set of questions paired with expected source documents and an acceptable answer shape. This enables regression testing because you can run the same evaluation after changes to parsing, chunking, embedding, retrieval, or prompts.
Separate retrieval evaluation from generation evaluation.
- Retrieval checks: Validate that the top retrieved chunks contain the needed facts and that filters enforce permissions correctly.
- Generation checks: Validate that the answer is faithful to retrieved context and that it follows output constraints like JSON schema or citation format.
- End-to-end checks: Validate that the final answer is correct for the user task and that the system returns useful failures when it cannot answer.
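The retrieval gate of the checks above reduces to a simple metric over the golden set. This sketch assumes a hypothetical `retrieval_recall_at_k` helper where `golden` pairs each question with its expected source document and `retrieved` maps questions to ranked result IDs.

```python
def retrieval_recall_at_k(golden: list[dict], retrieved: dict, k: int = 5) -> float:
    # Fraction of golden questions whose expected source document
    # appears in the top-k retrieved results.
    hits = sum(
        1 for case in golden
        if case["expected_doc"] in retrieved.get(case["question"], [])[:k]
    )
    return hits / len(golden) if golden else 0.0
```

Running this after every change to parsing, chunking, embedding, or search turns retrieval quality into a regression test rather than an impression.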
A useful practice is to label failure modes, because each failure mode implies a different fix. When retrieval misses, you adjust indexing, chunking, or search, and when generation hallucinates, you adjust context quality, prompt constraints, or response validation.
Evaluation should also cover operational behavior such as latency budgets, timeouts, and partial failures in upstream systems. If your retriever cannot reach the vector index, you should fail closed with a clear error path, because silent fallback to the LLM without context produces ungrounded answers.
Security practices for production RAG systems
Security is the set of controls that governs what content can be indexed, retrieved, and returned. This means RAG security is primarily a data governance problem with an LLM attached, not a prompt problem you can patch later.
The first control is permission preservation, which is the practice of carrying access rules from the system of record into the index as metadata. At retrieval time, you apply deterministic filters so the retriever only returns chunks the requesting user is authorized to see.
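A minimal sketch of that deterministic filter, assuming each chunk carries an `allowed_groups` attribute from ingestion; the function name and metadata shape are illustrative. The important property is failing closed: a chunk with missing or empty access metadata is never returned.

```python
def filter_by_permissions(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Deterministic, fail-closed filtering: a chunk with missing or
    # empty access metadata is never returned to anyone.
    return [
        c for c in chunks
        if c.get("allowed_groups") and user_groups & set(c["allowed_groups"])
    ]
```

In production this filter runs inside the vector store as a metadata predicate rather than in application code, but the fail-closed semantics should be the same.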
Prompt injection is an attack where retrieved content tries to override system instructions. You reduce this risk by treating retrieved text as untrusted input, delimiting it clearly, and avoiding instructions that invite the model to follow directives found in documents.
PII handling should be explicit at ingestion. If you do not want sensitive fields to appear in answers, you redact or tokenize them before embedding, because once a value is embedded and retrieved it becomes hard to reliably block downstream.
A production RAG process also needs auditability.
- Traceability: Store which chunk IDs were retrieved for each response so you can reproduce the context that shaped an answer.
- Review hooks: Route low confidence answers or policy scoped queries to human review or stricter templates.
- Isolation: Separate tenants, environments, and indexes so you do not mix dev data with production data.
Frequently asked questions
How do I choose between dense retrieval and hybrid search for a RAG application?
Dense retrieval is semantic search with embeddings, and it is the default for natural language questions. Hybrid search adds keyword matching, which you should adopt when users ask with identifiers, strict terms, or domain jargon that embedding similarity can miss.
What chunking strategy works best for long PDFs with tables and multi-column layouts?
Structure-aware chunking is splitting by headings and element boundaries, and it works well because it preserves section meaning and table integrity. For dense tables, you also preserve row and column structure so the LLM can reason over relationships instead of flattened text.
What do I log to debug retrieval failures in a production RAG system?
Log the raw query, the derived query representation used for search, the applied filters, and the retrieved chunk IDs in rank order. This gives you a deterministic trail that shows whether the system failed to retrieve the right content or failed to generate from correct content.
When should I re-embed my corpus during a RAG implementation?
Re-embed when you change the embedding model, when you materially change chunking output, or when your normalization rules change the text content fed to the embedder. Re-embedding is an index migration, so you run it with versioned indexes and a controlled cutover.
How do I prevent users from seeing content they cannot access when using retrieval augmented generation?
Permission preservation is storing access attributes with each chunk, and retrieval filtering is enforcing those attributes deterministically at query time. You also test these controls with adversarial queries because permission bugs often appear only under edge cases like shared folders and nested groups.
How do I evaluate a RAG system if I do not have ground truth answers?
Start by labeling which sources are acceptable for each question, and treat retrieval correctness as the first gate. Then evaluate faithfulness by checking that every claim in the answer is supported by retrieved text, which gives you a practical quality signal without requiring a single perfect answer.
Ready to Transform Your RAG Implementation?
At Unstructured, we handle the hard parts of RAG so you can focus on building great AI products. Our platform gives you high-fidelity parsing, structure-aware chunking, and reliable connectors that turn messy enterprise documents into clean, retrievable context—without the brittle pipelines and maintenance overhead. To see how Unstructured simplifies data ingestion and indexing for production RAG systems, get started today and let us help you unleash the full potential of your unstructured data.


