
May 22, 2025

Level Up Your GenAI Apps: Data Processing Power-Ups

Maria Khalusova

RAG

In the previous blog post, we laid the data processing groundwork by covering ingestion, partitioning, chunking, embedding, and basic indexing. Now, let's talk about how we can further enhance this foundation with more sophisticated preprocessing techniques.

Contextual Chunking: Smarter Chunks with Added Context

Let’s start with chunking. It’s tempting to think of chunking as a solved problem: pick a method (fixed-length, section-based, and so on) and start splitting your documents. But every chunking strategy, even the smart ones, introduces an issue: it doesn’t just divide text, it fragments context.

Imagine chunking thousands of lengthy legal documents. A segment from any of these documents will, on its own, contain templated legal language but zero context about which original document it belongs to. A user of the downstream system looking for answers about a specific contract may encounter a chunk that mentions termination clauses or payment terms, but have no idea which agreement those terms apply to. This lack of context can lead to confusion, irrelevant answers, or even hallucinated connections between unrelated contracts.

Smart chunking strategies can preserve local context to some extent but have no mechanism to preserve the global meaning of complex documents. 

Contextual Chunking solves this by prepending each chunk with a concise summary of its parent document before embedding. This technique aims to preserve the vital relationship between a chunk and its broader document, ensuring that even a small segment of text is understood within its original document context. 

Adding document-level context to individual chunks makes them semantically "fuller", which has a significant impact on downstream RAG performance. Contextual Chunking alone has been shown to reduce the top-20-chunk retrieval failure rate by an average of 35% across various domains, and the gains can be even higher when additional advanced strategies are employed. You can learn more about accuracy improvements and how this technique works under the hood in this blog post.

In Unstructured, you can activate Contextual Chunking with a simple toggle in Chunker Node settings - this method integrates seamlessly with all existing chunking strategies. 
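
Under the hood, the pattern is straightforward. Here is a minimal sketch of the technique itself, assuming generic stand-ins (call_llm, embed) for whatever LLM and embedding clients you already use; it is not Unstructured's implementation.

# Minimal sketch of Contextual Chunking, not Unstructured's implementation.
# `call_llm` and `embed` are hypothetical stand-ins for your own clients.

def call_llm(prompt: str) -> str:
    ...  # e.g., an API call to your preferred LLM

def embed(text: str) -> list[float]:
    ...  # e.g., a call to your embedding model

def contextual_chunks(document_text: str, chunks: list[str]) -> list[dict]:
    # 1. Produce one concise summary of the parent document.
    summary = call_llm(
        "Summarize this document in 2-3 sentences, keeping key names, "
        "parties, and dates:\n\n" + document_text
    )
    enriched = []
    for chunk in chunks:
        # 2. Prepend the document summary so each chunk carries global context.
        contextualized = f"Document context: {summary}\n\n{chunk}"
        enriched.append({"text": contextualized, "embedding": embed(contextualized)})
    return enriched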

Multimodal Enrichments: Unlocking Information from Images and Tables

Text alone doesn’t tell the whole story. Critical signals often live in images and tables. If your pipeline ignores these elements or handles them naively, you're leaving valuable insights on the table.

The Problem: Hidden Semantics in Non-Textual Elements

  • Images can contain labeled diagrams, screenshots, product photos, and more.

  • Tables often distill dense data into structured summaries.

  • Scanned documents or slides may include tables as images that resist typical parsing methods.

If these aren’t enriched, your LLM sees blank space where meaning should be.

Image Description: Capturing Semantics of Visual Elements

Images can convey essential information, whether it’s a scanned diagram, a chart, or a product photo in a user manual. With Image Description enrichment, Unstructured uses VLMs to generate concise natural language summaries of visual content. These descriptions are then used as the element’s text and can be embedded directly as chunks, making them searchable and usable in downstream applications.
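
To illustrate the general pattern (this is not Unstructured's internal code), here is a rough sketch of how a VLM call might turn an image into a searchable description; the model choice and prompt are assumptions.

# Sketch: describing an image with a vision-language model via the OpenAI API.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any VLM-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in 2-3 sentences, focusing on "
                         "any text, labels, or data it contains."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The resulting description becomes the element's text, so it can be
# chunked and embedded like any other content.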

Table Description: Summarizing Structured Data

Tables often encapsulate dense and nuanced data, but extracting their meaning requires more than just parsing rows and columns. Table Description enrichment applies VLM-based summarization to extract key takeaways from each table, transforming structured data into natural language insights.

Table to HTML: Preserving Structure for Better Parsing

When tables appear as images in scanned PDFs or slides, traditional parsers struggle to extract usable structure. Table to HTML enrichment solves this by applying VLMs to convert table images into clean HTML representations.

This not only captures the full richness of the table’s structure with all the headers, merged cells, nested rows, and so on, but also enables more precise downstream processing. HTML-format tables can be embedded, rendered in UI applications, or processed further with LLMs that understand HTML semantics.
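
For example, once a table image has been converted to HTML, its structure can be recovered programmatically. The sketch below assumes the HTML is stored in the element's text_as_html metadata field; the sample element itself is made up.

# Sketch: parsing a table's HTML representation back into a DataFrame.
# Requires an HTML parser such as lxml installed for pandas.read_html.
from io import StringIO
import pandas as pd

element = {
    "type": "Table",
    "metadata": {
        "text_as_html": (
            "<table>"
            "<tr><th>Quarter</th><th>Revenue</th></tr>"
            "<tr><td>Q1 2025</td><td>$50B</td></tr>"
            "</table>"
        )
    },
}

# read_html returns one DataFrame per <table> found in the HTML.
df = pd.read_html(StringIO(element["metadata"]["text_as_html"]))[0]
print(df)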

Together, these enrichments extend Unstructured’s processing capabilities beyond plain text, ensuring your LLM stack benefits from all the signals in your unstructured documents.

Named Entity Recognition: Extracting Structured Knowledge

While chunking and enrichment techniques help LLMs understand what a document says, Named Entity Recognition (NER) enrichment goes a step further by helping systems understand who, what, and where the text is referring to at a structured, machine-readable level.

Unstructured’s NER enrichment automatically extracts named entities, such as people, organizations, locations, dates, and domain-specific entities, from each chunk of text. But this enrichment doesn’t stop at surface-level mentions. It also captures relationships between entities, enabling a deeper understanding of how they interact within the document.

For example, from an element like:

{
  "element_id": "00bc4b34a2d76eb7303068bff149ef43",
  "text": "On January 15, 2025, Apple Inc. announced the launch of its latest product, the iPhone 15, during a press event held at their headquarters in Cupertino, California.\nThe event was streamed live on their official Twitter account @AppleEvent.\nThe company expects to make over $50 billion in sales during the first quarter of 2025, with projections showing a 10% increase compared to the previous year.\nApple’s CEO, Tim Cook, was joined by the company’s CTO, John Doe, and other key executives.\nThe new iPhone boasts advanced features powered by Apple’s M2 chip, which is expected to revolutionize mobile technology.\nThe release event also highlighted Apple's ongoing commitment to sustainability, with the iPhone 15 being made from 100% recycled materials.\nThe event concluded with a celebration of Apple’s 50th anniversary.",
  "type": "CompositeElement",
  "metadata": {
    "filename": "mypdfs/apple.pdf",
    "page_number": 1
  }
}

The NER enrichment might extract:

{
    "items": [
      {
        "entity": "Apple Inc.",
        "type": "ORGANIZATION"
      },
      {
        "entity": "iPhone 15",
        "type": "PRODUCT"
      },
      {
        "entity": "January 15, 2025",
        "type": "DATE"
      },
      {
        "entity": "Cupertino",
        "type": "LOCATION"
      },
      {
        "entity": "California",
        "type": "LOCATION"
      },
      {
        "entity": "$50 billion",
        "type": "MONEY"
      },
      {
        "entity": "10%",
        "type": "PERCENT"
      },
      {
        "entity": "Tim Cook",
        "type": "PERSON"
      },
      {
        "entity": "John Doe",
        "type": "PERSON"
      },
      {
        "entity": "M2 chip",
        "type": "PRODUCT"
      },
      {
        "entity": "Apple’s 50th anniversary",
        "type": "EVENT"
      }
    ],
    "relationships": [
      {
        "from": "Apple Inc.",
        "relationship": "founded_on",
        "to": "Apple’s 50th anniversary"
      },
      {
        "from": "Apple Inc.",
        "relationship": "based_in",
        "to": "Cupertino"
      },
      {
        "from": "Apple Inc.",
        "relationship": "has_office_in",
        "to": "California"
      },
      {
        "from": "iPhone 15",
        "relationship": "developed_by",
        "to": "Apple Inc."
      },
      {
        "from": "Tim Cook",
        "relationship": "has_role",
        "to": "CEO"
      },
      {
        "from": "John Doe",
        "relationship": "has_role",
        "to": "CTO"
      },
      {
        "from": "Apple Inc.",
        "relationship": "occurred_on",
        "to": "January 15, 2025"
      },
      {
        "from": "Apple’s 50th anniversary",
        "relationship": "occurred_in",
        "to": "Cupertino"
      },
      {
        "from": "Apple’s 50th anniversary",
        "relationship": "occurred_in",
        "to": "California"
      }
    ]
}

This structured metadata is stored alongside the chunk, enabling a range of powerful downstream applications, chief among them: GraphRAG, which we briefly covered in one of the earlier posts from this series.

The NER enrichment in Unstructured acts as the foundational layer for building knowledge graphs. As documents are processed and chunked, each chunk’s named entities and their relationships can be:

  • Added to a graph database like Neo4j (see the sketch after this list)

  • Added to AstraDB for use with the GraphRetriever library for lightweight, dynamic knowledge graphs

  • Used to trace high-confidence knowledge paths between concepts
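
As a rough sketch of the first option, the extracted entities and relationships could be loaded into Neo4j with the official Python driver; the connection details are placeholders, and the ner dictionary is assumed to have the shape of the example output above.

# Sketch: loading NER output into Neo4j. Connection details are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_ner_output(ner: dict) -> None:
    with driver.session() as session:
        # Upsert each entity as a node.
        for item in ner["items"]:
            session.run(
                "MERGE (e:Entity {name: $name}) SET e.type = $type",
                name=item["entity"], type=item["type"],
            )
        # Connect entities with typed relationships.
        for edge in ner["relationships"]:
            session.run(
                "MATCH (a:Entity {name: $src}), (b:Entity {name: $dst}) "
                "MERGE (a)-[:RELATED {type: $rel_type}]->(b)",
                src=edge["from"], dst=edge["to"], rel_type=edge["relationship"],
            )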

This approach fundamentally shifts how retrieval works in RAG—from pure semantic similarity to explicit reasoning over known facts and relationships.

As with all enrichments in Unstructured, NER is composable: it works alongside chunking, multimodal enrichments, and embeddings to give you full control over how your pipeline evolves.

Conclusion

From preserving global context in chunks, to interpreting visual data, to structuring knowledge for graph-based reasoning: these advanced enrichments unlock new levels of capability for RAG. And the best part? All of these techniques are available in Unstructured today, ready to be enabled, combined, and adapted to your pipeline. The tools are here. The performance gains are real. The only question is: what will you build with them?
