Apr 3, 2025

Getting Started with Unstructured and Delta Tables in Databricks

Maria Khalusova

RAG

The Challenge: Fragmented Enterprise Data

Critical context needed for retrieval-augmented generation (RAG) systems, or LLM agents, is often fragmented across various enterprise systems—blob storage, company wikis, Google Drive, Dropbox, Slack, and more. Beyond the complexity of authentication mechanisms and data ingestion, unstructured data itself exists in a wide range of formats—PDFs, Word documents, spreadsheets, emails, and more—making it difficult to consolidate and prepare for GenAI use cases.

The Solution: Unstructured + Databricks

Unstructured simplifies this challenge by providing a robust preprocessing layer that connects to enterprise data sources, extracts content, and transforms it into structured JSON format. With built-in connectors, AI-driven transformations, and flexible workflows, Unstructured ensures your data is accurately extracted, enriched, and formatted for downstream use, no matter where it comes from.

In this getting started guide, we’ll show how to set up a data preprocessing workflow that automatically converts a collection of heterogeneous documents in Amazon S3 into organized, structured data in a Delta Table in Databricks, ready for RAG use with Databricks Vector Search.

End-to-End Workflow Overview

This is what the final workflow will look like: 

The process follows a simple yet powerful path: the unstructured documents, PDFs in this case, are ingested from S3 via an S3 source connector and partitioned into structured JSON using Claude 3.7 Sonnet, which is then chunked with one of Unstructured's smart chunking strategies. For each chunk, an embedding vector is generated using OpenAI's text-embedding-3-small model, and the results are uploaded into a Delta Table in Databricks via a Delta Tables in Databricks destination connector. Once the workflow is set up, it can run automatically on your schedule, keeping your knowledge base fresh and up to date. Let's build it together!

Prerequisites

Unstructured Account

To start transforming your data with Unstructured Platform, you'll need to sign up on the Unstructured For Developers page. Once you do, you can log into the Platform and process up to 1,000 pages per day for free for the first 14 days.

Amazon S3 Setup

Unstructured can ingest documents from a wide variety of data sources, and you can even ingest data from more than one source in a single workflow. To keep this guide simple, however, we'll use a single source: Amazon S3.

You'll need an AWS account, along with your AWS access key ID and AWS secret access key for authentication. You'll also need an Amazon S3 bucket with the correct access settings applied. Learn how to set up an S3 bucket for Unstructured. Make sure to upload some files to your S3 bucket, so there's something to process! 😉 Use this list of supported file types to learn what Unstructured can process.
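If you'd rather script the upload than drag files into the AWS console, here's a minimal sketch using boto3. The bucket name and file paths are placeholders, and credentials are assumed to come from your environment or AWS config:

```python
import boto3

# Placeholders: replace with your bucket and local files. Credentials are read from
# the environment (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) or ~/.aws/credentials.
BUCKET = "my-unstructured-demo-bucket"
FILES = ["docs/report.pdf", "docs/handbook.docx"]

s3 = boto3.client("s3")
for path in FILES:
    key = path.split("/")[-1]
    s3.upload_file(path, BUCKET, key)
    print(f"Uploaded {path} to s3://{BUCKET}/{key}")
```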

Databricks Configuration

You'll need a Databricks account, and if you're creating a new one, make sure to complete the cloud setup. Note down the URL of your workspace and check that Unity Catalog is enabled in the workspace.

Within your Unity Catalog, you’ll need: 

  • A catalog: in this example, it will be called maria_catalog

  • A schema: here, maria_catalog.default

  • A volume: here, maria_catalog.default.demo_volume

For this guide, you will also need a SQL warehouse, which will be used to create and manage tables in your schema. Once you create and start your SQL warehouse, click on it and navigate to the Connection details tab. Copy the values for Server hostname and HTTP path.

For authentication, we recommend using a Databricks managed service principal rather than a personal access token. If you don't have one yet, you'll need to create it and grant it the necessary permissions. On your catalog, the service principal needs the following privileges: USE CATALOG, USE SCHEMA, READ VOLUME, WRITE VOLUME, and CREATE TABLE. On your SQL warehouse, it needs the Can use permission.
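If you prefer SQL over clicking through the UI, the objects and grants above can also be created from a notebook or the SQL editor. Below is a rough sketch using the databricks-sql-connector, run with credentials that have sufficient privileges; the hostname, HTTP path, token, and service principal application ID are placeholders. The warehouse's Can use permission is granted separately in the SQL warehouse's Permissions dialog (or via the Permissions API).

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholders: use your workspace values and a sufficiently privileged access token.
SERVER_HOSTNAME = "dbc-xxxxxxxx-xxxx.cloud.databricks.com"
HTTP_PATH = "/sql/1.0/warehouses/xxxxxxxxxxxxxxxx"
TOKEN = "<admin-personal-access-token>"

# The service principal is referenced by its application (client) ID, in backticks.
SP = "`00000000-0000-0000-0000-000000000000`"

statements = [
    "CREATE CATALOG IF NOT EXISTS maria_catalog",
    "CREATE SCHEMA IF NOT EXISTS maria_catalog.default",
    "CREATE VOLUME IF NOT EXISTS maria_catalog.default.demo_volume",
    # Granting at the catalog level; the privileges are inherited by the schema and volume.
    f"GRANT USE CATALOG, USE SCHEMA, READ VOLUME, WRITE VOLUME, CREATE TABLE "
    f"ON CATALOG maria_catalog TO {SP}",
]

with sql.connect(
    server_hostname=SERVER_HOSTNAME, http_path=HTTP_PATH, access_token=TOKEN
) as conn:
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
```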

Everything you need to know to set up your Databricks workspace to work with Unstructured, including all necessary permissions, is covered in this documentation page. We even included videos to help you get started!

Building the Workflow

Step 1: Create an S3 Source Connector

Log in to your Unstructured account, click Connectors in the left sidebar, make sure Sources is selected, and click New to create a new source connector. Alternatively, use this direct link. Choose S3 and enter the required info about your bucket. If you're unsure how to obtain the required credentials, the Unstructured docs contain helpful instructions and videos that guide you through setting up your S3 bucket for ingestion and obtaining the necessary credentials - check them out.

Once the connector is saved, Unstructured will check to make sure it can successfully connect to your S3 bucket. 

Step 2: Create a Delta Tables in Databricks Destination Connector

To create a destination connector, navigate to Connectors in the Platform UI, switch to Destinations, and click New; or use this direct link.

Give your connector a descriptive name, and choose Delta Tables in Databricks as the provider: 

Next, enter:

  1. The Server hostname and HTTP path values that you can find in your SQL Warehouse’s settings under the Connection Details tab. 

  2. Your service principal’s ID and secret. 

Finally, click Continue and enter the remaining information:

Note

You don't need to specify a table name—Unstructured will create one automatically! 

If you prefer to create your own table, just make sure it matches our schema requirements, which you can find in the documentation.

Save the connector. Unstructured will check to make sure it can successfully connect to your Databricks account, catalog, volume, and SQL warehouse.

Step 3: Configure Your Processing Workflow

A source and a destination connector are the necessary prerequisites for building a data processing workflow - Unstructured needs to know where to ingest the data from and where to upload the results of data processing. Now that you have set up the S3 source and the Delta Tables in Databricks destination connectors, you can proceed to the best part - creating a data processing workflow!

Navigate to the Workflows tab in Unstructured Platform, and click New Workflow. Choose the Build it Myself option and continue. Your workflow DAG starts with the three essential components: Source, Partitioner, and Destination.

Click on the DAG nodes for the Source and Destination and select your S3 source and Delta Table in Databricks connectors. 

A Partitioner is a required node in any workflow. This step transforms your unstructured data into structured JSON format using one of the available partitioning strategies: 

  • Auto (default): A dynamic meta-strategy that selects the optimal partitioning approach—either VLM, High Res, or Fast—based on a document's unique characteristics, such as file type and page content. This strategy intelligently applies the most suitable method to minimize processing costs while maximizing accuracy.

  • VLM: A partitioning strategy that uses state-of-the-art VLMs (Vision-Language Models) to extract content from complex documents where traditional OCR struggles. It's ideal for challenging cases like noisy scans, handwriting, and complex nested forms.

  • High Res: This approach combines advanced OCR with document understanding capabilities, making it effective for image-based documents like scanned files. It can handle documents with images and simple tables.

  • Fast: A rapid strategy optimized for quick text extraction from documents such as markdown or office suite files. It's a cost-effective solution for documents with easily extractable text and no images.

For this guide, we'll opt for the VLM strategy. You can choose the model you want to use to transform the documents, for example Claude 3.7 Sonnet:

Click on the plus buttons between the nodes to explore what other data transformations you can add to your DAG. Unstructured offers chunking nodes, embedding nodes, and a set of data enrichment transformations. 

To complete a simple data transformation pipeline for a RAG application, we’ll add a Chunker node, and an Embedder node:

The documents coming out of the Partitioner node are structured as JSON containing so-called document elements. Each document element has a type (e.g., NarrativeText, Title, or Table), the extracted text, and any metadata Unstructured was able to obtain.

Elements can be large pieces of text, such as a lengthy paragraph or a large table, or can be small, like individual list items or headers. 
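To make this concrete, here is a representative (and entirely made-up) element roughly in the shape the partitioner produces; the field values are illustrative only:

```python
# An abridged, made-up example of a single document element produced by partitioning.
element = {
    "type": "NarrativeText",
    "element_id": "de6a97b0c1e24f0f9e3a0c2b5f8d7a11",  # hypothetical ID
    "text": "Quarterly revenue grew 12% year over year, driven by subscription growth.",
    "metadata": {
        "filename": "q3-report.pdf",
        "filetype": "application/pdf",
        "page_number": 4,
        "languages": ["eng"],
    },
}

print(element["type"], "-", element["text"][:40])
```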

The chunking node in the workflow uses one of Unstructured's smart chunking strategies to rearrange the document elements into perfectly sized "chunks" that stay within the limits of an embedding model while incorporating a reasonable amount of context - all to improve retrieval precision. You can learn more about chunking here.

Finally, the Embedder node uses an embedding model of your choice to generate a vector representation (embedding) of every chunk, which you need for similarity search downstream.
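The Embedder node handles this for you, but for intuition, an equivalent call with the OpenAI Python SDK looks roughly like this (the chunk text is a made-up example, and an OPENAI_API_KEY environment variable is assumed):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunk = "Unstructured converts source documents into structured JSON elements."
response = client.embeddings.create(model="text-embedding-3-small", input=chunk)

vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions by default for text-embedding-3-small
```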

If no particular node in the DAG is selected, the right panel shows general workflow settings you can modify, such as:

  • Workflow name

  • Schedule

  • Whether all documents need to be reprocessed each time the workflow is triggered, or only new ones

Give your workflow a name, choose a schedule, and save! And just like that, you have a fully working data processing workflow to prepare all of your data. On the Workflows page, set your workflow to Active and click Run to trigger a job that will execute your workflow.

Step 4: Track Job Progress

Navigate to the Jobs tab to track the progress. Here you can click on your jobs to explore what they are doing. Once the job completes, you’ll find the details here, including error logs, if any:

View Your Processed Data in a Delta Table

Once the job is finished, your data is processed, and you can find it in your catalog under the schema you specified. If you didn't provide a table name when setting up your destination connector, Unstructured creates a table whose name contains part of the workflow ID and, if your workflow includes an Embedder node, the embedding model details.

If Unstructured created the table, it will initially be accessible only to your service principal. Grant your own user SELECT on this table to view the results:
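Both the grant and a quick sanity check can be done with the databricks-sql-connector as well. The table name and user below are placeholders (copy the real table name from Catalog Explorer), and the statements must be run by the table's owner (the service principal) or a sufficiently privileged admin:

```python
from databricks import sql

# Placeholders: copy the real table name from Catalog Explorer, and use credentials
# that can grant on the table (the service principal that owns it, or an admin).
TABLE = "maria_catalog.default.u_abc12345_text_embedding_3_small"
USER = "`maria@example.com`"  # your Databricks username

with sql.connect(
    server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",
    access_token="<token-for-the-table-owner-or-an-admin>",
) as conn:
    with conn.cursor() as cur:
        cur.execute(f"GRANT SELECT ON TABLE {TABLE} TO {USER}")  # let your user read it
        cur.execute(f"SELECT * FROM {TABLE} LIMIT 5")            # peek at the processed chunks
        for row in cur.fetchall():
            print(row)
```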

Conclusion

And there you have it! You have set up a complete data processing workflow with Unstructured Platform. You've learned how to pull documents from your Amazon S3 bucket, transform them into structured JSON, create vector embeddings, and store the results in a Delta Table in Databricks. With Unstructured’s data processing capabilities, your previously unstructured data is now ready to power your RAG applications, or supply context to your LLM agents. 
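If you want to go one step further, you can point a Databricks Vector Search delta-sync index at the table Unstructured populated. The sketch below uses the databricks-vectorsearch SDK; the endpoint, index, table, and column names are placeholders you'd adjust to your own setup (check your table's actual primary-key and embedding columns in Catalog Explorer):

```python
from databricks.vector_search.client import VectorSearchClient  # pip install databricks-vectorsearch

client = VectorSearchClient()  # picks up your Databricks authentication

# All names below are placeholders; the source table needs Change Data Feed enabled.
index = client.create_delta_sync_index(
    endpoint_name="my-vector-search-endpoint",
    index_name="maria_catalog.default.docs_index",
    source_table_name="maria_catalog.default.u_abc12345_text_embedding_3_small",
    pipeline_type="TRIGGERED",
    primary_key="id",                      # adjust to your table's key column
    embedding_dimension=1536,              # text-embedding-3-small output size
    embedding_vector_column="embeddings",  # adjust to your table's embedding column
)
```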

It's time to put your knowledge into action! Try Unstructured Platform with a 14-day trial, allowing you to process up to 1,000 pages per day. For enterprises with more complex needs, we offer tailored solutions. Book a session with our engineers to explore how we can optimize Unstructured for your unique use case.
