Nov 1, 2023

The Toolkit for Connecting LLMs to Your Data


Data Staging

In spring 2022, we launched Unstructured to tackle a problem that had burdened us for years: transforming raw files containing text into a format readable by machine learning models. Tired of building custom document processing code for every new project, we sought to create a single solution for turning everything from PDFs to Word documents and markdown into usable data for NLP models. By fall 2022, we had released our first open source package and were steadily improving its capabilities in tandem with early users.

As we moved into 2023, ChatGPT took the world by storm, and we were well positioned to support the legion of engineers seeking to connect diverse, unwieldy datasets to LLMs. Usage of our open source package skyrocketed, and we introduced a free API to help developers get up and running even faster. Building on this momentum, Unstructured now supports thousands of users and hundreds of organizations, and has handled the complex ingestion and preprocessing of tens of millions of workloads across the LLM ecosystem.

Today, we have reached a new turning point in our journey. Our goal moving forward is to harden our open source tools into production-grade capabilities that enterprises can rely on as they begin to operationalize LLM applications. In support of this mission, the team at Unstructured will work across three product lines: open source, API, and our upcoming enterprise platform.

Unstructured: Open Source

As we build out production offerings, the team will continue to support the open source library as the tool of choice for individual developers building prototype applications. With our core file transformation capabilities now in place, our goal is to maintain the library as a simple, reliable entry point for building LLM applications on your data.

Unstructured: API

Unstructured’s paid, production API provides the next level of support for development teams as they turn prototypes into live tools, with no custom code required. This API offering includes premium features that provide a higher level of performance than the open source library, including:

  • Our latest vision transformer models for PDF and image processing

  • More accurate table and image extraction

  • Additional chunking, metadata, and hierarchy extraction capabilities critical to RAG-based architectures

For users who require data privacy and improved performance, our production API is available starting today on the Azure Marketplace (and coming soon on AWS), where you can preprocess unstructured data accurately and securely within your own VPC.

Additionally, users can still access our free API, but we will be introducing usage caps in the coming weeks.

Unstructured: Enterprise Platform

We are excited to announce that we have broken ground on the Unstructured Enterprise Platform, which will provide users with a full-featured ETL experience including:

  • Fully supported upstream and downstream connectors

  • Job scheduling and monitoring

  • Incremental data loads

The Enterprise Platform will automate the LLM ETL process from the document source to the vector database and, most crucially, will expose features like advanced chunking aimed at improving the performance of downstream RAG systems. With the Enterprise Platform controlling ingest, ML developers are free to focus on the critical task of turning LLM magic into business-defining tools.
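To make the idea of incremental data loads concrete, here is a minimal sketch in plain Python. It is not the Enterprise Platform's implementation or API. The names `chunk_text` and `incremental_load` are hypothetical, the chunking is a simple paragraph-packing stand-in rather than the advanced chunking described above, and an in-memory dict stands in for a vector database. The point it illustrates is that hashing each document's content lets a pipeline skip anything that has not changed since the last run.

```python
import hashlib


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of up to max_chars characters.

    Hypothetical helper: a single paragraph longer than max_chars is kept
    whole rather than split.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def incremental_load(
    documents: dict[str, str],
    seen_hashes: dict[str, str],
    index: dict[str, list[str]],
) -> list[str]:
    """Re-chunk only documents whose content hash changed; return their IDs.

    `index` is an assumed stand-in for a vector database, and `seen_hashes`
    is the state carried over between runs.
    """
    updated: list[str] = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) == digest:
            continue  # unchanged since the last load, so skip it
        index[doc_id] = chunk_text(text)
        seen_hashes[doc_id] = digest
        updated.append(doc_id)
    return updated
```

Calling `incremental_load` twice with the same documents processes everything on the first run and nothing on the second. Only documents whose text changes between runs are chunked again.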

Make sure to follow us on LinkedIn and Twitter to keep up with the latest product updates, and don’t forget to join our community Slack to share your thoughts and feedback. We look forward to learning more about your use cases and becoming your go-to tool for LLM document ingestion and preprocessing.