Jan 2, 2024

Unstructured's Commercial SaaS API



Supporting File Transformation for LLMs from Prototype to Production.

Since launching our open source library in the fall of 2022, our community has enjoyed tremendous growth. In a little over a year, the Unstructured library has surpassed 4 million downloads and is used in nearly 10,000 public GitHub repositories, 100 Python packages, and behind the scenes in dozens of LLM-powered products. That growth, however, has come with challenges. Processing 25+ document types with diverse formats requires a maze of models and dependencies, making the open source library difficult to install and manage.

To provide a more seamless experience and position us to rapidly deliver new features to users, we are ramping up investment in our commercial APIs. We will provide three versions of the API:

  1. A commercial SaaS API, hosted by Unstructured

  2. A free-tier SaaS API, hosted by Unstructured and capped at 1,000 pages per month

  3. Marketplace APIs available in Azure and AWS that allow users to run the API in their own VPC

The commercial SaaS API is the primary tool for supporting production file transformation workloads. To enhance data security and reliability, each commercial API user gets dedicated cloud infrastructure. That means your data is never co-located with other users’ data, and heavy traffic from another customer does not impact how the API performs for you. We’re also undergoing SOC 2 certification to provide another layer of assurance that your data is safe with our SaaS API.

The commercial SaaS API works on a pay-as-you-go basis: you only pay for what you use. Pricing is $2.66 per hour of compute. For example, if you use the API moderately, with tasks that add up to about one hour of compute every day, a 30-day month comes to 30 hours × $2.66, or $79.80.
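To make the arithmetic above easy to adapt to your own workload, here is a minimal sketch of a cost estimator. It assumes the bill scales linearly with compute hours at the quoted $2.66 rate, with no minimums or discounts, and it rounds to the nearest cent; the function name and the 30-day default are illustrative choices, not part of the API.

```python
# Rough monthly cost estimate for pay-as-you-go compute.
# Assumption: cost = hours of compute x hourly rate, nothing else.

HOURLY_RATE_USD = 2.66  # per-hour compute price quoted in this post


def estimate_monthly_cost(
    hours_per_day: float,
    days: int = 30,
    rate: float = HOURLY_RATE_USD,
) -> float:
    """Return the estimated bill in USD, rounded to cents."""
    if hours_per_day < 0 or days < 0:
        raise ValueError("hours_per_day and days must be non-negative")
    return round(hours_per_day * days * rate, 2)


if __name__ == "__main__":
    # One hour of compute per day over a 30-day month.
    print(f"${estimate_monthly_cost(1):.2f}")
```

Passing a different `days` value (for example, 31) or a fractional `hours_per_day` gives an estimate for other usage patterns under the same linear assumption.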

The existing free-tier SaaS API will continue operating, but usage will be capped at 1,000 pages per month. In an effort to improve our file transformation tools, we will also begin collecting and storing documents users upload to the free-tier API. These documents will be used for model training and evaluation. If you have data that you don’t want used for model training, you should move to the commercial API or use our open source offering.

Our Marketplace APIs provide the same capabilities as the commercial SaaS, but run in a customer’s VPC. This is the ideal solution for customers with sensitive data that they can’t send outside of their network. Check out the Azure and AWS deployment guides for details on how to get started.

We will continue to maintain the open source Python library as a prototyping tool, but new features will only be available through our API offerings. For example, the API includes a more accurate table extraction model. Additional enhancements that we’ll roll out during early 2024 include:

  • More advanced chunking options, including chunking based on content similarity, chunking based on hierarchies detected in documents, and chunking that accounts for embedded images. These chunking strategies will improve downstream RAG performance.

  • Support for audio formats, such as MP3, MP4, and WAV files

  • Processing embedded images within documents

  • Improved hierarchy detection within documents

Stay in touch with us on LinkedIn and Twitter to keep up with the latest updates on our API offerings. If you have a use case you’d like to discuss, join us on Slack or reach out to us at hello@unstructured.io. We love to learn about what you're doing and how we can help make you successful with LLMs.