Aug 22, 2024
Using Onyx (formerly Danswer) with Unstructured for Production RAG Chat With Your Docs
Nina Lopatina
RAG
Note: There is a newer integration with Onyx (fka Danswer) described in https://www.onyx.app/blog/danswer-unstructured and available at https://github.com/onyx-dot-app/onyx
This blog post can still help you integrate Unstructured's Serverless API into a repo of your choice.
Introduction
RAG is increasingly moving from pilot to production, and there are a lot of tools out there to help with the deployment process. Danswer is an open source AI assistant for chatting with your enterprise documents. In this blog post, we talk about how we added an Unstructured integration to Danswer to process documents saved in your Google Drive via our Serverless API, and how this improved on the results from the original extraction implementation. In just 5 easy steps, you can integrate Unstructured into any production-ready system! All of the code is in our fork, which adds 13 file types to those available for parsing!
Danswer is an AI Assistant that connects to your company’s docs, apps, and people. Danswer provides a Chat interface and plugs into any LLM of your choice. Danswer can be deployed anywhere and for any scale - on a laptop, on-premise, or to cloud. The system also comes fully ready for production usage with user authentication, role management (admin/curators/basic users), chat persistence, and a UI for configuring AI Assistants and their Prompts. Check out our fork with the Unstructured pre-processing, and read below (or check out the PR) to see what we changed, and why we made these updates!
You can also check out our video walkthrough: Integrating Unstructured with Danswer: A Step-by-Step Guide.
Codebase integration in 5 easy steps:
Added two new functions to danswer/backend/danswer/file_processing/extract_file_text.py:
After importing the Unstructured client and supporting functions, we set up SDK partition requests and the function read_any_file, which reads files via the Unstructured API. In principle this enables parsing of any file type Unstructured supports, but we have only set up a few in the Google Drive connector. For large files or many files we recommend our Ingest pipeline; we used the SDK in this implementation as a proof of concept.
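A minimal sketch of what a function like read_any_file could look like is below. It is not the code from the fork. The names partition_via_unstructured, elements_to_text, and read_any_file_sketch, and the environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, are assumptions made for illustration. The request shape follows recent releases of the unstructured-client Python SDK and may differ in older versions.

```python
import os
from typing import IO, Any, Callable, Dict, List

Element = Dict[str, Any]


def partition_via_unstructured(content: bytes, file_name: str) -> List[Element]:
    """Send one file to the Unstructured API and return its elements."""
    # Imported lazily so the helpers below can be used without the SDK installed.
    from unstructured_client import UnstructuredClient
    from unstructured_client.models import operations, shared

    # Assumed environment variable names; adjust to your deployment.
    client = UnstructuredClient(
        api_key_auth=os.environ["UNSTRUCTURED_API_KEY"],
        server_url=os.environ["UNSTRUCTURED_API_URL"],
    )
    request = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=shared.Files(content=content, file_name=file_name),
        )
    )
    response = client.general.partition(request=request)
    return response.elements or []


def elements_to_text(elements: List[Element]) -> str:
    """Join the non-empty text of each element, one blank line between them."""
    texts = (str(el.get("text", "")).strip() for el in elements)
    return "\n\n".join(t for t in texts if t)


def read_any_file_sketch(
    file: IO[bytes],
    file_name: str,
    partition: Callable[[bytes, str], List[Element]] = partition_via_unstructured,
) -> str:
    """Read a file object and return its text as extracted by `partition`."""
    return elements_to_text(partition(file.read(), file_name))
```

Passing the partition step in as a parameter keeps the text-joining logic testable offline: a stub that returns a list of element dictionaries can stand in for the API call.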
We also added these additional PLAIN_TEXT_FILE_EXTENSIONS:
and additional VALID_FILE_EXTENSIONS:
Updated the Google Drive connector at danswer/backend/danswer/connectors/google_drive/connector.py to use our new read_any_file function by changing the extract_text function. Note that this currently works for Unstructured's supported file types, but not for Google Drive sheets/slides/docs; we will share an update covering those shortly.
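The routing idea can be sketched roughly as follows: files whose extension is on an allow-list go to the Unstructured-backed reader, and everything else keeps the existing extraction path. The function name extract_text_sketch, its parameters, and the extension set are hypothetical. The set is an illustrative subset, not the fork's actual extension lists, and the real connector code is not reproduced here.

```python
from pathlib import Path
from typing import IO, Callable

# Illustrative subset only; NOT the fork's actual extension lists.
UNSTRUCTURED_EXTENSIONS_EXAMPLE = {".docx", ".pptx", ".xlsx", ".csv"}


def extract_text_sketch(
    file_name: str,
    file: IO[bytes],
    unstructured_reader: Callable[[IO[bytes], str], str],
    fallback_reader: Callable[[IO[bytes]], str],
) -> str:
    """Route a file to the Unstructured reader or the fallback by extension."""
    extension = Path(file_name).suffix.lower()
    if extension in UNSTRUCTURED_EXTENSIONS_EXAMPLE:
        return unstructured_reader(file, file_name)
    return fallback_reader(file)
```

Lower-casing the suffix means a file named Report.XLSX is routed the same way as report.xlsx.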
Update backend/requirements/default.txt as per the linked .txt – this took a bit of trial and error, and changing the required versions for some of Danswer’s requirements, but this is a stable configuration that builds.
Update Dockerfile with additional dependencies for the Unstructured library:
Set your Unstructured API key as a background variable, and save the actual value of the key in a .env file in that same directory!
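If the key does not make it from the .env file into the backend container, the first API call fails with an authentication error that can be hard to trace. A small fail-fast check like the sketch below can surface the problem at startup. The function name get_unstructured_api_key and the variable name UNSTRUCTURED_API_KEY are assumptions for illustration, not names confirmed by the fork.

```python
import os
from typing import Mapping, Optional


def get_unstructured_api_key(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "UNSTRUCTURED_API_KEY",  # assumed variable name
) -> str:
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; add it to your .env file and make sure "
            "it is passed through to the backend container."
        )
    return key
```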
These are the main changes we made to integrate Unstructured! This is how little code it took to enable Unstructured in a production-ready chat application – feel free to copy this code to try it out yourself, and reach out at our community Slack if you have any questions.
How to run
To run this, we used the local Docker deployment setup from Danswer’s quickstart, with our fork:
Clone our fork of the Danswer repo:
Navigate to danswer/deployment/docker_compose
Start your Docker engine. Then, to build the containers from source and start Danswer, run:
Danswer with the Unstructured parser for Google Drive will now be running on http://localhost:3000
Results
Here are a few before-and-after comparisons, using the lines from the first episode of The Office as a stand-in for a call transcript:
Before Unstructured integration:
Danswer does not have built-in support for processing this file as a .xlsx or .csv directly from Drive; the file would need to be uploaded directly into the chat.
After Unstructured integration:
Conclusion
This is just a quick first step for using the Unstructured API with Danswer, and we are looking forward to further collaboration with the Danswer team on a deeper integration. This lightweight integration is missing Ingest pipelining for faster speed, metadata extraction (since that is handled elsewhere), and processing from additional sources besides Google Drive. But now you can process 13 additional file types in Danswer! Let us know if you’re interested in seeing more!