Building Reliable GenAI Applications with Unstructured and Vectara

Ofer Mendelevitch (Vectara) & Ronny Hoesada


Unstructured and Vectara offer a powerful combination to streamline the development of reliable, data-driven GenAI applications. Unstructured's advanced preprocessing capabilities break down the barriers of unstructured data (like PDFs and presentations), while Vectara's trusted GenAI platform empowers you to build intelligent applications fueled by that data. In this joint blog post with Vectara, we demonstrate how to ingest Consumer Financial Protection Bureau (CFPB) reports into Vectara using Unstructured's Destination Connector, and then build a question-answering application over those reports with Vectara's create-ui tool to surface insights into financial trends and regulations.

Introduction

Building reliable GenAI applications that deliver accurate and trustworthy insights is crucial, but the work is often hampered by the complexities of unstructured data and the time-consuming nature of traditional development processes. That's where the combined power of Unstructured and Vectara offers a significant advantage.

Unstructured is a leading LLM data preprocessing solutions provider, empowering organizations to transform their internal unstructured data into formats compatible with large language models. By automating the transformation of complex natural language data found in formats like PDFs, PPTX, HTML files, and more, Unstructured enables enterprises to leverage the full power of their data for increased productivity and innovation.

Vectara provides a trusted Generative AI platform that allows organizations to rapidly create a ChatGPT-like experience (an AI assistant) grounded in the data, documents, and knowledge that they have. Their serverless RAG-as-a-Service solves critical problems required for enterprise adoption, including reducing hallucination, providing explainability/provenance, enforcing access control, allowing for real-time knowledge updates, and mitigating intellectual property/bias concerns from large language models.

Together, Unstructured and Vectara streamline the process of turning unstructured data into the fuel that powers accurate and insightful GenAI applications. Unstructured's advanced preprocessing capabilities handle diverse file types, seamlessly preparing your data for Vectara's powerful GenAI engine. This collaborative approach accelerates development time, empowers data-driven decision-making, and ensures the reliability of your GenAI solutions.

Let's explore a real-world use case to see the power of Unstructured and Vectara in action. Imagine you want to tap into the valuable insights buried within the reports of the Consumer Financial Protection Bureau (CFPB). These documents hold a wealth of data about consumer trends, financial risks, and regulatory insights. We will demonstrate how to effortlessly integrate these reports into a GenAI application, empowering you to extract critical information easily. For this blog post, we will use 7 specific CFPB reports from 2023-2024, including the full CFPB annual 2023 report and specific reports about student loans and the mortgage market.

Using Unstructured’s Destination Connector for Vectara

We will use the Unstructured Connector to ingest the reports and then build a question-answering demo on this data using Vectara’s Create UI library. We placed our PDF files into a local folder for simplicity, but what we show below works just as easily with other sources, such as an AWS S3 bucket, Azure Blob Storage, or Google Cloud Storage (GCS).

Before we start, we need two items from Vectara's OAuth 2.0 configuration: the client ID and the client secret. You can copy both from your Vectara console, under “API access”:

Clicking the “Copy” button copies the client ID to your clipboard, and the dropdown menu on the right lets you copy the client secret. You must also specify your Vectara customer ID, so copy that from your account console as well.

First, we install Unstructured. To do a clean install, you can use Conda:

conda create -n unst python=3.11
conda activate unst
pip install "unstructured[local-inference]" httpx

Our PDF files are under the “/Users/ofer/dev/data/cfpb-reports” local folder. We can ingest them by executing the following command:

unstructured-ingest \
local \
--input-path "/Users/ofer/dev/data/cfpb-reports" \
--strategy "hi_res" \
vectara \
--oauth-client-id "<VECTARA-OAUTH-CLIENT-ID>" \
--oauth-secret "<VECTARA-OAUTH2-SECRET>" \
--customer-id "<VECTARA-CUSTOMER-ID>" \
--corpus-name "GenAI-demo"

This runs the “unstructured-ingest” command with the following arguments:

The source connector is specified as “local” and the “--input-path” argument points to the local folder path. You can also pass ingest options, such as the “--strategy” argument, which is set to “hi_res” here.
The destination connector for Vectara requires 4 arguments: the OAuth Client ID, Secret, Customer ID, and the Corpus Name you want to use in Vectara.
- If a corpus by that name already exists in your account, it will be used. Otherwise, a new corpus with that name will be automatically created for you.
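
If you would rather drive the ingestion from a script, the same command can be assembled in Python. The following is a minimal sketch rather than an official integration: the function names `build_ingest_command` and `run_ingest` and the environment variable names are assumptions made for illustration, while the flags are the ones used in the command above.

```python
import os
import subprocess


def build_ingest_command(input_path, client_id, secret, customer_id,
                         corpus_name, strategy="hi_res"):
    """Return the unstructured-ingest argument list for a local -> Vectara run."""
    return [
        "unstructured-ingest",
        # Source connector and its options.
        "local",
        "--input-path", input_path,
        "--strategy", strategy,
        # Destination connector and its four required arguments.
        "vectara",
        "--oauth-client-id", client_id,
        "--oauth-secret", secret,
        "--customer-id", customer_id,
        "--corpus-name", corpus_name,
    ]


def run_ingest(input_path, corpus_name):
    """Read credentials from the environment and run the ingest command."""
    # Hypothetical variable names; set them to the values copied
    # from the Vectara console.
    cmd = build_ingest_command(
        input_path,
        os.environ["VECTARA_OAUTH_CLIENT_ID"],
        os.environ["VECTARA_OAUTH_SECRET"],
        os.environ["VECTARA_CUSTOMER_ID"],
        corpus_name,
    )
    # Passing a list (no shell=True) avoids quoting problems with
    # paths and secrets; check=True raises if the CLI exits non-zero.
    subprocess.run(cmd, check=True)
```

Calling `run_ingest("/Users/ofer/dev/data/cfpb-reports", "GenAI-demo")` would then be equivalent to the shell invocation, with the credentials kept out of your shell history.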

Querying the data

Now that the data is ingested into Vectara, we can issue queries and chat with the data through Vectara’s query API, either directly or with a tool like vectara-answer or create-ui.
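
For readers who want to see what a direct call to the query API might look like, here is a minimal sketch using only the Python standard library. It assumes Vectara's v1 REST query endpoint and its request shape as we understand them at the time of writing; the endpoint URL, the field names, and the helper names `build_query_payload` and `run_query` should all be treated as assumptions and checked against Vectara's current API reference.

```python
import json
import urllib.request

# Assumed endpoint: Vectara's v1 REST query API. Verify before use.
QUERY_URL = "https://api.vectara.io/v1/query"


def build_query_payload(question, customer_id, corpus_id, num_results=10,
                        max_summarized=5):
    """Build a request body asking for matching passages plus a summary."""
    return {
        "query": [
            {
                "query": question,
                "numResults": num_results,
                # Which corpus to search (the one created during ingestion).
                "corpusKey": [
                    {"customerId": customer_id, "corpusId": corpus_id}
                ],
                # Ask for a generated summary over the top results.
                "summary": [
                    {
                        "maxSummarizedResults": max_summarized,
                        "responseLang": "en",
                    }
                ],
            }
        ]
    }


def run_query(question, customer_id, corpus_id, api_key):
    """POST the query and return the decoded JSON response."""
    body = json.dumps(build_query_payload(question, customer_id, corpus_id))
    request = urllib.request.Request(
        QUERY_URL,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "customer-id": str(customer_id),
            "x-api-key": api_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

The payload builder is kept separate from the network call so the request body can be inspected or logged before anything is sent.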

Let’s try an example with create-ui. To use it we must make sure Node and NPM are installed, and then simply install create-ui using:

npx @vectara/create-ui

This installs and runs the package, walking you through a 3-step installation process.

First, you can select the type of GenAI UI you want to use. In this case, we chose the “Question answering” variant.
Then we choose the “Use my own data” option to point the create-ui application to the CFPB documents already ingested into Vectara.
After providing a name for the application (we named it “cfpb”), and providing the customer ID, corpus ID, and API key, you can define a few questions that are pre-populated in the application, like “What are the risks with student loans?” or “What is the CFPB?”.

That’s it! `create-ui` generates the application in the “cfpb” folder and all you have to do is run it in that folder:

cd cfpb
npm install
npm run start

Now let’s try “What are the risks with student loans?”

The output is a generative summary based on the ingested CFPB documents, along with the list of citations that the application used to generate this summary. It is a great example of Retrieval-Augmented Generation (RAG) at its best.

Conclusion

Unstructured and Vectara's combined expertise provides an unparalleled solution for building GenAI apps that harness the insights trapped within your unstructured data. Employing Unstructured tools will pave the way for seamless data ingestion into Vectara's cutting-edge RAG platform. This partnership empowers you to overcome the complexities of unstructured data, accelerating the development of intelligent GenAI applications that tap into a vast pool of knowledge.

Ready to build a reliable GenAI application? Visit Vectara’s sign-up page for a free Vectara account if you don’t have one already, and the Unstructured website to learn more. Also, join the Unstructured Community Slack to collaborate with like-minded individuals, receive direct support, and stay informed about the latest advancements in this exciting field. Together, Unstructured and Vectara offer a powerful toolkit for unlocking the full potential of your GenAI initiatives.