Apr 13, 2023

Leveraging Enterprise Specific Data With LLMs: How Unstructured Unlocked 100k+ Pages of IRS Manuals

Unstructured

LLM

Unstructured makes it fast and easy to preprocess an organization’s internal data and render it into a format that can be used with LLMs. Rather than hacking together custom Python scripts, regular expressions, and open-source OCR packages, you can send almost any raw file containing natural language to Unstructured’s API and receive back nice, clean JSON. In this post, we show how the IRS, for example, could rapidly deploy an LLM solution on its own data for its employees. This architecture extends to any organization that wants to deliver a ChatGPT-style experience with its data.

Getting Down to Work

Scraping

We started by grabbing more than 100k pages of IRS manuals, largely in PDF format, from the IRS government website. Note that you can use Unstructured’s API to preprocess not only PDFs, but also HTML, Microsoft Office file types, emails, and more.



Preprocessing

Once we’d gathered our data, the first step was to use Unstructured’s API (or you can deploy our image on your own hardware) to preprocess the raw PDFs and transform them into clean JSON. See the README here to follow the install instructions and clone the demo repo. Once you’re ready to use Unstructured, here’s the one and only command you’ll need to turn your data into easily digestible content:

# --partition-by-api is optional: it sends files to the *NEW* API instead of processing them locally
PYTHONPATH=. ./unstructured/ingest/main.py \
  --local-input-path <ingest-input-dir> \
  --structured-output-dir <ingest-output-dir> \
  --partition-by-api



If you’re extracting data locally without using the API, you can increase throughput with the --num-processes parameter, for example 8 processes when running on hardware with 64 GB available. Below is just one example of how Unstructured will transform the raw data into a structured JSON format.
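To give a feel for working with that output, here is a minimal sketch in Python. It assumes each output file is a JSON list of elements carrying a `type`, a `text`, and a `metadata` object; the sample record is made up for illustration, and the exact field names should be checked against a file from your own run.

```python
import json

# Hypothetical sample of one preprocessed file (placeholder text, not real
# manual content). Real output may carry additional fields.
SAMPLE_OUTPUT = """
[
  {"type": "Title", "text": "Example Section Heading",
   "metadata": {"filename": "example.pdf", "page_number": 1}},
  {"type": "NarrativeText", "text": "This paragraph stands in for narrative text.",
   "metadata": {"filename": "example.pdf", "page_number": 1}},
  {"type": "ListItem", "text": "A placeholder list item",
   "metadata": {"filename": "example.pdf", "page_number": 2}}
]
"""


def load_elements(raw_json):
    """Parse the JSON text of one output file into a list of element dicts."""
    return json.loads(raw_json)


def texts_of_type(elements, wanted_types):
    """Return the text of every element whose type is in wanted_types."""
    return [el["text"] for el in elements if el.get("type") in wanted_types]


elements = load_elements(SAMPLE_OUTPUT)
# Keep body content only, dropping headings.
body_text = texts_of_type(elements, {"NarrativeText", "ListItem"})
print(body_text)
```

Filtering by element type like this is one simple way to drop headings or other content you don't want to embed downstream.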



Once Your Data Has Been Preprocessed: Working with Structured Data

Once Unstructured has done the heavy lifting of converting the raw files to usable JSON, we can nest the preprocessed data within an architecture that allows an LLM to benefit from this organization-specific data.
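Before the data reaches an embedding model, it usually needs to be grouped into passages. The sketch below shows one possible approach, not necessarily the one used in this project: start a new chunk at every `Title` element and whenever a chunk would exceed a size limit. The `chunk_by_title` helper, the `max_chars` parameter, and the sample elements are all hypothetical.

```python
def chunk_by_title(elements, max_chars=1000):
    """Group element texts into chunks, starting a new chunk at each Title
    element or when adding the next text would exceed max_chars."""
    chunks = []
    current = []
    current_len = 0
    for el in elements:
        text = el.get("text", "").strip()
        if not text:
            continue  # skip empty elements
        starts_section = el.get("type") == "Title"
        too_long = current_len + len(text) > max_chars
        if current and (starts_section or too_long):
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Placeholder elements in the assumed output shape.
sample_elements = [
    {"type": "Title", "text": "Section A"},
    {"type": "NarrativeText", "text": "First paragraph of section A."},
    {"type": "Title", "text": "Section B"},
    {"type": "NarrativeText", "text": "First paragraph of section B."},
]

for chunk in chunk_by_title(sample_elements):
    print(repr(chunk))
```

Keeping a heading together with the text beneath it tends to give each chunk enough context to be useful on its own when it is later retrieved.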



(For more detail on each of those, make sure to check out our article LLMs and the Emerging ML Tech Stack here.)

For this particular project we tried out Pinecone for storage (we’ve also had great luck with Chroma, Weaviate, Qdrant, and others), OpenAI for embeddings and the LLM (because it was easy, but we could just as easily have gone to Hugging Face to snag open-source alternatives), and LangChain as a programming framework (LlamaIndex works great too!). Again, it’s important to note that once we’re working with the preprocessed data, it’s easy to experiment with different downstream libraries. For example, if hybrid search seems like a compelling way to go, it’d be easy to evaluate LlamaIndex and/or LangChain + Supabase.
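To make the roles of the embedding model and the vector database concrete, here is a toy, standard-library-only sketch of the store-and-retrieve step. It does not use the Pinecone, OpenAI, or LangChain APIs: the word-count `embed` function stands in for a real embedding model, `InMemoryVectorStore` stands in for a vector database, and the passages are placeholders rather than text from the manuals.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline would call an
    embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self._rows = []  # (doc_id, text, vector)

    def add(self, doc_id, text):
        self._rows.append((doc_id, text, embed(text)))

    def query(self, question, top_k=2):
        """Return the top_k (score, doc_id, text) rows, best match first."""
        q = embed(question)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add("doc-1", "Placeholder passage about late filing penalties.")
store.add("doc-2", "Placeholder passage about the whistleblower office.")
store.add("doc-3", "Placeholder passage about the appeals process.")

results = store.query("Tell me about the Whistleblower Office", top_k=1)
print(results[0][1])
```

Because the store only ever sees text in and ranked text out, swapping the toy pieces for a real embedding model or a different vector database leaves the rest of the pipeline unchanged, which is what makes downstream experimentation cheap.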



Chat Your IRS Data: We’re ready to field some questions!

Once the vector DB is populated with all 100k preprocessed documents and their corresponding embeddings, all we’ve got to do is query. Come one, come all to one of the two options below and bring all the musings you’ve ever had on IRS policy, procedure, and process.
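At query time, the passages returned by the vector DB are typically placed into the prompt alongside the user's question. The following is a minimal sketch of that assembly step, assuming the retrieved passages arrive as a list of strings ordered by relevance; the `build_prompt` helper and the prompt wording are illustrative, not what the hosted instance actually uses.

```python
def build_prompt(question, passages, max_passages=3):
    """Combine the top retrieved passages and the question into one prompt."""
    selected = passages[:max_passages]
    # Number the passages so an answer can refer back to them.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder passages standing in for retrieved chunks.
retrieved = ["Passage one.", "Passage two.", "Passage three.", "Passage four."]
prompt = build_prompt("What is the process of making an appeal?", retrieved)
print(prompt)
```

The resulting string would then be sent to whichever LLM you chose; capping the number of passages keeps the prompt within the model's context limit.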

Here are the two ways you can check this out for yourself.

  1. Our Hosted Instance

  2. Running the CLI app yourself


Next steps

Data is powerful, but only if we can make use of it. With Unstructured, we’re excited to help enterprises exploit their internal data with LLMs. We’re continually adding to our natural language preprocessing capabilities and expanding the number of data connectors we support. No matter where your natural language data resides or what file type it’s contained in, Unstructured has got you covered.

Github repo → Clone the demo repo and connect to your own data source

Community Slack → Join our growing community

Hosted Instance → Chat with the IRS Manuals yourself

Example questions for Chat Your IRS Docs:

  • Are there regulations around email communication?

  • What is the difference between federal and state tax?

  • Who is the head of the IRS?

  • How are penalties determined for late filings?

  • Tell me about the Whistleblower Office

  • Tell me about Tax and Fingerprint Checks needed for experts

  • What is the process of making an appeal?

  • Tell me about Daily Delinquency Penalty

  • When are taxes owed?

  • Who has to pay taxes?

  • How do I process amended tax returns?

  • How do I investigate charitable contribution deductions?

  • What kinds of tax credits are there?

  • Do churches pay taxes?

  • Tell me about form 709
