Dec 6, 2024

Build an end-to-end RAG chatbot for your personal e-book collection using the Unstructured Platform and MongoDB

Tarun Narayanan

RAG

In this guide, we’ll walk you through building a Retrieval-Augmented Generation (RAG) application that serves as your personal AI Librarian, helping you explore your digital book collection. You'll learn how to construct an unstructured data ETL pipeline for EPUB files using the Unstructured platform, leverage MongoDB Atlas as a vector store and search index, and orchestrate the RAG workflow with LangChain. Additionally, you'll develop an intuitive and interactive user interface using Streamlit, powered by a local llama3.1 model through Ollama. By the end of this tutorial, you’ll have a fully functional RAG application capable of engaging with your digital library, delivering precise and instant responses.

Prerequisites: 

  • Unstructured Platform access; see the Unstructured Platform documentation for details.

  • MongoDB account, a MongoDB Atlas cluster, and your MongoDB connection string (URI). Check out our documentation to learn how to prepare your MongoDB collection as a destination.

Find the code in this GitHub repo to follow along. 

Switching from Serverless to Platform

If you’ve been using the Unstructured Serverless Ingest API, you’ll find all the features you know and love in a streamlined no-code interface. And if you haven’t tried Unstructured before, this is your chance to get up and running in minutes! Platform is a no-code evolution designed to empower even non-technical users with access to our world-class ETL solutions. Your data is processed seamlessly on Unstructured-hosted compute resources, removing the complexities of server management.

Unstructured Platform’s philosophy is simple: set it and forget it. Schedule your ETL pipelines, and let the Platform handle the rest. Plus, it comes packed with cutting-edge features, like image and table summaries, that enhance precision and elevate the performance of your RAG applications. We’re shipping new features fast, ensuring the Platform stays ahead of your needs. It’s a game-changer in simplicity and performance—are you ready to make the switch? Follow along in this blog post to see the steps to set up Platform to process all of your unstructured data files. 

Unstructured Data ETL Pipeline

Every Retrieval-Augmented Generation (RAG) application begins with data: a comprehensive, relevant, and up-to-date knowledge base to provide the chatbot with essential context. For this tutorial, we want our AI Librarian to access a personal book collection stored locally in the common EPUB format. If you don’t already have books to use, you can download more than 70,000 free digital books from Project Gutenberg.
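If you’d like to script this step, here’s a quick, hedged sketch that downloads a book from Project Gutenberg and uploads it to the S3 bucket our pipeline will read from. The Gutenberg URL pattern and the bucket name are assumptions to replace with your own:

import boto3
import requests

# Hypothetical example: book ID 84 is Frankenstein on Project Gutenberg;
# verify the URL pattern against the site before relying on it.
book_id = 84
url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.epub"

with open(f"pg{book_id}.epub", "wb") as f:
    f.write(requests.get(url, timeout=30).content)

# Upload to the S3 bucket the Unstructured source connector will point at
# ("my-ebooks-bucket" is a placeholder)
s3 = boto3.client("s3")
s3.upload_file(f"pg{book_id}.epub", "my-ebooks-bucket", f"books/pg{book_id}.epub")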

In this demo, the Unstructured Platform plays a critical role in preparing document data for ingestion into the knowledge base. Unstructured's robust parsing and data extraction capabilities allow it to process diverse document formats, such as EPUB, PDF, HTML, and plain text, and produce structured outputs that preserve the semantic relationships within the content. To prepare these books for use in the app, we’ll construct an ETL (Extract, Transform, Load) pipeline on the Platform. In just a few clicks, our pipeline will:

  1. Extract the content of the EPUB books into document elements.

  2. Transform the document elements by chunking them into manageable text segments (read more about chunking here).

  3. Load the transformed data by creating vector representations of the text chunks using an embedding model and storing them in a vector database for retrieval (see the sample record below).
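For reference, each record the pipeline loads into MongoDB looks roughly like the following. The text and embeddings field names match the retriever code later in this post; treat the remaining fields as an illustrative assumption, since exact metadata varies by document:

{
  "type": "CompositeElement",
  "text": "It was the best of times, it was the worst of times...",
  "embeddings": [0.0123, -0.0456, ...],
  "metadata": {
    "filename": "a-tale-of-two-cities.epub",
    "languages": ["eng"]
  }
}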

In this tutorial, MongoDB Atlas will serve as the vector store, housing the processed book data for efficient querying and retrieval. Let’s dive in and get started!

Step 1: Fetching documents from our source (S3) with the Unstructured Platform

We start by creating a data source so Unstructured can access our documents. On the left sidebar, select Connectors > Source > New and follow the setup instructions in our S3 source documentation.

Next, we specify the S3 credentials to establish the connection. 

Step 2: Setting up our Destination Connector following our MongoDB destination documentation 

Step 3: Constructing your workflow for document processing

Select Workflows > New Workflow > Build it with Me

Select the source and destination connectors that we just created:

Then select Custom to customize your workflow.

We are using the Fast strategy since we are working with EPUB documents.

Let’s quickly review the parameters in this workflow creation form and what they mean:

Strategy: 

  • Basic / Fast is ideal for simple, text-only documents.

  • Advanced / High Res is best for PDFs, images, and complex file types.

  • Platinum / VLM is for challenging documents, including scanned and handwritten content.

Image Summarizers:

  • Generates a description of the image content

  • Example output includes details like image type, description of visual elements, and relevant metadata

Table Summarizers:

  • Provides a textual summary of table contents

  • Example output includes information about table structure, data trends, and key insights

Chunking Types:

  • Basic combines sequential elements up to specified size limits. Oversized elements are split, while tables are isolated and divided if necessary. Overlap between chunks is optional.

  • By Title uses semantic chunking, understands the layout of the document, and makes intelligent splits.

  • By Page attempts to preserve page boundaries when determining the chunks’ contents.

  • By Similarity uses an embedding model to identify topically similar sequential elements and combines them into chunks.

Embed

  • The Unstructured Platform uses optional third-party embedding providers, such as OpenAI, to generate semantically relevant embeddings for the extracted text.
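To build intuition for what the chunking options produce, here’s a minimal local sketch using the open-source unstructured library. This is purely illustrative (the Platform handles all of this for you), and the file path is a placeholder:

# pip install "unstructured[epub]"
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.epub import partition_epub

# Extract document elements from a local EPUB file
elements = partition_epub(filename="pg84.epub")

# "By Title" chunking: makes semantic splits that respect section boundaries
chunks = chunk_by_title(elements, max_characters=1000, overlap=100)
for chunk in chunks[:3]:
    print(type(chunk).__name__, "|", chunk.text[:80])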

This is the recommended way of setting up a workflow: granularly tuning every setting based on your data. If you want something quicker, you can also select one of the preset options available on the Platform (Fast, Hi-Res, and VLM) to get started.

Finally, select the schedule at which you want to run this workflow. 

Our workflow is ready! 

On the Workflows page, you should find the workflow you just created (here, Workflow 7aeaf):

Simply hit "Run" to kick off the ETL workflow, and watch as your documents are processed automatically from start to finish.

You can monitor the progress of your workflow in the Jobs pane.

Now that the data is preprocessed and loaded, we can set up the retriever for the chatbot.

Setting up the RAG pipeline

MongoDB Retriever
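Before wiring up the retriever, your MongoDB collection needs an Atlas Vector Search index over the embeddings field. If you prepared the collection following the MongoDB destination documentation, you may already have one; otherwise, a minimal programmatic setup might look like this sketch. The dimension count is an assumption that must match your embedding model, and the index is named "default" because that’s what the LangChain vector store looks for unless you pass index_name:

import os

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.getenv("MONGODB_URI"))
collection = client["rag-e2e"]["books-coll-2"]

index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embeddings",  # field written by the pipeline
                "numDimensions": 384,  # assumes a MiniLM-style model; adjust
                "similarity": "cosine",
            }
        ]
    },
    name="default",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)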

LangChain integrates seamlessly with MongoDB, allowing you to create retrievers based on an existing Vector Search Index. Here’s how you can set up a LangChain retriever using the index we just created:

import os

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

def get_retriever():
    # Connect to the existing Atlas collection holding the processed book data
    vectorstore = MongoDBAtlasVectorSearch.from_connection_string(
        connection_string=os.getenv("MONGODB_URI"),
        namespace="rag-e2e.books-coll-2",  # database.collection
        embedding=HuggingFaceEmbeddings(model_name=os.getenv("EMBEDDING_MODEL")),
        text_key="text",             # field holding the chunk text
        embedding_key="embeddings",  # field holding the vector
    )
    # Return the top 6 most similar chunks for each query
    return vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

Important: Ensure you use the exact same embedding model here as you did when preprocessing your documents. Consistency is key to accurate retrieval results.
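As a quick sanity check, you can query the retriever directly before wiring it into a chain (the question here is just an example):

retriever = get_retriever()
for doc in retriever.invoke("Who is Captain Nemo?"):
    # Each result is a LangChain Document whose page_content holds the chunk text
    print(doc.page_content[:120])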

Building a LangChain-Powered RAG Application

Now, let’s bring it all together by setting up a RAG chain using LangChain to orchestrate the process.

For the language model, we’ll use Meta AI’s Llama 3.1 8B, served locally via Ollama.

Key steps include:

  1. Defining a System Prompt: The prompt will guide the model to act as a well-informed AI Librarian, leveraging context to provide accurate responses to user queries.

  2. Enabling Memory: This feature allows for multi-turn conversations, maintaining context across interactions.

  3. Contextualizing User Questions: We’ll incorporate a mechanism to rephrase user queries, integrating them with the conversation history for more informed answers.

With these components in place, you’ll have a responsive and context-aware RAG application ready to handle queries about your digital book collection!

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_ollama import ChatOllama

# Helpers referenced by the chain below; these are assumed implementations
# (the repo's versions may differ)
def format_docs(docs):
    # Join the retrieved chunks into a single context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

def get_question(data):
    # Accept either a raw question string or a dict produced upstream
    if isinstance(data, str):
        return data
    return data.get("question", "")

def get_chain():
    retriever = get_retriever()
    local_model = "llama3.1:8b"
    model = ChatOllama(model=local_model,
                       num_predict=500,
                       stop=["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"])
    system_prompt = """
    <|start_header_id|>user<|end_header_id|>
    You are a helpful and knowledgeable AI Librarian. Use the following context and the user's chat history to
    help the user. If you don't know the answer, just say that you don't know.
    Context: {context}
    Question: {question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """
    rag_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{question}"),
        ]
    )
    rag_chain = (
        {
            "context": RunnableLambda(get_question) | retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | rag_prompt
        | model
    )
    contextualize_q_system_prompt = """Given a chat history and the latest user question \
    which might reference context in the chat history, formulate a standalone question \
    which can be understood without the chat history. Do NOT answer the question, \
    just reformulate it if needed and otherwise return it as is."""
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder(variable_name="chat_history"),
            ("human", "{question}"),
        ]
    )
    # Rewrite the incoming question into a standalone one, then answer it with
    # the RAG chain (the original snippet is truncated here; this composition
    # is an assumption)
    contextualize_q_chain = contextualize_q_prompt | model | StrOutputParser()
    return contextualize_q_chain | rag_chain

Streamlit UI for Your Digital Library Assistant

The final step in building your app is to create an engaging and intuitive user interface using Streamlit.

Setting Up the App Title

To begin, set the browser tab’s page title and the app’s on-page title:

import streamlit as st

st.set_page_config(page_title="Personal Digital Library Assistant")
st.title("Personal Digital Library Assistant")

Designing the UI Function

We’ll define a function, show_ui, to manage the app's interface. This includes initializing session state, displaying chat messages, and handling user inputs.

def show_ui(qa, prompt_to_user="How may I help you?"):
    # Initialize session state to hold chat messages
    if "messages" not in st.session_state.keys():
        st.session_state.messages = [{"role": "assistant", "content": prompt_to_user}]
    
    # Display all existing chat messages
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.write(message["content"])
    
    # Capture user input
    if prompt := st.chat_input():
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.write(prompt)
    
    # Generate and display a response if the last message is from the user
    if st.session_state.messages[-1]["role"] != "assistant":
        with st.chat_message("assistant"):
            with st.spinner("Flipping pages..."):
                response = ask_question(qa, prompt)
                st.markdown(response.content)
        # Save assistant's response to the session state
        st.session_state.messages.append({"role": "assistant", "content": response.content})
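Finally, the bottom of streamlit-app.py can build the chain once and hand it to show_ui. Here’s a minimal sketch; the caching decorator is an optional nicety that avoids rebuilding the chain on every Streamlit rerun:

@st.cache_resource
def get_cached_chain():
    # Build the RAG chain once per process rather than on every rerun
    return get_chain()

def main():
    show_ui(get_cached_chain(), "How can I help you explore your library?")

if __name__ == "__main__":
    main()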

Running the App

To start the application, navigate to the folder containing your streamlit-app.py file in your terminal and run:

streamlit run streamlit-app.py

Trying It Locally

If you'd like to test the app on your own machine:

  1. Clone the repository from GitHub.

  2. Create a .env file in the project root directory and add your secret keys and configuration to it, as sketched below.
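For example, a .env along these lines would satisfy the snippets above. Both values are placeholders; EMBEDDING_MODEL must match the embedding model used in your Platform workflow:

MONGODB_URI="mongodb+srv://<username>:<password>@<cluster>.mongodb.net"
EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"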

With these steps, you’ll have a fully functional UI for your Personal Digital Library Assistant, ready to interact with users and provide valuable insights from your digital book collection!

If you want to try this with your own EPUB collection, check out our AI Librarian repository and sign up for Platform to get started!
