Apr 15, 2024
Supercharge RAG Performance Using OctoAI and Unstructured Embeddings
Pedro Torruella (OctoAI) & Ronny Hoesada
LLM
Introduction
The ability to extract valuable insights from text is paramount for businesses and researchers. This journey begins with understanding the power of embeddings, a crucial technique in Artificial Intelligence that transforms text into a format readily interpretable by machines. This blog post will equip you with a foundational understanding of text embeddings using two platforms:
OctoAI delivers a complete stack for app builders to run, tune, and scale their AI applications in either the cloud or on-prem. Their Text Generation solution hosts highly scalable and optimized LLMs and embedding models.
Unstructured is a powerful data management solution designed to handle unstructured data, which typically poses challenges for traditional data processing systems. By effectively managing and extracting value from unstructured data, Unstructured enables organizations to surface insights that would otherwise remain untapped, enhancing their decision-making capabilities.
Together, these tools provide a comprehensive solution for data analysis challenges, enabling users to derive meaningful insights from vast amounts of information. We will explore how integrating OctoAI's GTE-Large embedding model with Unstructured Embedding functionality can enhance data processing and RAG performance.
Understanding the OctoAI Text Embedding Model
Embeddings in Machine Learning and Natural Language Processing (NLP) refer to the representation of categorical or textual data as numerical vectors. These vectors capture the semantic relationships between words, phrases, or entire documents, allowing machine learning algorithms to process and understand language data more effectively.
OctoAI provides the General Text Embeddings (GTE) Large embedding model. Alibaba DAMO Academy trained the GTE models on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios. This makes the GTE models suitable for sophisticated NLP tasks such as information retrieval, semantic textual similarity, language translation, summarization, and text reranking, all of which require a deep understanding of language structure and meaning.
GTE-Large performs very well on the MTEB leaderboard, with an average score of 63.13% (comparable to OpenAI's text-embedding-ada-002, which scores 61.0%). The model is designed primarily for English text. Input is limited to a maximum of 512 tokens (longer text is truncated), and each input produces an embedding vector of 1024 dimensions. This makes OctoAI GTE-Large Embeddings a versatile tool for various NLP projects, from chatbots and virtual assistants to content analysis and recommendation systems.
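As a quick illustration, the snippet below sketches how you might request a GTE-Large embedding from OctoAI's OpenAI-compatible text endpoint. The endpoint URL and the thenlper/gte-large model identifier are assumptions based on OctoAI's documentation; verify them against your account before use.

```python
import os

import requests

# Minimal sketch: request a GTE-Large embedding from OctoAI.
# The endpoint URL and model identifier below are assumptions;
# check OctoAI's current documentation for the exact values.
response = requests.post(
    "https://text.octoai.run/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OCTOAI_API_KEY']}"},
    json={"model": "thenlper/gte-large", "input": "Embeddings turn text into vectors."},
)
vector = response.json()["data"][0]["embedding"]
print(len(vector))  # GTE-Large returns 1024-dimensional vectors
```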
Handling Unstructured Data
Unstructured provides a data ingestion and processing platform. Businesses deal with abundant unstructured data that does not have a predefined format or organization, making it challenging to process and analyze using traditional methods. Examples of unstructured and semi-structured data include text documents, tables, and images in various file formats, such as PDFs, Excel, or PowerPoint.
Unstructured stands out from other data management solutions due to its unique features and capabilities:
Advanced data processing: State-of-the-art algorithms and machine learning models to extract text and information from unstructured data, including optical character recognition (OCR), NLP techniques, and Transformer-based models.
Scalability: Built to handle large volumes of unstructured data via the Unstructured SaaS API and Enterprise Platform, making it suitable for enterprises and organizations with vast information.
Customization: Allows users to define custom data models and extraction rules, ensuring the platform can adapt to specific business needs and use cases.
Integration: Integrates easily with other data management systems, AI platforms, and analytics tools, enabling seamless data flow and facilitating end-to-end data processing pipelines.
What We Will Build
This article demonstrates how to use OctoAI GTE-Large Embeddings and Unstructured's OctoAI embedding integration in a RAG application, together with the Pinecone vector database and the MistralAI LLM. We will explore their key features, understand how they work, and examine their practical applications through code examples.
We develop three use cases as examples, ranging from basic text embedding with Unstructured and OctoAI to a complete RAG application capable of processing a PDF file and performing a RAG search over its content. We begin with a basic script demonstrating the use of OctoAI for embeddings with Unstructured. Then, we enhance this script to process a PDF file and generate the corresponding embeddings. Lastly, we build on the previous examples by uploading the embeddings to the Pinecone vector database, with its vector search capabilities, and utilizing MistralAI to execute the RAG functionality.
By the end of this article, you will be able to build a RAG application with the following workflow: partition a PDF with Unstructured, generate embeddings for the extracted elements with OctoAI GTE-Large, store the vectors in Pinecone, and answer user queries with MistralAI.
Code Walkthrough
You can follow the code below on this Google Colab. Before we start, you will need API keys for OctoAI, Unstructured, Pinecone, and MistralAI.
You can create a .env file to store the API keys with the following configuration:
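For example, a minimal .env file might look like this (the variable names are illustrative; use whatever names your code reads):

```
OCTOAI_API_KEY=<your OctoAI API key>
UNSTRUCTURED_API_KEY=<your Unstructured API key>
PINECONE_API_KEY=<your Pinecone API key>
MISTRAL_API_KEY=<your MistralAI API key>
```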
Use Case #1: Unstructured API with OctoAI GTE Embeddings for Simple Text Processing
This code demonstrates how to use the OctoAI embedding engine within the Unstructured library to generate embeddings for text elements:
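A minimal sketch of this script, reconstructed from the steps described below, might look like the following (the example sentences and the OCTOAI_API_KEY environment-variable name are illustrative assumptions):

```python
import os

from dotenv import load_dotenv
from unstructured.documents.elements import Text
from unstructured.embed.octoai import OctoAIEmbeddingEncoder, OctoAiEmbeddingConfig

load_dotenv()  # load API keys from the .env file


def embeddings_example():
    # Create the OctoAI embedding encoder with the API key from the configuration
    embedding_encoder = OctoAIEmbeddingEncoder(
        config=OctoAiEmbeddingConfig(api_key=os.getenv("OCTOAI_API_KEY"))
    )

    # Two example text elements to embed
    elements = [Text("Apples and oranges."), Text("The sky is blue.")]

    # Generate embeddings for the text elements
    return embedding_encoder.embed_documents(elements=elements)


if __name__ == "__main__":
    _embedded_elements = embeddings_example()
    # Print each embedding vector alongside its text element
    [print(e.embeddings, e) for e in _embedded_elements]
```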
Here's a step-by-step description of the code:
Import the necessary packages.
Define the embeddings_example function, which generates embeddings for the given text elements using the OctoAI embedding engine.
Inside the embeddings_example function:
Create an OctoAIEmbeddingEncoder instance using the OctoAiEmbeddingConfig class and the OctoAI API key retrieved from the application configuration.
Define two text elements as examples stored in the elements list.
Generate embeddings for the text elements using the embed_documents method of the OctoAIEmbeddingEncoder instance.
Return the embedded elements.
Check if the script runs as the main module (__name__ == "__main__"). If so, call the embeddings_example function and store the result in the _embedded_elements variable.
Print the embeddings and the corresponding text elements by iterating through the _embedded_elements list and using a list comprehension with the print function. The embeddings are NumPy arrays, while the text elements are represented as Text objects.
Use Case #2: Process a PDF Document with Unstructured API and OctoAI Embeddings
This code demonstrates how to use the Unstructured library in combination with the OctoAI embedding engine to process a PDF document, extract text elements, and generate embeddings for those elements:
This example uses the PDF file from the Unstructured GitHub examples folder at https://github.com/Unstructured-IO/unstructured/tree/main/example-docs.
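A sketch of this script, reconstructed from the steps described below, might look like the following (the layout-parser-paper-fast.pdf filename is an assumption based on the example-docs folder and the RAG output later in this post):

```python
import os

from dotenv import load_dotenv
from unstructured.documents.elements import ElementMetadata, Text
from unstructured.embed.octoai import OctoAIEmbeddingEncoder, OctoAiEmbeddingConfig
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

load_dotenv()  # load API keys from the .env file


def process_pdf():
    # Create the Unstructured client with the API key from the configuration
    client = UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"))

    # Define the PDF file to process and open it in binary mode
    filename = "layout-parser-paper-fast.pdf"
    with open(filename, "rb") as f:
        files = shared.Files(content=f.read(), file_name=filename)

    # Partition the document with the "fast" strategy
    req = shared.PartitionParameters(files=files, strategy="fast")
    res = client.general.partition(req)

    # Wrap text-bearing elements in Text objects, keeping file name and page number
    elements = []
    for element in res.elements:
        if element.get("text"):
            metadata = ElementMetadata(
                filename=element["metadata"]["filename"],
                page_number=element["metadata"]["page_number"],
            )
            elements.append(Text(text=element["text"], metadata=metadata))

    # Create the OctoAI embedding encoder and embed the text elements
    embedding_encoder = OctoAIEmbeddingEncoder(
        config=OctoAiEmbeddingConfig(api_key=os.getenv("OCTOAI_API_KEY"))
    )
    return embedding_encoder.embed_documents(elements=elements), embedding_encoder


if __name__ == "__main__":
    _embedded_elements, _embedding_encoder = process_pdf()
    for element in _embedded_elements:
        print(element.embeddings, element)
```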
Here's a step-by-step description of the code:
Import the necessary packages.
Define the process_pdf function, which processes a PDF document, extracts text elements, and generates embeddings for those elements using the OctoAI embedding engine.
Inside the process_pdf function:
Create an UnstructuredClient instance using the Unstructured API key retrieved from the application configuration.
Define the PDF file to process and open it in binary mode.
Define the partition parameters, specifying the file content, file name, and partitioning strategy.
Partition the document using the partition method of the UnstructuredClient instance.
Process the resulting elements, creating Text objects for elements containing text and storing them in the elements list along with their metadata (file name and page number).
Create an OctoAIEmbeddingEncoder instance using the OctoAiEmbeddingConfig class and the OctoAI API key retrieved from the application configuration.
Generate embeddings for the text elements using the embed_documents method of the OctoAIEmbeddingEncoder instance.
Return the embedded elements and the embedding encoder.
Check if the script runs as the main module (__name__ == "__main__"). If so, call the process_pdf function and store the result in the _embedded_elements and _embedding_encoder variables.
Print the embeddings and the corresponding text elements by iterating through the _embedded_elements list and using a for loop with the print function. The embeddings are NumPy arrays, while the text elements are represented as Text objects.
Use Case #3: Build a Full RAG Application
This code demonstrates how to use Pinecone, Mistral AI, and the previously processed PDF data to create a vector search index, generate contextually relevant responses to user queries, and interact with a large language model.
This example builds on the previous one and assumes you stored the previous code in a process_pdf.py file:
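A sketch of the full application, reconstructed from the steps described below, might look like this (the index name, serverless cloud and region, Mistral model name, top_k value, and example query are illustrative assumptions, as are the exact function signatures):

```python
import os
import time

from dotenv import load_dotenv
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
from pinecone import Pinecone, ServerlessSpec

from process_pdf import process_pdf

load_dotenv()  # load API keys from the .env file


def pinecone_index(embedded_elements):
    # Create a Pinecone client with the API key from the configuration
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

    # GTE-Large produces 1024-dimensional vectors
    index_name = "octoai-unstructured-rag"
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=1024,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )

    # Wait for the index to be initialized, then connect to it
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
    index = pc.Index(index_name)

    # Insert the embedded elements: embeddings, text, and metadata
    vectors = [
        {
            "id": str(i),
            "values": element.embeddings,
            "metadata": {
                "text": element.text,
                "filename": element.metadata.filename,
                "page_number": element.metadata.page_number,
            },
        }
        for i, element in enumerate(embedded_elements)
    ]
    index.upsert(vectors=vectors)
    return index


def rag(query, embedding_encoder, index):
    # Create a query embedding with the same encoder used for the documents
    query_embedding = embedding_encoder.embed_query(query)

    # Retrieve relevant contexts from Pinecone and format them as one string
    results = index.query(vector=query_embedding, top_k=3, include_metadata=True)
    contexts = "\n---\n".join(m["metadata"]["text"] for m in results["matches"])

    # Build a system prompt that grounds the model in the retrieved contexts
    system_prompt = f"Answer the question using only the context below.\n\nContext:\n{contexts}"

    # Execute the chat request with Mistral AI and return the answer
    client = MistralClient(api_key=os.getenv("MISTRAL_API_KEY"))
    response = client.chat(
        model="mistral-small-latest",
        messages=[
            ChatMessage(role="system", content=system_prompt),
            ChatMessage(role="user", content=query),
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    _embedded_elements, _embedding_encoder = process_pdf()
    _index = pinecone_index(_embedded_elements)
    _query = "What is the Layout Parser library used for?"
    print(rag(_query, _embedding_encoder, _index))
```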
Here's a step-by-step description of the code:
Import the necessary packages, including:
process_pdf: A function imported from the process_pdf module used to process a PDF document, extract text elements, and generate embeddings for those elements using the OctoAI embedding engine. Defined in the previous code example.
Define the pinecone_index function, which creates a Pinecone index, initializes it, and inserts the embedded elements into it.
Inside the pinecone_index function:
Create a Pinecone client instance using the Pinecone API key retrieved from the application configuration.
Create an index with the specified name, dimension, metric, and specification if it doesn't already exist.
Wait for the index to be initialized and connect to it.
Insert the embedded elements into the index, including their embeddings, text, and metadata.
Return the index.
Define the rag function, which takes a user query, an embedding encoder, and a Pinecone index as input and generates a contextually relevant response using Mistral AI.
Inside the rag function:
Create a query embedding using the embedding encoder.
Retrieve relevant contexts from the Pinecone index based on the query embedding.
Format the retrieved contexts into a single string.
Create a prompt that includes the contexts and the user query.
Prepare a chat request with Mistral AI, including the system prompt and the user query.
Execute the chat request using the Mistral AI client and retrieve the response.
Return the answer from the Mistral AI response.
Check if the script runs as the main module (__name__ == "__main__"). If so:
Call the process_pdf function to process the PDF document and generate embedded elements and an embedding encoder.
Create a Pinecone index using the pinecone_index function, embedded elements, and embedding encoder.
Define a user query.
Call the rag function with the user query, embedding encoder, and Pinecone index to generate a contextually relevant response.
Print the answer.
Let’s see the output of this RAG example:
The Layout Parser library is used as a unified toolkit for deep learning-based document image analysis and processing. It provides a set of simple and intuitive interfaces for applying and customizing DL models for tasks such as layout detection, character recognition, and other document processing tasks. The library is designed to streamline the usage of DL in DIA research and applications, making it accessible to a wide audience including researchers and industry professionals. It also includes comprehensive tools for efficient document image data annotation and model tuning to support different levels of customization.
Conclusion
The synergy between OctoAI's GTE-Large Embeddings and Unstructured unlocks new possibilities and improvements for advanced NLP tasks, including understanding and retrieving complex documents in a RAG application.
To take the first step towards transforming your data analysis processes and unleashing the full potential of your unstructured data, we encourage you to sign up for an OctoAI API Key and an Unstructured API Key today. We invite you to join the Unstructured Slack Community to collaborate with like-minded individuals, receive direct support, and stay informed about the latest advancements in this exciting field.