Apr 15, 2024

Supercharge RAG Performance Using OctoAI and Unstructured Embeddings

Pedro Torruella (Octo AI) & Ronny Hoesada

LLM

Introduction

The ability to extract valuable insights from text is paramount for businesses and researchers. This journey begins with understanding the power of embeddings, a crucial technique in Artificial Intelligence that transforms text into a format readily interpretable by machines. This blog post will equip you with a foundational understanding of text embeddings using two platforms:

  • OctoAI delivers a complete stack for app builders to run, tune, and scale their AI applications in either the cloud or on-prem. Their Text Generation solution hosts highly scalable and optimized LLMs and embedding models.

  • Unstructured is a data management solution designed to handle unstructured data, which typically poses challenges for traditional data processing systems. By managing and extracting value from unstructured data, Unstructured enables organizations to reach insights that would otherwise stay hidden, enhancing their decision-making capabilities.

Together, these tools provide a comprehensive solution for data analysis challenges, enabling users to derive meaningful insights from vast amounts of information. We will explore how integrating OctoAI's GTE-Large embedding model with Unstructured's embedding functionality can enhance data processing and RAG performance.

Understanding the OctoAI Text Embedding Model

Embeddings in Machine Learning and Natural Language Processing (NLP) are representations of categorical or textual data as numerical vectors. These vectors capture the semantic relationships between words, phrases, or entire documents, so that texts with similar meanings map to vectors that lie close together. This allows machine learning algorithms to process and understand language data more effectively.
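To make "close together" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors below are made up for illustration and are not output from GTE-Large; the helper name cosine_similarity is our own.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", invented for this example only
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

Real embeddings have many more dimensions, but the comparison works the same way. The Pinecone index created later in this article uses the same metric ("cosine").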

OctoAI provides the General Text Embeddings (GTE) Large embedding model. Alibaba DAMO Academy trained the GTE models on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios. This makes the GTE models suitable for various sophisticated NLP tasks, such as information retrieval, semantic textual similarity, language translation, summarization, and text reranking. All of these require a deep understanding of language structure and meaning.

GTE-Large performs very well on the MTEB leaderboard, with an average score of 63.13% (comparable to OpenAI’s text-embedding-ada-002, which scores 61.0%). This embedding model caters predominantly to English text input. Input text is limited to a maximum of 512 tokens (any text longer than that will be truncated), producing an embedding vector of 1024 dimensions. This makes OctoAI GTE Large Embeddings a versatile tool for various NLP projects, from chatbots and virtual assistants to content analysis and recommendation systems.
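Because anything beyond the 512-token limit is truncated, longer passages need to be split before they are embedded. The sketch below shows one simple way to do that. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: a real tokenizer usually produces more tokens than words, so in practice you may want a lower budget. The function name split_into_chunks is hypothetical.

```python
def split_into_chunks(text, max_tokens=512):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# A made-up 1,200-word input that would otherwise be truncated
long_text = "word " * 1200
chunks = split_into_chunks(long_text, max_tokens=512)
print([len(chunk.split()) for chunk in chunks])  # prints [512, 512, 176]
```

Each chunk can then be embedded on its own. In the PDF examples later in this article, Unstructured already partitions the document into smaller elements, which serves a similar purpose.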

Handling Unstructured Data

Unstructured provides a data ingestion and processing platform. Businesses deal with abundant unstructured data that does not have a predefined format or organization, making it challenging to process and analyze using traditional methods. Examples of unstructured and semi-structured data include text documents, tables, and images in various file formats, such as PDFs, Excel, or PowerPoint.

Unstructured stands out from other data management solutions due to its unique features and capabilities:

  • Advanced data processing: State-of-the-art algorithms and machine learning models extract text information from unstructured data. This includes optical character recognition (OCR), NLP techniques, and Transformer-based models.

  • Scalability: Built to handle large volumes of unstructured data via the Unstructured SaaS API and Enterprise Platform, making it suitable for enterprises and organizations with vast amounts of information.

  • Customization: Allows users to define custom data models and extraction rules, ensuring the platform can adapt to specific business needs and use cases.

  • Integration: Integrates easily with other data management systems, AI platforms, and analytics tools, enabling seamless data flow and facilitating end-to-end data processing pipelines.

What We Will Build

This article demonstrates how to work with OctoAI GTE Large Embeddings and the Unstructured OctoAI embedding encoder in a RAG application, together with the Pinecone vector database and a Mistral AI LLM. We will explore their key features, understand how they work, and examine their practical applications through code examples.

We develop three use cases as examples, from basic text embedding with Unstructured and OctoAI to a complete RAG application capable of processing a PDF file and performing a RAG search on its content. We begin with a basic script demonstrating the use of OctoAI for embeddings with Unstructured. Then, we enhance this script to process a PDF file and generate the corresponding embeddings. Lastly, we build upon the previous examples by uploading the embeddings to the Pinecone vector database, which provides vector search, and using Mistral AI to execute the RAG functionality.

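By the end of this article, you will be able to build a RAG application with the following workflow: partition a PDF file with Unstructured, embed the resulting text elements with OctoAI, store the embeddings in Pinecone, and answer user queries with Mistral AI using the contexts retrieved from Pinecone.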

Code Walkthrough

You can follow the code below on this Google Colab. Before we start, you will need API keys for OctoAI, Unstructured, Pinecone, and Mistral AI.

You can create a .env file to store the API keys, along with the Unstructured server URL that the PDF example reads, with the following configuration:

OCTOAI_API_KEY=<YOUR_OCTOAI_API_KEY>
UNSTRUCTURED_API_KEY=<YOUR_UNSTRUCTURED_API_KEY>
UNSTRUCTURED_SERVER_URL=<YOUR_UNSTRUCTURED_SERVER_URL>
PINECONE_API_KEY=<YOUR_PINECONE_API_KEY>
MISTRAL_API_KEY=<YOUR_MISTRAL_API_KEY>
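A missing key usually only surfaces as an error halfway through a run, so it can help to check the configuration up front. The following is a minimal sketch of such a check. The helper missing_keys and the REQUIRED_KEYS list are our own names, and the script inspects the process environment (os.environ), which is an assumption: the examples in this article read settings through python-decouple, which also picks up values from the .env file.

```python
import os

# The API key names used in the .env configuration above
REQUIRED_KEYS = [
    "OCTOAI_API_KEY",
    "UNSTRUCTURED_API_KEY",
    "PINECONE_API_KEY",
    "MISTRAL_API_KEY",
]


def missing_keys(settings, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty in settings."""
    return [name for name in required if not settings.get(name)]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present")
```

Because missing_keys accepts any mapping, the same function can be pointed at a dictionary loaded from another configuration source.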

Use Case #1: Unstructured API with OctoAI GTE Embeddings for Simple Text Processing

This code demonstrates how to use the OctoAI embedding engine within the Unstructured library to generate embeddings for text elements:

# Import the necessary packages  
from decouple import config
from unstructured.documents.elements import Text
from unstructured.embed.octoai import OctoAiEmbeddingConfig, OctoAIEmbeddingEncoder

# Define the embeddings_example function
def embeddings_example():   
  # Create an OctoAIEmbeddingEncoder   
  print("Creating OctoAIEmbeddingEncoder")   
  embedding_encoder = OctoAIEmbeddingEncoder(       
    config=OctoAiEmbeddingConfig(api_key=config("OCTOAI_API_KEY"))   
  )   
  # Define the elements to embed, here we have two sentences   
  elements = [Text("This is sentence 1"), Text("This is sentence 2")]   
  # Embed the elements   
  print("Embedding the elements")   
  embedded_elements = embedding_encoder.embed_documents(       
    elements=elements   
  )   
  # Return the embedded elements   
  return embedded_elements
    
if __name__ == "__main__":
  _embedded_elements = embeddings_example()   
  # Print the embedded elements   
  [print(e.embeddings, e, "\n") for e in _embedded_elements]

Here's a step-by-step description of the code:

  • Import the necessary packages.

  • Define the embeddings_example function, which generates embeddings for the given text elements using the OctoAI embedding engine.

  • Inside the embeddings_example function:

    • Create an OctoAIEmbeddingEncoder instance using the OctoAiEmbeddingConfig class and the OctoAI API key retrieved from the application configuration.

    • Define two text elements as examples stored in the elements list.

    • Generate embeddings for the text elements using the embed_documents method of the OctoAIEmbeddingEncoder instance.

    • Return the embedded elements.

  • Check if the script runs as the main module (__name__ == "__main__"). If so, call the embeddings_example function and store the result in the _embedded_elements variable.

  • Print the embeddings and the corresponding text elements by iterating through the _embedded_elements list and using a list comprehension with the print function. Each embedding is a list of floats, while the text elements are represented as Text objects.

Use Case #2: Process a PDF Document with Unstructured API and OctoAI Embeddings

This code demonstrates how to use the Unstructured library in combination with the OctoAI embedding engine to process a PDF document, extract text elements, and generate embeddings for those elements:

# Import the necessary packages
from decouple import config
from unstructured.documents.elements import Text
from unstructured.embed.octoai import OctoAiEmbeddingConfig, OctoAIEmbeddingEncoder
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
  
# Process the PDF function
def process_pdf():
  # Create an UnstructuredClient
  print("Creating UnstructuredClient")
  client = UnstructuredClient(api_key_auth=config("UNSTRUCTURED_API_KEY"),
                              server_url=config("UNSTRUCTURED_SERVER_URL")
                             )

  # Define the file to process and read it in binary mode;
  # the with block closes the file handle once the content is read
  filename = "layout-parser-paper-fast.pdf"
  with open(filename, "rb") as file:
    file_content = file.read()
    
  # Define the partition parameters
  print("Defining the partition parameters")
  request = shared.PartitionParameters(
    # Note that this currently only supports a single file
    files=shared.Files(
      content=file_content,
      file_name=filename,
    ),
    # Other partition params
    strategy="fast",
  )
      
  # Partition the document
  print("Partitioning the document")
  result = client.general.partition(request)
    
  # Process the resulting elements
  print("Processing the resulting elements")
    
  elements = []
  
  for element in result.elements:
    if element['text']:
      text = Text(text=element['text'])
      text.metadata.filename = element['metadata']['filename']
      text.metadata.page_number = element['metadata']['page_number']
      elements.append(text)
        
  # Create an OctoAIEmbeddingEncoder
  print("Creating OctoAIEmbeddingEncoder")
  embedding_encoder = OctoAIEmbeddingEncoder(
    config=OctoAiEmbeddingConfig(api_key=config("OCTOAI_API_KEY"))
  )
    
  # Embed the elements
  print("Embedding the elements")
  embedded_elements = embedding_encoder.embed_documents(
    elements=elements
  )
    
  # Return the embedded elements and the embedding encoder
  return embedded_elements, embedding_encoder
    
if __name__ == "__main__":
  _embedded_elements, _embedding_encoder = process_pdf()
  # Print the embedded elements
  for e in _embedded_elements:
    print(e.embeddings, e, "\n")

This example uses the PDF file from the Unstructured GitHub examples folder at https://github.com/Unstructured-IO/unstructured/tree/main/example-docs.

Here's a step-by-step description of the code:

  • Import the necessary packages.

  • Define the process_pdf function, which processes a PDF document, extracts text elements, and generates embeddings for those elements using the OctoAI embedding engine.

  • Inside the process_pdf function:

    • Create an UnstructuredClient instance using the Unstructured API key retrieved from the application configuration.

    • Define the PDF file to process and open it in binary mode.

    • Define the partition parameters, specifying the file content, file name, and partitioning strategy.

    • Partition the document using the partition method of the UnstructuredClient instance.

    • Process the resulting elements, creating Text objects for elements containing text and storing them in the elements list along with their metadata (file name and page number).

    • Create an OctoAIEmbeddingEncoder instance using the OctoAiEmbeddingConfig class and the OctoAI API key retrieved from the application configuration.

    • Generate embeddings for the text elements using the embed_documents method of the OctoAIEmbeddingEncoder instance.

    • Return the embedded elements and the embedding encoder.

  • Check if the script runs as the main module (__name__ == "__main__"). If so, call the process_pdf function and store the result in the _embedded_elements and _embedding_encoder variables.

  • Print the embeddings and the corresponding text elements by iterating through the _embedded_elements list and using a for loop with the print function. Each embedding is a list of floats, while the text elements are represented as Text objects.
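The loop in process_pdf reads element['metadata']['page_number'] directly, which raises a KeyError if that field is absent. Whether the field is always present depends on the file type, so the sketch below shows a more defensive way to pull out the same fields. The helper to_records and the sample data are hypothetical; the dictionaries only mimic the shape of the partition results used above.

```python
def to_records(raw_elements, default_page=None):
    """Keep elements with non-empty text and flatten the fields used later."""
    records = []
    for element in raw_elements:
        text = (element.get("text") or "").strip()
        if not text:
            continue  # skip elements without usable text
        metadata = element.get("metadata") or {}
        records.append({
            "text": text,
            "filename": metadata.get("filename"),
            # Fall back to default_page when the page number is missing
            "page_number": metadata.get("page_number", default_page),
        })
    return records


# Made-up partition output, for illustration only
sample = [
    {"type": "Title", "text": "LayoutParser",
     "metadata": {"filename": "paper.pdf", "page_number": 1}},
    {"type": "Image", "text": "",
     "metadata": {"filename": "paper.pdf", "page_number": 1}},
    {"type": "NarrativeText", "text": "A unified toolkit.",
     "metadata": {"filename": "paper.pdf"}},
]

for record in to_records(sample):
    print(record)
```

The element with empty text is dropped, and the element without a page number is kept with page_number set to None instead of stopping the run.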

Use Case #3: Build a Full RAG Application

This code demonstrates how to use Pinecone, Mistral AI, and the previously processed PDF data to create a vector search index, generate contextually relevant responses to user queries, and interact with a large language model.

This example builds on the previous one and assumes you stored the previous code in a process_pdf.py file:

# Import the necessary packages
import time
from decouple import config
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
from pinecone import Pinecone, ServerlessSpec, PodSpec
from process_pdf import process_pdf
  
# Define the pinecone_index function
def pinecone_index(embedded_elements, embedding_encoder):
  # Create a Pinecone client
  print("Creating Pinecone client")
  pc = Pinecone(api_key=config("PINECONE_API_KEY"))
    
  # Create an index
  print("Creating index")
  index_name = "unstructured"
    
  if index_name not in pc.list_indexes().names():
    pc.create_index(
      name=index_name,
      dimension=embedding_encoder.num_of_dimensions()[0],
      metric="cosine",
      spec=PodSpec(environment="gcp-starter")
    )
      
  # Wait for index to be initialized
  print("Waiting for index to be initialized")
    
  while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
      
  # Connect to index
  print("Connecting to index")
  index = pc.Index(index_name)
    
  # Insert the embedded elements into the index
  print("Inserting embedded elements into the index")
    
  for e in embedded_elements:
    index.upsert(
      vectors=[
        {
          "id": e.id,
          "values": e.embeddings,
          "metadata": dict({'text': e.text}, **e.metadata.to_dict())  # Add text and metadata
         }
       ]
    )

  # Return the index
  return index
    
# Define the RAG function
def rag(query, embedding_encoder, index):
  # Create query embedding
  print("Creating query embedding")
  query_embedding = embedding_encoder.embed_query(query=query)
    
  # Get relevant contexts from Pinecone
  print("Getting relevant contexts from Pinecone")
  search_result = index.query(vector=query_embedding, top_k=10, include_metadata=True)
    
  # Get a list of retrieved texts
  contexts = [x['metadata']['text'] for x in search_result['matches']]
  context_str = "\n".join(contexts)
    
  # Place contexts into RAG prompt
  prompt = f"""You are a helpful assistant, below is a query from a user and
    some relevant contexts. Answer the question given the information in those
    contexts. If you cannot find the answer to the question, say "I don't know".
    
    Contexts:   {context_str}
    """
    
  # Prepare the chat request with Mistral AI, with the system prompt and the user query
  print("Preparing the chat request")
  model = "open-mixtral-8x7b"
  client = MistralClient(api_key=config("MISTRAL_API_KEY"))
  messages = [
    ChatMessage(role="system", content=prompt),
    ChatMessage(role="user", content=f"""Query: {query} Answer: """)
  ]
    
  # Execute the chat request
  print("Executing the chat request")
  chat_response = client.chat(
    model=model,
    messages=messages,
  )
    
  # Return the answer
  return chat_response.choices[0].message.content
    
if __name__ == "__main__":
  _embedded_elements, _embedding_encoder = process_pdf()
  _index = pinecone_index(_embedded_elements, _embedding_encoder)
  _query = "What is the Layout Parser library used for?"
  _answer = rag(_query, _embedding_encoder, _index)
  print(_answer)

Here's a step-by-step description of the code:

  • Import the necessary packages, including process_pdf from the process_pdf module defined in the previous code example. This function processes a PDF document, extracts text elements, and generates embeddings for those elements using the OctoAI embedding engine.

  • Define the pinecone_index function, which creates a Pinecone index, initializes it, and inserts the embedded elements into it.

  • Inside the pinecone_index function:

    • Create a Pinecone client instance using the Pinecone API key retrieved from the application configuration.

    • Create an index with the specified name, dimension, metric, and specification if it doesn't already exist.

    • Wait for the index to be initialized and connect to it.

    • Insert the embedded elements into the index, including their embeddings, text, and metadata.

    • Return the index.

  • Define the rag function, which takes a user query, an embedding encoder, and a Pinecone index as input and generates a contextually relevant response using Mistral AI.

  • Inside the rag function:

    • Create a query embedding using the embedding encoder.

    • Retrieve relevant contexts from the Pinecone index based on the query embedding.

    • Format the retrieved contexts into a single string.

    • Create a system prompt that includes the instructions and the retrieved contexts. The user query is sent separately in the user message.

    • Prepare a chat request with Mistral AI, including the system prompt and the user query.

    • Execute the chat request using the Mistral AI client and retrieve the response.

    • Return the answer from the Mistral AI response.

  • Check if the script runs as the main module (__name__ == "__main__"). If so:

    • Call the process_pdf function to process the PDF document and generate embedded elements and an embedding encoder.

    • Create a Pinecone index using the pinecone_index function, embedded elements, and embedding encoder.

    • Define a user query.

    • Call the rag function with the user query, embedding encoder, and Pinecone index to generate a contextually relevant response.

    • Print the answer.
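The pinecone_index function above sends one upsert request per element, which is fine for a short PDF but slow for larger documents. A common alternative is to group vectors into batches and send one request per batch. The sketch below shows only the grouping step; the helper name batched and the batch size of 100 are assumptions, not values taken from the Pinecone documentation.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size entries each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up vectors in the same dictionary shape that pinecone_index upserts
vectors = [
    {"id": f"element-{i}", "values": [0.0, 0.0], "metadata": {}}
    for i in range(250)
]

for batch in batched(vectors, batch_size=100):
    # In pinecone_index this is where index.upsert(vectors=batch) would go
    print(len(batch))
```

With 250 vectors and a batch size of 100, the loop runs three times with 100, 100, and 50 vectors.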

Let’s see the output of this RAG example:

The Layout Parser library is used as a unified toolkit for deep learning-based document image analysis and processing. It provides a set of simple and intuitive interfaces for applying and customizing DL models for tasks such as layout detection, character recognition, and other document processing tasks. The library is designed to streamline the usage of DL in DIA research and applications, making it accessible to a wide audience including researchers and industry professionals. It also includes comprehensive tools for efficient document image data annotation and model tuning to support different levels of customization.

Conclusion

The combination of OctoAI's GTE Large Embeddings and Unstructured unlocks new possibilities and improvements for advanced NLP tasks, including understanding and retrieving complex documents for a RAG application.

To take the first step towards transforming your data analysis processes and unleashing the full potential of your unstructured data, we encourage you to sign up for an OctoAI API Key and an Unstructured API Key today. We invite you to join the Unstructured Slack Community to collaborate with like-minded individuals, receive direct support, and stay informed about the latest advancements in this exciting field.
