Meta-Prompt for Seamless Unstructured Ingest Pipeline Integration
Nov 24, 2024

Authors

Nina Lopatina
Developer Relations Engineer, Unstructured

Tarun Narayanan
DevRel Engineer, Unstructured

Effortlessly managing and processing unstructured data is at the heart of modern data pipelines. The Unstructured Ingest Pipeline simplifies this by offering a robust framework for indexing, filtering, partitioning, and embedding unstructured data. But what if you could supercharge your workflows with the precision and adaptability of an LLM? Introducing the Meta-Prompt for the Unstructured Ingest Pipeline: your go-to guide for generating production-ready, API-compatible code and CLI commands with ease.

Let’s explore how Meta-Prompt unlocks the full potential of the Unstructured Ingest Pipeline, offering developers a streamlined way to tackle complex data processing tasks. You can get started with our Serverless API with a two-week free trial.

What is a meta-prompt?

Our personalized Meta-Prompt acts as an AI-engineer-in-a-box, guiding you through the entire Unstructured Ingest Pipeline. Whether you're working with local files, cloud storage, or specialized connectors like Azure or S3, Meta-Prompt ensures accurate, efficient, and production-ready implementations. 

Key features include:

  1. Built-in API Context: Avoid hallucinations and ensure LLMs use the correct API calls and settings.
  2. Simplified Implementations: Generate code and CLI commands tailored to your use case, avoiding over-engineered solutions.
  3. No Placeholder Data: Always generates actionable, real-world-ready code.
  4. Connector-Aware Design: Automatically aligns with your source and destination connectors for seamless integration.
  5. Unstructured Ingest Compliance: Leverages all core stages, from indexing to embedding.

Example Use Cases:

  • Local file ingestion with filtering and chunking for structured analysis.
  • Indexing and partitioning PDFs from S3 for downstream RAG systems (see the sketch after this list).
  • Integrating Unstructured into your codebase.
  • Writing a workflow to process all of your documents in Azure Blob Storage, classifying the output to add additional metadata, and then writing a new workflow to embed those documents in your destination of choice.
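
To make the S3 use case concrete, here is a minimal sketch of the kind of code Meta-Prompt produces for it, written against the unstructured-ingest v2 Python API. The bucket URL and environment variable names are placeholders to adapt to your environment:

```python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.fsspec.s3 import (
    S3AccessConfig,
    S3ConnectionConfig,
    S3DownloaderConfig,
    S3IndexerConfig,
)
from unstructured_ingest.v2.processes.connectors.local import LocalUploaderConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Index and download PDFs from the bucket (URL is a placeholder).
        indexer_config=S3IndexerConfig(remote_url="s3://my-bucket/pdfs/"),
        downloader_config=S3DownloaderConfig(),
        source_connection_config=S3ConnectionConfig(
            access_config=S3AccessConfig(
                key=os.getenv("AWS_ACCESS_KEY_ID"),
                secret=os.getenv("AWS_SECRET_ACCESS_KEY"),
            )
        ),
        # Partition remotely via the Unstructured Serverless API.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
        ),
        # Write the partitioned JSON locally for the downstream RAG system.
        uploader_config=LocalUploaderConfig(output_dir="s3-output"),
    ).run()
```

Swapping sources is just a matter of exchanging the indexer, downloader, and connection configs; the rest of the pipeline stays the same, which is what the connector-aware design above refers to.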

Meta-Prompt in Action

How to use the Meta-Prompt: first, open your LLM code assistant of choice. We have a full version in our repo that can be used directly with most frontier models (GPT, Claude 3, etc.).

You can copy this prompt for models with a longer context window. If you are using the free version of ChatGPT, or a custom open-source model, we also provide a split version of the Meta-Prompt that can be added one message at a time, keeping each message within the model’s context length limitations. Each part is under 5k tokens; to use this version, paste the three prompts in a row.

Once you have pasted in the Meta-Prompt to start, you can try one of our examples below as a follow-up prompt that will have you coding with Unstructured’s Serverless API in no time. Note that we were not able to fit every connector into the Meta-Prompt, so check out our documentation if your prompt doesn’t work.

Scenario 1: Local File Processing

You have a directory of PDFs and want to partition them using the high-resolution strategy, then generate embeddings (in this run, OpenAI’s ada-002 model). With Meta-Prompt, you can prompt an LLM to generate this code in seconds. In these examples, we are using Claude; we simply described this task as our next prompt after the Meta-Prompt.


Response:

Here's the complete code for a Google Colab notebook that securely handles credentials and processes files from "content" to "output" using OpenAI ada-002 embeddings:

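The generated notebook followed the standard unstructured-ingest pipeline pattern. Here is a minimal sketch along those lines (not Claude’s verbatim output; the embedding provider string and environment variable names are our assumptions and may need adjusting for your installed version):

```python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalConnectionConfig,
    LocalDownloaderConfig,
    LocalIndexerConfig,
    LocalUploaderConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Read every file in the Colab "content/" directory.
        indexer_config=LocalIndexerConfig(input_path="content/"),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Partition via the Unstructured Serverless API with the hi_res strategy.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
        ),
        # Embed each element with OpenAI ada-002 (provider string may vary by version).
        embedder_config=EmbedderConfig(
            embedding_provider="openai",
            embedding_model_name="text-embedding-ada-002",
            embedding_api_key=os.getenv("OPENAI_API_KEY"),
        ),
        # Write the enriched element JSON to "output".
        uploader_config=LocalUploaderConfig(output_dir="output"),
    ).run()
```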

Note that code generation can still be a little buggy: Claude likes to trail off in the last line, so we deleted that one. We also needed to add a “/” after the content directory, and running the code surfaced a few errors that prompted a few more pip installs. Here is the notebook we successfully ran to process a file, with the tweaks we noted. It may take several prompts to tweak the output you are seeking, and/or to debug the generated code.

Scenario 2: Full RAG Pipeline

Now we will attempt to write a full RAG pipeline starting with Unstructured, again with a single follow-up prompt describing the task.


Here is the output (after reminding Claude to continue following a mid-code trail-off):

Response:

Yes, I'll continue with the RAG implementation. Here's the complete code:

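A condensed sketch of the generated pipeline, using LangChain with FAISS for retrieval and Llama 2 for generation (not Claude’s verbatim output; the model IDs, the "output" directory layout, and the helper details are our assumptions):

```python
import json
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document


def load_documents(output_dir: str = "output") -> list[Document]:
    # The ingest pipeline writes one JSON list of elements per input file.
    docs = []
    for path in Path(output_dir).glob("*.json"):
        for element in json.loads(path.read_text()):
            if element.get("text"):
                docs.append(Document(page_content=element["text"],
                                     metadata={"source": str(path)}))
    return docs


def build_vector_store(docs: list[Document]) -> FAISS:
    # Embed the elements and index them in FAISS (embedding model is our choice).
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    return FAISS.from_documents(docs, embeddings)


def setup_llm() -> HuggingFacePipeline:
    # Llama 2 chat for generation; assumes access to the gated HF weights.
    return HuggingFacePipeline.from_model_id(
        model_id="meta-llama/Llama-2-7b-chat-hf",
        task="text-generation",
        pipeline_kwargs={"max_new_tokens": 256},
    )


def create_rag_chain(vector_store: FAISS, llm: HuggingFacePipeline) -> RetrievalQA:
    # Combine the retriever and LLM into a simple RetrievalQA chain.
    return RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())


if __name__ == "__main__":
    # Interactive query loop over the ingested documents.
    chain = create_rag_chain(build_vector_store(load_documents()), setup_llm())
    while True:
        query = input("Question (blank to quit): ").strip()
        if not query:
            break
        print(chain.invoke({"query": query})["result"])
```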

This completes the implementation. Here's what each part does:

  1. build_vector_store(): Creates a FAISS vector store from the processed documents
  2. setup_llm(): Initializes the Llama 2 model for generation
  3. create_rag_chain(): Combines the vector store and LLM into a RAG chain
  4. The main execution block sets everything up and provides an interactive query interface

__________

We did need to debug this one as well, pasting error messages back in and adding more pip installs as subsequent errors surfaced. The final notebook we ran with those tweaks is here. With a few tweaks, the pipeline ran, but it was not well optimized for speed. Combining additional meta-prompts and tweaking the prompt might yield better results, but this is a great start: a complete RAG pipeline in two prompts!

We also have an example of our Meta-Prompt in action with ChatGPT.

Are you ready to unlock the full potential of your data workflows with RAG? Explore more in the Unstructured Documentation.