Mar 24, 2025

Building a No-Code Data Preprocessing Pipeline with Firecrawl and Unstructured MCP

Ajay Krishnan

RAG

The Model Context Protocol (MCP) came out in November 2024, and lately it's been getting a lot more attention. Anthropic announced MCP as a new standard for connecting AI assistants to systems like content repositories, business tools, and development environments.

The architecture is straightforward: developers can either expose their functionality through MCP servers or build AI applications (MCP clients) that connect to these servers. Applications like Claude Desktop can then integrate an MCP client and leverage the functionality of an MCP server.

In this blog post, we’ll show you how to use the Unstructured MCP integration, which recently added Firecrawl support, to pull data from a website, then partition, enrich, and chunk it to make it searchable, all without writing any code.

Here’s what we can do with Unstructured MCP:

1) Crawl a website with Firecrawl.

2) Store the crawled data on Amazon S3.

3) Use the Unstructured API to process data from an S3 bucket into Astra DB.

When done, you’ll be able to easily query the indexed data and ask questions via RAG. 

Let’s get started!

Step 1: Set up the MCP Server locally

Clone the UNS-MCP repo. Create a .env file and add these keys to your environment (you can also find an example .env.template here):

UNSTRUCTURED_API_KEY="<your-key>"
FIRECRAWL_API_KEY="<your-key>"
ASTRA_DB_APPLICATION_TOKEN="<astra-token>"
ASTRA_DB_API_ENDPOINT="<astra-endpoint>"

Configure Claude Desktop to discover your MCP server.

1) Go to ~/Library/Application Support/Claude/ and create a claude_desktop_config.json.

2) In that file, add the following (example here):

{
    "mcpServers":
    {
        "UNS_MCP":
        {
            "command": "ABSOLUTE/PATH/TO/.local/bin/uv",
            "args":
            [
                "--directory",
                "ABSOLUTE/PATH/TO/UNS-MCP",
                "run",
                "server.py"
            ],
            "disabled": false
        }
    }
}

Restart Claude Desktop. 

Once your environment is set and claude_desktop_config.json is configured, you can begin interacting with the newly added MCP server.

Step 2: Crawl a Website with Firecrawl

Simply ask Claude in plain English to start crawling your website, for example, https://modelcontextprotocol.io/. Firecrawl handles everything automatically. Here, even though we requested 50 URLs, it smartly fetched only the 23 pages that actually existed.
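A prompt along these lines does the trick (the wording is flexible, and the 50-URL limit is just an example):

"Use Firecrawl to crawl https://modelcontextprotocol.io/ and store the results in my S3 bucket. Limit the crawl to 50 URLs."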

Based on the docstrings provided for that particular function, you may be prompted to provide values for other parameters.

Since the conversation context is maintained, Claude can easily pick up the right crawl job ID on its own when you ask a follow-up question, such as checking the status of your crawl job.
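For instance, a follow-up as simple as this is enough:

"What's the status of that crawl job?"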

Step 3: Set up Data Connectors for Data Processing

Now we can move on to the next step: using the Unstructured API capabilities via MCP. Most other MCP servers tend to connect to individual data sources. Unstructured, on the other hand, has multiple source and destination connectors built in, and the Unstructured MCP server is rapidly catching up with all the source and destination connectors that the Unstructured Platform supports. Learn more about supported sources here and destinations here.

For this demo, let's create a source connector for the S3 bucket where we've stored our crawled data, and a destination connector for Astra DB to leverage its vector search capabilities.
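Illustrative prompts for both connectors (the bucket path is hypothetical; use your own):

"Create an S3 source connector for s3://my-bucket/crawled-data/, and an Astra DB destination connector using the token and endpoint from my environment."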

Step 4: Create and Run a Data Processing Workflow

Prompt Claude to create and run a workflow for you with Unstructured, and it will handle the rest.
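For example (referring to whichever connectors you created in the previous step):

"Create a workflow that ingests data from my S3 source connector into my Astra DB destination connector, then run it."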

See how, after I set up the source and destination connectors, Claude suggested using them for the workflow, which was exactly what I needed! There were also times when Claude didn't know the correct format for a workflow; when that happened, it searched for existing workflows and derived a working configuration from one all by itself.

Now that the workflow is running, you can log in to the Unstructured Platform to see the most recent job for that workflow.

Or you can ask Claude to do this for you:
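For example:

"What's the status of the most recent job for my workflow?"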

Step 5: RAG over Crawled and Processed Data

The data has been neatly ingested into Astra DB with practically zero coding effort, all through natural language! Let's set up a simple RAG application to learn about MCP (remember, we crawled the MCP documentation 😉).

You can find the full example in this notebook; let's walk through the snippets here.

!pip install --upgrade -q astrapy

astrapy is our only dependency; we need it to access Astra DB.

Next, we set up the keys for Astra DB and OpenAI for the RAG application.

import os
from google.colab import userdata

os.environ["ASTRA_DB_APPLICATION_TOKEN"] = userdata.get("ASTRA_DB_APPLICATION_TOKEN")
os.environ["ASTRA_DB_API_ENDPOINT"] = userdata.get("ASTRA_DB_API_ENDPOINT")
os.environ["ASTRA_DB_COLLECTION_NAME"] = userdata.get("ASTRA_DB_COLLECTION_NAME")
os.environ["ASTRA_DB_KEYSPACE"] = userdata.get("ASTRA_DB_KEYSPACE")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

We need a function to connect to the DB and a function to fetch the embedding for a query. 

from IPython.display import Markdown, display
from astrapy import DataAPIClient
from openai import OpenAI

OPENAI_CLIENT = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
EMBEDDING_MODEL = "text-embedding-3-large"
GENERATION_MODEL = "gpt-4o-2024-11-20"


def get_collection(collection_name: str, keyspace: str):
    """
    Establish a connection to Astra DB and return the specified collection.

    Args:
        collection_name (str): Name of the collection to retrieve
        keyspace (str): Database keyspace

    Returns:
        Collection object from Astra DB
    """
    astra_client = DataAPIClient(os.getenv("ASTRA_DB_APPLICATION_TOKEN"))
    database = astra_client.get_database(os.getenv("ASTRA_DB_API_ENDPOINT"))
    astradb_collection = database.get_collection(name=collection_name,
                                                 keyspace=keyspace)
    print(f"Collection: {astradb_collection.full_name}\n")
    return astradb_collection


def get_embedding(text: str):
    """
    Generate an embedding for the given text using OpenAI's embedding model.

    Args:
        text (str): Input text to embed

    Returns:
        Embedding vector for the input text
    """
    return OPENAI_CLIENT.embeddings.create(
        input=text, model=EMBEDDING_MODEL
    ).data[0].embedding


COLLECTION = get_collection(os.getenv("ASTRA_DB_COLLECTION_NAME"), os.getenv("ASTRA_DB_KEYSPACE"))

Let's go ahead with a simple retriever: it takes the user query, embeds it, and runs a vector similarity search to retrieve the top 5 documents by default.

Our RAG workflow receives a user query, fetches similar documents from Astra DB, and uses those documents as context to answer the user's question. The snippet below also defines a minimal build_debugging_prompt helper (a reconstruction, since the post doesn't show the original; the notebook's version may differ).

def build_debugging_prompt(question: str, retrieved_docs: str) -> str:
    """Assemble a grounded prompt from the retrieved context and the user question.

    Minimal reconstruction; the notebook's version may differ.
    """
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"===== Context =====\n{retrieved_docs}\n\n"
        f"===== Question =====\n{question}"
    )


def simple_retriever(query: str, n=5):
    """Retrieve the top-N most relevant documents from your vector DB"""
    query_embedding = get_embedding(query)
    results = COLLECTION.find(sort={"$vector": query_embedding}, limit=n)
    docs = [doc["content"] for doc in results]
    return "\n".join(
        [f"\n\n===== Document {i+1} =====\n{doc}" for i, doc in enumerate(docs)]
    )


def vanilla_rag(question: str) -> str:
    """
    Generate a structured, grounded answer using retrieved documents and the user query.
    """
    retrieved_docs = simple_retriever(question)
    # Build the grounded, strict reasoning/debugging prompt
    final_prompt = build_debugging_prompt(question, retrieved_docs)
    # Query the model
    response = OPENAI_CLIENT.chat.completions.create(
        model=GENERATION_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers software engineering questions."},
            {"role": "user", "content": final_prompt}
        ]
    )
    return response.choices[0].message.content

Now that we’ve set up a basic RAG, let’s start asking it questions. 

query = "Explain MCP to me"
explanation = vanilla_rag(query)
display(Markdown(explanation))

Notice how the Unstructured data transformation workflow made sure that image content was also indexed, by converting the graph into natural language that could be easily chunked and embedded.

Conclusion

MCP is a new, simple, and useful protocol that streamlines data workflows without unnecessary complexity. 

If you followed this blog post end to end, congrats! You've learned the basics of how MCP works, used an MCP server for document ingestion, and followed it up with RAG!

Ready to learn more? Check out the Unstructured MCP server and start building powerful document processing workflows today!
