Apr 22, 2025

One Month into MCP: Building an Interface Between LLMs and Unstructured

Maria Khalusova

Unstructured

A little over a month ago, we kicked off development on our official Unstructured MCP Server. What started as a prototype has quickly evolved into a powerful new interface that bridges the Unstructured Platform with a variety of LLM-based tools and agents—using a common language: Model Context Protocol (MCP).

Why MCP?

For those unfamiliar, MCP (Model Context Protocol) is an open protocol designed to standardize how applications provide context to LLMs: instead of custom integrations for every new model or tool, MCP provides a unified approach. Once you implement your MCP server, tools like Claude Desktop, Windsurf, and any custom LLM agent can use it out of the box.
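Concretely, an MCP server advertises its tools as structured descriptors: each tool carries a name, a description, and a JSON Schema for its inputs, which any MCP client can discover at runtime. Here is a minimal sketch of such a descriptor; the tool name and fields are illustrative, not the actual Unstructured tool definitions:

```python
# A tool descriptor in the shape MCP clients receive from a tools/list call.
# The name and schema below are illustrative, not Unstructured's actual tools.
tool_descriptor = {
    "name": "list_workflows",
    "description": "List workflows on the Unstructured Platform.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "description": "Optional filter, e.g. 'active' or 'inactive'.",
            },
        },
        "required": [],
    },
}

# Any MCP client can inspect this uniformly, regardless of which server sent it.
def summarize(tool: dict) -> str:
    params = ", ".join(tool["inputSchema"]["properties"]) or "no parameters"
    return f"{tool['name']}({params}): {tool['description']}"

print(summarize(tool_descriptor))
```

Because every server speaks this same shape, a client written once can drive any MCP server, which is exactly what makes the "implement once, use everywhere" promise work.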

From Proof-of-Concept to Production-Ready

We started small. In the first iteration, the MCP server only supported a subset of the workflow functionality, and only a couple of connectors. Fast-forward to today, and the Unstructured MCP server now supports a rich suite of tools across the Unstructured API.

🛠️ Tools Available

Our MCP server now offers over 20 tools to manage sources, destinations, workflows, and jobs. Here are some highlights:

  • Create/update/delete source and destination connectors

  • Get detailed metadata and connection info

  • List, create, update, and execute workflows

  • Monitor and cancel jobs

Explore the full list of tools in the README.

🌐 Supported Connectors

We’ve expanded far beyond the initial S3 connectors. Currently supported connectors include:

Sources: S3, Azure Blob Storage, Google Drive, OneDrive, Salesforce, SharePoint

Destinations: S3, Weaviate, Pinecone, AstraDB, MongoDB, Neo4j, Databricks Volumes & Delta Tables

🔗 Each connector requires credentials via .env, with full setup instructions available here.
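As a rough sketch of how that wiring typically looks (the variable names below are illustrative; consult the linked setup instructions for the actual ones), the server reads each connector's credentials from environment variables and can fail fast when one is missing:

```python
import os

# Illustrative variable names; see the setup docs for the real ones per connector.
REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]

def check_credentials(required: list[str]) -> list[str]:
    """Return the names of any required environment variables that are unset."""
    return [name for name in required if not os.getenv(name)]

missing = check_credentials(REQUIRED_VARS)
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```

Checking up front like this turns a confusing mid-workflow connector failure into a clear configuration error at startup.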

🔍 Firecrawl Integration: The Web as a Data Source

We also integrated Firecrawl, a powerful web crawler with LLM-optimized output. You can now:

  • Crawl entire websites

  • Generate clean, structured outputs with Unstructured’s own schema

  • Upload the results into any of the supported destinations

Firecrawl jobs are launched and monitored entirely through our MCP interface. Just don’t forget your FIRECRAWL_API_KEY!

🧠 Lessons Learned

Building the Unstructured MCP server has been an exciting journey and we have learned a thing or two. Here are some key lessons:

The Tool Count Matters: Less Can Be More

The Unstructured API offers rich functionality: many connectors, actions to manage them, and workflow management on top. We quickly realized that mapping API functionality to MCP tools one-to-one would be poor MCP design. Too many tools can overwhelm and confuse an LLM, making it harder for the model to pick the right tool for the task at hand. Worse, an excessive number of tools creates a documentation problem: the tool descriptions quickly consume the context window the LLM has to work with. This is why it's critical to find a balance in the number of tools available.

To ease the context window management challenge, we abstracted all of the connector management functionality. That change alone cut context window usage by 5,000 tokens!
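As a rough illustration of that consolidation (the names below are hypothetical, not the server's actual tools): instead of exposing one tool per connector, a single generic tool can take the connector type as a parameter and dispatch internally, so the LLM sees one schema instead of a dozen near-identical ones:

```python
# Hypothetical sketch: one generic tool replaces per-connector tools like
# create_s3_source, create_gdrive_source, create_onedrive_source, ...
CONNECTOR_CONFIGS = {
    "s3": {"required": ["remote_url"]},
    "google_drive": {"required": ["drive_id"]},
    "onedrive": {"required": ["client_id", "path"]},
}

def create_source(connector_type: str, name: str, config: dict) -> dict:
    """Single entry point: validate per-connector config, then dispatch."""
    spec = CONNECTOR_CONFIGS.get(connector_type)
    if spec is None:
        raise ValueError(f"Unknown connector type: {connector_type}")
    missing = [k for k in spec["required"] if k not in config]
    if missing:
        raise ValueError(f"Missing config keys: {missing}")
    # In the real server this would call the Unstructured API.
    return {"type": connector_type, "name": name, "config": config}

source = create_source("s3", "my-bucket", {"remote_url": "s3://my-bucket/"})
```

The per-connector details move from N tool descriptions into one parameter's documentation, which is where most of the token savings come from.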

Input Schema is Expensive but Vital

A valuable lesson we’ve learned while developing the MCP server is how important a clear, well-defined input schema is. Building a good schema can be expensive in terms of token consumption, but good typing can provide all the necessary details to the LLM so that it can use the tools precisely as intended.
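To make that trade-off concrete, here is an illustrative schema for a hypothetical workflow-creation tool (not the server's actual definition): every description and enum below costs context tokens, but each one also removes a way for the model to call the tool incorrectly.

```python
# Illustrative input schema for a hypothetical workflow-creation tool.
# Every description and enum costs context tokens but constrains the LLM's calls.
CREATE_WORKFLOW_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Human-readable workflow name."},
        "source_id": {"type": "string", "description": "ID of an existing source connector."},
        "destination_id": {"type": "string", "description": "ID of an existing destination connector."},
        "schedule": {
            "type": "string",
            "enum": ["hourly", "daily", "weekly"],
            "description": "How often the workflow runs.",
        },
    },
    "required": ["name", "source_id", "destination_id"],
}

def validate(args: dict, schema: dict) -> list[str]:
    """Tiny validator: report missing required fields and bad enum values."""
    errors = [f"missing: {k}" for k in schema["required"] if k not in args]
    for key, spec in schema["properties"].items():
        if key in args and "enum" in spec and args[key] not in spec["enum"]:
            errors.append(f"invalid {key}: {args[key]!r}")
    return errors

print(validate({"name": "ingest", "schedule": "monthly"}, CREATE_WORKFLOW_SCHEMA))
```

A schema this explicit lets the model self-correct before the call ever reaches the API, which is usually worth the extra tokens it occupies.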

A Subset of Tools May Be What You Need

While the Unstructured MCP server supports many different tools, the reality is that not all tools are needed in every situation. Depending on your use case, you can restrict the MCP server to a smaller, more targeted set of tools. This reduces token usage for the tool documentation, may improve performance, and makes it easier for LLMs to understand and execute their tasks.

Let’s take the Pydantic AI framework as an example (we’re using it for a secret project—shhh). By extending the base MCPServerHTTP class, we can pull into the agent just the parts we actually need:

import os
from dataclasses import dataclass

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerHTTP
from pydantic_ai.tools import ToolDefinition


@dataclass
class MCPServerHTTPWithSubsetTools(MCPServerHTTP):
    """An MCP server wrapper that only exposes an allow-listed subset of tools."""

    selected_tools: list[str]

    async def list_tools(self) -> list[ToolDefinition]:
        all_available_tools = await super().list_tools()
        return [
            tool for tool in all_available_tools if tool.name in self.selected_tools
        ]


# Expose only the S3 source-management tools to this agent.
s3_tools = MCPServerHTTPWithSubsetTools(
    url=os.getenv("UNSTRUCTURED_MCP_SERVER_URL"),
    selected_tools=[
        "create_s3_source",
        "update_s3_source",
        "delete_s3_source",
    ],
)

source_configuration_agent = Agent(
    system_prompt="Your goal is to configure an S3 source in the Unstructured platform",
    mcp_servers=[s3_tools],
)

LLMs Can Be Unexpectedly Creative (And Autonomous)

This is where things get really interesting. While we were still in the process of refining the Unstructured MCP server, something unexpected happened — Claude Desktop figured out how to build workflows in the platform by itself! 🤯

Here’s what happened: it didn’t know the input parameters for a new workflow DAG, so instead, it listed existing workflows, picked one that was already configured, and used it as a template — inferring the input schema on its own, without explicit instructions.

This was a fun and impressive moment, but it also underscored something important: with MCP servers, you provide the tools, but you can’t always predict how LLMs will use them. Their creativity can be powerful — and unpredictable. It was a reminder that while LLMs can autonomously solve problems, they still need thoughtful guardrails to prevent unintended outcomes.

📈 What’s Next

This is just the beginning. We’re continuously adding support for more source and destination connectors, releasing example notebooks covering a variety of workflows, optimizing tool loading and configuration, and polishing the developer experience across MCP and different MCP clients. You can track progress in our CHANGELOG.md.

Curious to try the Unstructured MCP Server for yourself? Check out the GitHub repo: github.com/Unstructured-IO/UNS-MCP

Let us know what you think, and stay tuned for more updates!
