Mar 26, 2025
Building an End-to-End Data Pipeline with Custom NER on Unstructured using MCP
Tarun Narayanan
Unstructured
Unstructured offers an effective and user-friendly platform for handling unstructured data, which is essential for organizations building GenAI applications. The Model Context Protocol (MCP) is pivotal to this: it connects Unstructured's advanced data processing capabilities with LLM interfaces like Claude Desktop, making interactions with the ETL+ platform simpler and more efficient. MCP provides a standardized method for applications to supply context to LLMs, effectively serving as a universal connector. This post will guide you through building a bespoke end-to-end data pipeline with custom Named Entity Recognition (NER) enrichment on the Unstructured platform, utilizing the power and simplicity of MCP. You can find a video demo of this here.
Demystifying MCP
The Model Context Protocol (MCP) is an open standard pioneered by Anthropic. It establishes a common language that enables LLMs to communicate effectively with external applications and data sources. Envision MCP as a "USB-C port" for AI applications, providing a consistent and standardized way for LLMs to connect with diverse data repositories and tools.
The architecture of MCP follows a client-server model. In this framework, LLM applications function as clients, initiating requests for information or actions. Conversely, services like the Unstructured API expose their capabilities through MCP servers, which handle these requests and provide responses.
MCP servers can offer three primary types of functionalities:
Resources: file-like data accessible to clients
Tools: functions that LLMs can invoke
Prompts: pre-written templates designed to assist users in accomplishing specific tasks
For integrating with the Unstructured API, the focus lies primarily on "Tools" that manage data processing workflows and connectors. The standardization built into MCP doesn't just make the integration process easier; it also helps create a more connected AI ecosystem.
The adoption of MCP within the Unstructured Platform offers several key advantages for developers and LLM integrations:
A standardized approach to AI integration
Flexibility in choice of LLM models and vendors
Enhanced security as data can remain within the user's infrastructure
Support for scalable and complex workflows
Reusability of connectors and tools
Unstructured API and MCP: A Powerful Combination
MCP acts as the key that unlocks the full potential of the Unstructured API for LLM-powered applications. The Unstructured MCP server serves as a powerful interface to the Unstructured API, granting access to a comprehensive suite of capabilities. These include the ability to manage document processing workflows, handle diverse source and destination connectors, and extract structured content from a wide array of document formats.
By constructing an MCP server that leverages the Unstructured API, you can interact with these functionalities using natural language commands issued through LLM clients. This integration streamlines the process of managing workflows and connectors:
Provisioning of Unstructured data for GenAI systems through simple natural language instructions
Automation of document processing pipelines through conversational commands
Streamlined management of various source and destination connectors
Hands-on Example: Building Your Custom Pipeline
Consider a practical scenario: the development of an end-to-end pipeline designed to process documents residing in an Amazon S3 bucket. This pipeline will incorporate a custom NER component to identify specific entities within the documents and subsequently store the processed results in another designated S3 bucket.
Step-by-Step Process
Step 1: Setting up Source and Destination Connectors
Within an MCP-enabled client, such as Claude Desktop, natural language prompts can be used to instruct the Unstructured MCP server to establish the necessary connectors. For authentication, you have two options:
1) Provide a .env file that the server automatically reads from
2) When using Claude Desktop, provide environment variables through the Claude Desktop config JSON
For instance, in our demo, we used natural language to create our connectors:
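The exact prompt from the demo isn't reproduced here, but a request along these lines does the job (the connector and bucket names below are placeholders):

```
Create an S3 source connector called "ner-demo-source" that reads documents
from s3://my-input-bucket/docs/
```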
This natural language command triggers the create_s3_source tool within the Unstructured MCP server. Similarly, we create a destination connector:
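Again, as an illustration with placeholder names rather than the demo's exact wording:

```
Now create an S3 destination connector called "ner-demo-destination" that
writes results to s3://my-output-bucket/processed/
```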
The MCP server functions as an intelligent intermediary, translating these natural language instructions into precise API calls that the Unstructured platform can understand and execute.
Step 2: Crafting a Custom Workflow
A custom workflow is defined using the create_workflow MCP tool and its workflow_config parameter. In our demo, we provided a workflow configuration that includes a VLM partitioner and a custom NER component. Here's the workflow configuration we used:
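The demo's exact configuration isn't reproduced here; the sketch below illustrates the general shape of such a workflow_config, with one VLM partitioner node and one NER enrichment node. The field names, node subtypes, and model identifier are illustrative and should be checked against the current Unstructured Workflows API documentation:

```json
{
  "name": "s3-ner-pipeline",
  "source_id": "<source-connector-id>",
  "destination_id": "<destination-connector-id>",
  "workflow_type": "custom",
  "workflow_nodes": [
    {
      "name": "vlm-partitioner",
      "type": "partition",
      "subtype": "vlm",
      "settings": {
        "provider": "anthropic",
        "model": "claude-3-5-sonnet"
      }
    },
    {
      "name": "custom-ner",
      "type": "prompter",
      "subtype": "openai_ner",
      "settings": {}
    }
  ]
}
```

Note how source_id and destination_id reference the connectors created in Step 1; this is what wires the pipeline together.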
This configuration specifies:
A VLM partitioner using Anthropic's Claude 3.5 Sonnet model
A custom NER component using OpenAI to extract named entities
The source and destination connectors created in Step 1
MCP enables the creation of highly tailored data processing pipelines by letting developers define specific workflow configurations, including integration of custom components such as NER plugins.
Step 3: Connecting the Dots
The link between the created source and destination connectors and the workflow is established within the workflow_config. By referencing the unique IDs assigned to the source and destination connectors in the source_id and destination_id fields, the data flow for the pipeline is explicitly defined.
The workflow configuration serves as the central orchestrator, connecting the various components of the data pipeline, including the data source, the processing steps (partitioning and custom NER), and the final destination for the processed data.
Step 4: Executing the Workflow
To initiate the data processing pipeline, we use natural language to run the workflow:
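A simple instruction like the following (wording illustrative, not the demo's exact prompt) is enough:

```
Run the workflow we just created.
```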
This triggers the run_workflow MCP tool, which starts processing documents from the source S3 bucket, applies the VLM partitioning and NER enrichment, and stores the results in the destination S3 bucket.
Step 5: Monitoring Workflow Status
To track the progress and status of the running workflow, we can ask:
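For example, a prompt along these lines (again, illustrative wording):

```
What is the status of my workflow?
```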
This uses the get_workflow_info MCP tool to retrieve the current status of the workflow. In our demo, we confirmed that the workflow was in the "active" state, indicating that it was currently processing our data.
Key Components Built
The end-to-end pipeline constructed in this example comprises the following key components:
A source S3 connector pointing to our input documents
A destination S3 connector for storing processed results
A custom workflow configured with:
A VLM partitioner using Claude 3.5 Sonnet
A custom NER component for entity extraction
Integrating with Claude Desktop
To enable interaction with the Unstructured MCP server from within Claude Desktop, a configuration file needs to be created. Navigate to ~/Library/Application Support/Claude/ and create a file named claude_desktop_config.json. The content of this file should define the Unstructured MCP server and its execution parameters. Below is an example of the configuration:
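The mcpServers structure below follows Claude Desktop's documented configuration format; the command, server path, and environment variable values are placeholders that you should adapt to your own setup:

```json
{
  "mcpServers": {
    "unstructured": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/your/mcp-server", "run", "server.py"],
      "env": {
        "UNSTRUCTURED_API_KEY": "<your-unstructured-api-key>"
      }
    }
  }
}
```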
After creating this configuration file and restarting Claude Desktop, a small hammer icon will typically appear within the Claude interface, indicating the availability of MCP tools. You can then interact with these tools using natural language prompts as demonstrated in our example.
For debugging and testing the MCP server, the MCP Inspector tool can be accessed by running mcp dev server.py in your terminal.
Conclusion
Combining MCP with the Unstructured API creates a robust and flexible framework for building custom data pipelines, especially when adding specialized components like custom NER. This approach is beneficial because it is standardized, works well with LLM applications, and can manage complex data processing tasks through natural language.
We encourage you to further explore the Unstructured API and the provided MCP server codebase available on the Unstructured GitHub repository. Experimenting with different connectors and workflow configurations will unlock the full potential of this integration.