Mar 26, 2025
Building an End-to-End Data Pipeline with Custom NER on Unstructured using MCP
Tarun Narayanan
Unstructured
Unstructured offers an effective and user-friendly platform for handling unstructured data, which is essential for organizations building GenAI applications. The Model Context Protocol (MCP) is pivotal to this: it connects Unstructured's advanced data processing capabilities with LLM interfaces like Claude Desktop, making interactions with the ETL+ platform simpler and more efficient. MCP provides a standardized method for applications to supply context to LLMs, effectively serving as a universal connector. This post will guide you through building a bespoke end-to-end data pipeline with custom Named Entity Recognition (NER) enrichment on the Unstructured platform, utilizing the power and simplicity of MCP. You can find a video demo of this here.
Demystifying MCP
The Model Context Protocol (MCP) is an open standard pioneered by Anthropic. It establishes a common language that enables LLMs to communicate effectively with external applications and data sources. Envision MCP as a "USB-C port" for AI applications, providing a consistent and standardized way for LLMs to connect with diverse data repositories and tools.
The architecture of MCP follows a client-server model. In this framework, LLM applications function as clients, initiating requests for information or actions. Conversely, services like the Unstructured API expose their capabilities through MCP servers, which handle these requests and provide responses.
MCP servers can offer three primary types of functionalities:
Resources: file-like data accessible to clients
Tools: functions that LLMs can invoke
Prompts: pre-written templates designed to assist users in accomplishing specific tasks
For integrating with the Unstructured API, the focus lies primarily on "Tools" that manage data processing workflows and connectors. The standardization built into MCP doesn't just make the integration process easier; it also helps create a more connected AI ecosystem.
The adoption of MCP within the Unstructured Platform offers several key advantages for developers and LLM integrations:
A standardized approach to AI integration
Flexibility in choice of LLM models and vendors
Enhanced security as data can remain within the user's infrastructure
Support for scalable and complex workflows
Reusability of connectors and tools
Unstructured API and MCP: A Powerful Combination
MCP acts as the key that unlocks the full potential of the Unstructured API for LLM-powered applications. The Unstructured MCP server serves as a powerful interface to the Unstructured API, granting access to a comprehensive suite of capabilities. These include the ability to manage document processing workflows, handle diverse source and destination connectors, and extract structured content from a wide array of document formats.
By constructing an MCP server that leverages the Unstructured API, you can interact with these functionalities using natural language commands issued through LLM clients. This integration streamlines the process of managing workflows and connectors:
Provisioning of Unstructured data for GenAI systems through simple natural language instructions
Automation of document processing pipelines through conversational commands
Streamlined management of various source and destination connectors
Hands-on Example: Building Your Custom Pipeline
Consider a practical scenario: the development of an end-to-end pipeline designed to process documents residing in an Amazon S3 bucket. This pipeline will incorporate a custom NER component to identify specific entities within the documents and subsequently store the processed results in another designated S3 bucket.
Step-by-Step Process
Step 1: Setting up Source and Destination Connectors
Within an MCP-enabled client, such as Claude Desktop, natural language prompts can be used to instruct the Unstructured MCP server to establish the necessary connectors. For authentication, you have two options:
1) Provide a .env file that the server automatically reads from
2) When using Claude Desktop, provide environment variables through the Claude Desktop config JSON
For instance, in our demo, we used natural language to create our connectors:
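The exact prompt from the demo isn't reproduced here, but a request along these lines does the job (the connector and bucket names below are placeholders):

```
Create an S3 source connector called "ner-demo-source" that reads documents
from s3://my-input-bucket/docs/
```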
This natural language command triggers the create_s3_source tool within the Unstructured MCP server. Similarly, we create a destination connector:
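Again, as an illustration with placeholder names rather than the demo's exact wording:

```
Now create an S3 destination connector called "ner-demo-destination" that
writes results to s3://my-output-bucket/processed/
```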
The MCP server functions as an intelligent intermediary, translating these natural language instructions into precise API calls that the Unstructured platform can understand and execute.
Step 2: Crafting a Custom Workflow
A custom workflow is defined using the create_workflow MCP tool and its workflow_config parameter. In our demo, we provided a workflow configuration that includes a VLM partitioner and a custom NER component. Here's the workflow configuration we used:
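The demo's exact configuration isn't reproduced here; the sketch below illustrates the general shape of such a workflow_config, with one VLM partitioner node and one NER enrichment node. The field names, node subtypes, and model identifier are illustrative and should be checked against the current Unstructured Workflows API documentation:

```json
{
  "name": "s3-ner-pipeline",
  "source_id": "<source-connector-id>",
  "destination_id": "<destination-connector-id>",
  "workflow_type": "custom",
  "workflow_nodes": [
    {
      "name": "vlm-partitioner",
      "type": "partition",
      "subtype": "vlm",
      "settings": {
        "provider": "anthropic",
        "model": "claude-3-5-sonnet"
      }
    },
    {
      "name": "custom-ner",
      "type": "prompter",
      "subtype": "openai_ner",
      "settings": {}
    }
  ]
}
```

Note how source_id and destination_id reference the connectors created in Step 1; this is what wires the pipeline together.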
This configuration specifies:
A VLM partitioner using Anthropic's Claude 3.5 Sonnet model
A custom NER component using OpenAI to extract named entities
The source and destination connectors created in Step 1
MCP enables the creation of highly tailored data processing pipelines by letting developers define specific workflow configurations, including integration of custom components such as NER plugins.
Step 3: Connecting the Dots
The link between the created source and destination connectors and the workflow is established within the workflow_config. By referencing the unique IDs assigned to the source and destination connectors in the source_id and destination_id fields, the data flow for the pipeline is explicitly defined.
The workflow configuration serves as the central orchestrator, connecting the various components of the data pipeline, including the data source, the processing steps (partitioning and custom NER), and the final destination for the processed data.
Step 4: Executing the Workflow
To initiate the data processing pipeline, we use natural language to run the workflow:
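A simple instruction like the following (wording illustrative, not the demo's exact prompt) is enough:

```
Run the workflow we just created.
```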
This triggers the run_workflow MCP tool, which starts processing documents from the source S3 bucket, applies the VLM partitioning and NER enrichment, and stores the results in the destination S3 bucket.
Step 5: Monitoring Workflow Status
To track the progress and status of the running workflow, we can ask:
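For example, a prompt along these lines (again, illustrative wording):

```
What is the status of my workflow?
```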
This uses the get_workflow_info MCP tool to retrieve the current status of the workflow. In our demo, we confirmed that the workflow was in the "active" state, indicating that it was currently processing our data.
Key Components Built
The end-to-end pipeline constructed in this example comprises the following key components:
A source S3 connector pointing to our input documents
A destination S3 connector for storing processed results
A custom workflow configured with:
A VLM partitioner using Claude 3.5 Sonnet
A custom NER component for entity extraction
Integrating with Claude Desktop
To enable interaction with the Unstructured MCP server from within Claude Desktop, a configuration file needs to be created. Navigate to ~/Library/Application Support/Claude/ and create a file named claude_desktop_config.json. The content of this file should define the Unstructured MCP server and its execution parameters. Below is an example of the configuration:
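The mcpServers structure below follows Claude Desktop's documented configuration format; the command, server path, and environment variable values are placeholders that you should adapt to your own setup:

```json
{
  "mcpServers": {
    "unstructured": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/your/mcp-server", "run", "server.py"],
      "env": {
        "UNSTRUCTURED_API_KEY": "<your-unstructured-api-key>"
      }
    }
  }
}
```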
After creating this configuration file and restarting Claude Desktop, a small hammer icon will typically appear within the Claude interface, indicating the availability of MCP tools. You can then interact with these tools using natural language prompts as demonstrated in our example.
For debugging and testing the MCP server, the MCP Inspector tool can be accessed by running mcp dev server.py in your terminal.
Conclusion
Combining MCP with the Unstructured API creates a robust and flexible framework for building custom data pipelines, especially when adding specialized components like custom NER. This approach is beneficial because it is standardized, works well with LLM applications, and can manage complex data processing tasks through natural language.
We encourage you to further explore the Unstructured API and the provided MCP server codebase available on the Unstructured GitHub repository. Experimenting with different connectors and workflow configurations will unlock the full potential of this integration.