Jan 16, 2025

Enterprise RAG: Why Connectors Matter in Production Systems

Unstructured

RAG

In the initial excitement about the possibilities of AI applications, teams spent months building Retrieval Augmented Generation (RAG) prototypes over samples of data and experimenting with the latest models and techniques. The focus has since shifted to building AI systems that deliver real business value, and the fundamental challenge now lies in accessing and preparing the vast amounts of enterprise data needed to power AI in production. This is where Unstructured's extensive connector ecosystem comes into play, bridging the gap between raw enterprise data and AI-ready formats.

Universal Connectivity

Your organization's knowledge is scattered across SharePoint, Confluence, Salesforce, and various blob stores and databases. Each system has its own authentication methods, data formats, and access patterns. Furthermore, choosing and configuring appropriate destinations for the processed data, whether vector databases, search services, or cloud storage, introduces its own set of complexities related to schema, performance, and integration. Traditionally, building secure, reliable connections to all of these systems would require months of development work followed by ongoing maintenance, which is typically even more time-consuming than the initial development.

Unstructured's connector ecosystem eliminates this complexity by offering (at the time of this writing) 71 pre-built connectors through our Serverless API and Ingest library, enabling 1,250+ unique pipelines between sources and destinations. Our Platform currently supports 30 enterprise-grade connectors (15 sources and 15 destinations), with rapid expansion ongoing – any connector available in the Ingest library will soon be available in the Platform.

Source Connectors

Unstructured source connectors span the entire enterprise data landscape:

  • Cloud Storage: Azure Blob Storage, Google Cloud Storage, Amazon S3

  • Collaboration Platforms: Confluence, SharePoint, Box, Dropbox

  • Business Applications: Salesforce, HubSpot, Jira

  • Databases: MongoDB, PostgreSQL, Snowflake, Astra DB, SingleStore DB

  • Communication Platforms: Slack, Discord, Outlook

  • Cloud-based Document Storage: Google Drive, OneDrive

  • Development Tools: GitHub, GitLab

  • Streaming Platforms: Kafka

  • And many more, from Airtable to Wikipedia

These are the systems from which you may want to pull the ground truth for your AI applications.

Destination Connectors

Once the data has been ingested from one or more sources and transformed into a RAG-ready format, it needs to be made available for retrieval. You likely have strong opinions about which databases, search engines, and vector stores to use for this, and we’ve got you covered!

Unstructured supports all major vector databases and storage solutions:

  • Vector Stores: Pinecone, Weaviate, Milvus (including Zilliz Cloud and Milvus on IBM watsonx.data), Astra DB

  • Search Services: Azure AI Search, Elasticsearch, OpenSearch

  • Cloud Storage: Azure Blob Storage, Google Cloud Storage, S3

  • Traditional Databases: PostgreSQL, MongoDB, Snowflake

  • Specialized Vector DBs: Chroma DB, Qdrant, LanceDB, KDB.AI

  • Graph DBs: Neo4j

  • Data Storage and Management Platforms: Databricks Volumes

The Power of Platform

While the Unstructured Serverless API, together with the Ingest library, supports single-source to single-destination pipelines, the Platform goes further by enabling:

  • Multiple source connections in a single pipeline: bring all of your data from all of your sources into a single destination

  • Outputs delivered to multiple destinations: keep backup destinations, or use multiple destinations for experiments

  • Exponentially more pipeline possibilities through combinatorial configurations of multiple sources and multiple destinations

  • Production-grade workload scaling and scheduling
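
To put a rough number on the combinatorial claim above, here is a small worked calculation. It assumes a counting model that this post does not spell out: a pipeline may use any non-empty subset of the available sources and any non-empty subset of the available destinations. The function names are illustrative, and the inputs are the 15 sources and 15 destinations mentioned earlier for the Platform.

```python
def single_pipelines(sources: int, destinations: int) -> int:
    """One source paired with one destination."""
    return sources * destinations


def multi_pipelines(sources: int, destinations: int) -> int:
    """Any non-empty set of sources paired with any non-empty set of destinations.

    Assumption: every subset combination counts as a distinct pipeline.
    """
    return (2**sources - 1) * (2**destinations - 1)


print(single_pipelines(15, 15))  # 225
print(multi_pipelines(15, 15))   # 1073676289
```

Under that assumption, the count grows exponentially with the number of connectors instead of multiplicatively.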

Beyond Simple File Transfer

What sets these Unstructured connectors apart isn't just their quantity – it's the intelligence they unlock when integrated into the Unstructured Platform. The platform doesn’t simply use the connectors to move files from point A to point B; instead, connectors play a role in ensuring that:

1. Critical context is preserved: Unlike documents in academic datasets, enterprise documents don't exist in a vacuum as standalone content; they carry history, origins, and context. Our source connectors capture vital metadata about update history, authorship, origins, and other organizational context, ensuring no valuable information is lost in transit. Learn more about metadata in our documentation: common metadata and connector-specific metadata.

2. Knowledge representation is standardized: Through our own carefully developed document ontology, we can transform content from disparate sources into a unified, standardized format. Whether your knowledge lives in Confluence pages, Slack messages, or SharePoint documents, it all gets converted into the same canonical JSON schema. This standardization is what makes it possible to seamlessly combine knowledge from different enterprise systems – a Salesforce record can be used in a RAG application alongside a technical document from SharePoint. This unified representation enables truly comprehensive knowledge retrieval and analysis across your entire enterprise data landscape.

3. Synchronization is smart: To optimize costs and processing time, connectors can intelligently identify and process only new or changed content, making incremental updates efficient and economical. Consider a use case with 100,000 documents, where about 2,000 documents (2%) change daily. Your RAG system must have access to up-to-date information at all times; otherwise it won’t be useful. Without smart incremental processing, you'd need to reprocess and re-embed the entire document base daily to maintain freshness, leading to significant processing and storage costs. With smart incremental processing, you only handle those 2,000 changed documents, reducing the monthly cost by over 90%. This difference becomes even more dramatic at enterprise scale: for a large organization with 5M documents requiring hourly updates, the cost savings can reach into the millions annually, not to mention the operational benefits of reduced system load and faster update cycles.

4. Structured and unstructured data come together: Unstructured connectors bridge the traditional gap between structured and unstructured data sources, enabling new hybrid use cases. The fusion of structured and unstructured data creates a richer knowledge foundation – imagine a support ticket RAG system that not only understands the content of support documentation but also knows about customer history, product configurations, and related bug reports. By breaking down these data silos, organizations can build more intelligent AI applications that leverage both the precision of structured data and the rich context of unstructured content, all processed and standardized by the same Platform.
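
The incremental synchronization described in point 3 can be illustrated with a minimal sketch. This is not how the connectors are implemented: it assumes change detection by content hash (a real connector might instead rely on version or last-modified metadata from the source), and `fingerprint` and `select_changed` are hypothetical names. The arithmetic at the end reuses the 100,000 and 2,000 document figures from the example, and additionally assumes a 30-day month, cost proportional to documents processed, and no accounting for the one-time initial load.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable content hash used to detect changes between runs."""
    return hashlib.sha256(content).hexdigest()


def select_changed(documents: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return IDs of new or modified documents and record their hashes in `seen`."""
    changed = []
    for doc_id, content in documents.items():
        digest = fingerprint(content)
        if seen.get(doc_id) != digest:
            changed.append(doc_id)
            seen[doc_id] = digest
    return changed


# Worked numbers from the example above (30-day month is an assumption).
TOTAL_DOCS = 100_000
CHANGED_PER_DAY = 2_000
DAYS = 30

full_reprocessing = TOTAL_DOCS * DAYS    # 3,000,000 documents per month
incremental = CHANGED_PER_DAY * DAYS     # 60,000 documents per month
reduction = 1 - incremental / full_reprocessing

print(f"{reduction:.0%}")  # 98%
```

Under these assumptions the monthly volume drops by 98%, which is consistent with the "over 90%" figure quoted above.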

Enterprise-Grade Security by Design

For teams working with enterprise data, security isn't just a checkbox – it's a fundamental requirement that can make or break an AI project’s implementation. Enterprise data often contains sensitive information, intellectual property, and regulated content that requires careful handling. 

For our enterprise customers with stringent security requirements, we offer customizable Virtual Private Cloud (VPC) deployment options across major cloud providers. This deployment model ensures complete isolation of your data processing infrastructure while maintaining seamless integration with your existing cloud environment.

At the same time, whether you’re using the Unstructured Platform hosted by us or an in-VPC deployment, we have made sure that Unstructured's connector ecosystem is built with a security-first mindset. The following security features are consistent across all deployment types.

Comprehensive Authentication Options

Different enterprise systems require different security approaches. Where applicable, our connectors support multiple authentication methods to ensure secure access while maintaining flexibility:

  • OAuth 2.0 for cloud services and modern APIs

  • API key-based authentication

  • Basic authentication with encrypted credentials

  • Service account authentication for cloud platforms

  • Token-based authentication
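
For readers less familiar with how some of these methods look on the wire, the sketch below builds the HTTP headers typically used for basic, bearer-token (the usual way an OAuth 2.0 access token is presented), and API-key authentication. It is a generic illustration, not the connectors' configuration interface: the function names are hypothetical, and `X-API-Key` is only a common convention, since each service defines its own header name.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """HTTP Basic: base64 of 'username:password'. Only safe over TLS."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_header(access_token: str) -> dict[str, str]:
    """Bearer token, e.g. an OAuth 2.0 access token."""
    return {"Authorization": f"Bearer {access_token}"}


def api_key_header(key: str, header_name: str = "X-API-Key") -> dict[str, str]:
    """API key in a custom header; the header name varies by service."""
    return {header_name: key}


print(basic_auth_header("user", "pass"))  # {'Authorization': 'Basic dXNlcjpwYXNz'}
```

Note that base64 in basic authentication is an encoding, not encryption, which is why encrypted transport and encrypted credential storage matter.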

Zero Data Persistence

One of the most critical security features is our zero-persistence architecture designed to ensure your data remains secure and under your control at all times. Here's how it works in Unstructured Platform:

  • No Persistent Storage of Data: Neither original nor processed data is stored persistently outside your environment. Temporary data created during processing is systematically cleaned up and deleted by a dedicated process when each pipeline step completes.

  • Minimal Database Footprint: Our database stores no actual user data. Instead, it relies on hashed or UUID pointers to maintain references, ensuring sensitive information is never retained.

  • No User Data Logging: We prioritize privacy by ensuring that no user data is logged during operations, reducing the risk of exposure.

  • Compliance with Data Residency Requirements: Our architecture is designed to align with data residency regulations, ensuring your data stays within the required geographic boundaries.
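
The first two points can be pictured with a small analogy in code. This is a sketch under stated assumptions, not the Platform's architecture: it uses a scoped temporary directory in place of the dedicated cleanup process, a SHA-256 hash of a document ID as the stored pointer, and a hypothetical `process_step` function whose "processing" is just a byte count.

```python
import hashlib
import tempfile
from pathlib import Path


def process_step(doc_id: str, raw: bytes) -> tuple[str, int, Path]:
    """Process a document in scratch space that is removed when the step ends."""
    # Only this opaque pointer would be kept as a reference, never the content.
    pointer = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()

    with tempfile.TemporaryDirectory() as workdir:
        scratch = Path(workdir) / "input.bin"
        scratch.write_bytes(raw)
        size = scratch.stat().st_size  # stand-in for real processing
    # Leaving the `with` block deletes the directory and everything in it.

    return pointer, size, scratch


pointer, size, scratch = process_step("doc-1", b"hello")
print(size, scratch.exists())  # 5 False
```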

End-to-End Encryption

Our connectors ensure secure data transmission using TLS encryption for data in transit. While the specific encryption protocol depends on the service provider being connected to, TLS is the industry standard and is supported by nearly all modern services. For services that require alternative protocols, we strictly adhere to best practices to maintain secure connections, ensuring your data is always protected during transfer.
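
As an illustration of what enforcing TLS looks like on the client side, the sketch below configures Python's standard `ssl` module to verify certificates and hostnames and to refuse protocol versions older than TLS 1.2. The TLS 1.2 floor and the `strict_tls_context` name are assumptions made for this example; the post does not state which minimum version the connectors enforce.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Client context with certificate and hostname verification enabled."""
    context = ssl.create_default_context()
    # Assumption for this sketch: reject anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)  # True True
```

A context like this can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=context)`.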

Secure Credential Handling

Protecting access credentials is as important as protecting the data itself, and the Unstructured Platform implements rigorous security measures to achieve this:

  • Unique RSA Keypair per Organization: Each organization is assigned a unique RSA keypair. The private key is securely stored in a vault, ensuring it remains protected.

  • Double Encryption of Credentials: Connector credentials are encrypted with the organization's public key before leaving the browser. The encrypted credentials are then stored in a separate vault, adding an extra layer of security.

  • Pointer-Based Access: When executing jobs, only a pointer to the encrypted credentials is transmitted. Credentials are accessible exclusively by a dedicated security component and only on a strictly need-to-know basis.

  • Isolated Credential Access: Access to credentials is restricted to the connector component alone. No other part of the pipeline can access them, ensuring a secure and isolated handling process.
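
The pointer-based and isolated access patterns above can be sketched as follows. Everything here is hypothetical and simplified: `CredentialVault` is an in-memory stand-in for a real vault, the stored bytes are assumed to be already encrypted with the organization's public key, and the component names are placeholders.

```python
import uuid


class CredentialVault:
    """In-memory stand-in for a vault that holds already-encrypted credentials."""

    ALLOWED_COMPONENT = "connector"  # hypothetical component name

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def store(self, encrypted_credentials: bytes) -> str:
        """Store ciphertext and return an opaque pointer to it."""
        pointer = str(uuid.uuid4())
        self._blobs[pointer] = encrypted_credentials
        return pointer

    def resolve(self, pointer: str, component: str) -> bytes:
        """Release the ciphertext only to the allowed component."""
        if component != self.ALLOWED_COMPONENT:
            raise PermissionError(f"{component!r} may not read credentials")
        return self._blobs[pointer]


vault = CredentialVault()
pointer = vault.store(b"<ciphertext>")

# The job description carries only the pointer, never the credentials.
job = {"source": "s3", "credentials_ref": pointer}
```

In this sketch, a call such as `vault.resolve(pointer, component="partitioner")` raises `PermissionError`, while the connector component receives the ciphertext it still has to decrypt.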

These security measures provide robust protection for your access credentials, reducing the risk of unauthorized access while maintaining seamless and secure operational workflows.

Compliance and Certification

Our security measures aren't just claims – they're verified and certified. We hold a SOC2 Type 2 certification, reflecting our ongoing compliance with rigorous security controls, and we are HIPAA-compliant, ensuring the secure handling of protected health information.

For enterprises building AI systems, Unstructured Platform brings confidence in handling sensitive enterprise data, compliance with regulatory requirements, and protection of intellectual property while delivering top-notch data transformation. 

Streamlined Operations at Scale

In enterprise environments, network hiccups, API timeouts, and service disruptions are inevitable. Without proper error handling, these seemingly minor issues can cascade into major problems: stalled critical workflows, data gaps, and potential impact on business operations. Our connectors are built with this reality in mind, providing robust infrastructure that doesn't just process data, but actively guards against common failure modes. The real power of our connectors becomes apparent when operating at enterprise scale:

  • Batch Processing and Scheduling: Configure automated ingestion schedules that align with your needs, and write the results into your destinations in batches. 

  • Error Handling and Reliability: Production systems need to be resilient. Our connectors include sophisticated retry mechanisms and graceful error handling, ensuring your data pipelines remain robust even when facing temporary network issues or service disruptions.

  • Flexible Configuration Options: Configure your connectors and workflows in the Platform’s intuitive UI, or programmatically via the headless Platform API.
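
To make the retry idea concrete, here is a minimal sketch of retrying a transient failure with exponential backoff and jitter. It is a generic pattern, not the connectors' actual retry logic: `retry_with_backoff`, its parameters, and the choice of which exceptions count as transient are all assumptions for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient network errors with growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt, capped at max_delay, plus random jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the function can be exercised in tests without real waiting. Errors other than `ConnectionError` and `TimeoutError` propagate immediately, which keeps permanent failures such as bad credentials from being retried.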

Unstructured - The Foundation For Production AI

Regardless of the type of AI application you are building, success hinges on the quality and relevance of the data available for retrieval. If critical context is not stored and available to a RAG application, no retrieval strategy will compensate.

Good RAG starts with well-prepared data, and the Unstructured Platform simplifies this critical first step. Our connectors ensure that regardless of where your data originates or where it needs to go, it will arrive at the destination in a standardized format optimized for RAG use and true to original content.

Unstructured’s connector ecosystem continues to grow rapidly. With new connectors being added regularly to both our API and Platform, you and your team can spend less time wrestling with data preprocessing and more time building impactful AI solutions. After all, isn't that where your expertise should be focused?

Get Started with Unstructured Platform

We can’t wait for you to get going with the Unstructured Platform! Sign up today and try the Platform for free, or book a session with one of our engineers to discuss how we can optimize Unstructured for your use case.
