How to Process S3 Data to Milvus Using the Unstructured Platform

Unstructured

This article covers three components for building Retrieval Augmented Generation (RAG) systems: Amazon S3 for scalable object storage, Milvus for vector similarity search, and the Unstructured Platform for preprocessing unstructured data into a structured format optimized for RAG. Combined, these technologies let developers build applications that extract insights from large volumes of unstructured data.

With the Unstructured Platform, you can move data from Amazon S3 into Milvus and transform it along the way. Designed as an enterprise-grade ETL solution, the platform ingests raw, unstructured data from sources like S3, cleans and structures it into AI-ready JSON, and loads it into databases such as Milvus. For a step-by-step guide, see our S3 Integration Documentation and our Milvus Setup Guide. Keep reading for more details about S3, Milvus, and the Unstructured Platform.

What is Amazon S3? What is it used for?

Amazon S3 (Simple Storage Service) is an object storage service from Amazon Web Services (AWS). It stores and retrieves data from anywhere on the web, supporting use cases like data lakes, cloud-native applications, and big data analytics.

S3 stores objects (data and metadata) in buckets. Objects can be up to 5 terabytes, with no limit on total data volume or object count. This makes S3 suitable for storing large volumes of unstructured data for AI applications.

Data Storage and Access Management

S3 offers flexible data storage with various storage classes, each with different durability, availability, and performance characteristics. This allows cost optimization based on specific needs.

Access management features include bucket policies, IAM policies, and Access Control Lists (ACLs). These control data access and actions, ensuring security and compliance with organizational policies.
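
As an illustration of the bucket policies mentioned above, here is a minimal sketch that builds a read-only policy document in Python. The bucket name and role ARN are hypothetical placeholders, and a real policy would likely need extra conditions (for example, restricting access by prefix).

```python
import json


def read_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Build a bucket policy that lets one IAM principal list and read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                # GetObject is an object-level action, so it targets keys in the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = read_only_bucket_policy(
    "example-raw-docs",  # hypothetical bucket name
    "arn:aws:iam::123456789012:role/example-ingest-role",  # hypothetical role ARN
)
print(json.dumps(policy, indent=2))
```

Splitting the two statements keeps the bucket-level and object-level permissions explicit, which makes it easier to tighten one without touching the other.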

Data Ingestion and Processing

Data is typically uploaded to S3 through the AWS CLI, the S3 API, or the AWS SDKs. Tools such as the Unstructured Ingest CLI or Python library then process that data and convert it into structured formats for analysis or ingestion into systems like Milvus.

S3 integrates with other AWS services and third-party tools for data processing pipelines. AWS Lambda can trigger processing tasks when new objects are added to S3 buckets, while AWS Glue performs ETL operations on data.
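
The Lambda trigger described above can be sketched as a small handler that reads the bucket and key from each S3 event record. This is a simplified example: the sample event contains only the fields the handler reads, the bucket and key are made up, and the actual processing step is left out.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_new_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for each object-created record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Lambda entry point: collect new objects; downstream processing is omitted."""
    new_objects = extract_new_objects(event)
    return {"queued": [f"s3://{bucket}/{key}" for bucket, key in new_objects]}


# Trimmed-down sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-raw-docs"},
                "object": {"key": "reports/Q1+summary.pdf"},
            },
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```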

Security and Compliance

S3 supports HTTPS (TLS) to encrypt data in transit. It integrates with AWS Identity and Access Management (IAM) for user authentication and AWS Security Token Service (STS) for temporary access tokens.

For data at rest, S3 offers server-side encryption options: Amazon S3-managed keys (SSE-S3), AWS Key Management Service keys (SSE-KMS), or customer-provided keys (SSE-C). Together, these options cover a range of security and compliance requirements.
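
To make the three encryption options concrete, the sketch below maps each one to the extra keyword arguments that an upload through boto3's `put_object` would carry. The helper name `encryption_args` is an illustrative assumption, and the block only builds the arguments; it does not call AWS.

```python
from typing import Optional


def encryption_args(
    mode: str,
    kms_key_id: Optional[str] = None,
    customer_key: Optional[bytes] = None,
) -> dict:
    """Return extra put_object keyword arguments for the chosen encryption mode."""
    if mode == "SSE-S3":
        # S3 manages the keys; AES256 selects S3-managed encryption.
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            # Without a key ID, S3 falls back to the AWS managed KMS key.
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if mode == "SSE-C":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte (256-bit) customer key")
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    raise ValueError(f"unknown encryption mode: {mode}")


print(encryption_args("SSE-S3"))
print(encryption_args("SSE-KMS", kms_key_id="alias/example-key"))  # hypothetical alias

# Assumed usage with boto3 (not executed here):
#   s3.put_object(Bucket="example-raw-docs", Key="a.pdf", Body=data,
#                 **encryption_args("SSE-S3"))
```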

Amazon S3's scalability, durability, and security features make it a solid foundation for data processing workflows. Paired with the Unstructured Platform, which processes and converts unstructured data into structured formats for Milvus ingestion, it lets organizations put their stored documents to work in AI applications.

What is Milvus? What is it used for?

Milvus is an open-source vector database designed for managing and searching large-scale vector data. It processes embedding vectors, which are high-dimensional numerical representations of unstructured data like text, images, audio, and video. Milvus handles billions of vectors efficiently, making it suitable for AI applications that require fast similarity search and feature extraction.
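
To show what similarity search over embedding vectors means, here is a toy brute-force version in plain Python. It is not how Milvus works internally; it compares the query against every stored vector, which is the exhaustive baseline that a vector database is built to avoid at scale. The three-dimensional vectors and document IDs are made up for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Score every stored vector against the query and return the k best IDs."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Tiny made-up "embeddings"; real ones have hundreds or thousands of dimensions.
vectors = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "poem": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], vectors)
print(nearest)
```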

The distributed architecture of Milvus separates storage and compute, enabling horizontal scaling and handling of numerous search queries on billions of vectors with minimal performance impact. Its microservice design allows independent scaling of specific functions. Milvus offers multiple ANN (Approximate Nearest Neighbor) index types, including HNSW, IVF_FLAT, and IVF_SQ8, along with GPU acceleration for fast vector retrieval across various use cases.
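
The index types named above each take their own build and search parameters. The sketch below collects them in the dictionary shape that Milvus client libraries generally expect for index creation and search. The numeric values are illustrative starting points rather than tuned recommendations, and the helper names are assumptions made for this example.

```python
# Build-time and search-time parameters per index type (illustrative values).
INDEX_DEFAULTS = {
    # HNSW: graph-based; M is the graph degree, ef/efConstruction control search breadth.
    "HNSW": {"build": {"M": 16, "efConstruction": 200}, "search": {"ef": 64}},
    # IVF_FLAT: clusters vectors into nlist buckets and probes nprobe of them per query.
    "IVF_FLAT": {"build": {"nlist": 1024}, "search": {"nprobe": 16}},
    # IVF_SQ8: same clustering as IVF_FLAT, with vectors compressed to 8-bit values.
    "IVF_SQ8": {"build": {"nlist": 1024}, "search": {"nprobe": 16}},
}


def index_params(index_type: str, metric_type: str = "COSINE") -> dict:
    """Return an index definition for the given index type."""
    defaults = INDEX_DEFAULTS[index_type]
    return {
        "index_type": index_type,
        "metric_type": metric_type,
        "params": dict(defaults["build"]),
    }


def search_params(index_type: str, metric_type: str = "COSINE") -> dict:
    """Return matching search-time parameters for the given index type."""
    return {
        "metric_type": metric_type,
        "params": dict(INDEX_DEFAULTS[index_type]["search"]),
    }


print(index_params("HNSW"))
print(search_params("IVF_FLAT", metric_type="L2"))
```

As a rule of thumb, raising `ef` or `nprobe` trades query speed for recall, so these are usually the first values to adjust when results look incomplete.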

Key Features and Benefits

Scalability and Performance: Milvus scales to billions of vectors while maintaining sub-second query latency through its cloud-native, distributed architecture.
Flexible Deployment: Milvus supports Standalone, Cluster, and Embedded deployment modes with a unified API.
Hybrid Search: Combines vector similarity search with structured filtering. Milvus allows filtering on scalar data types like integers and floats.
Integration-friendly: Milvus integrates with machine learning frameworks, analytics tools, and custom applications via SDKs, RESTful APIs, and plugins.
Cloud and Platform Agnostic: Deployable on public or private cloud platforms and on-premises using Docker or Kubernetes.
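
For the hybrid search feature listed above, the scalar filter is written as a boolean expression string that accompanies the vector query. The following sketch builds such a string from simple conditions. The field names are hypothetical, the helper `build_filter` is an assumption for illustration, and it covers only basic comparisons joined with `and`.

```python
import json
from typing import Any, List, Tuple

ALLOWED_OPS = {"==", "!=", ">", ">=", "<", "<=", "in"}


def build_filter(conditions: List[Tuple[str, str, Any]]) -> str:
    """Join (field, operator, value) conditions into one boolean expression."""
    parts = []
    for field, op, value in conditions:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        # json.dumps double-quotes strings and brackets lists, which suits
        # the literal syntax used in simple filter expressions.
        parts.append(f"{field} {op} {json.dumps(value)}")
    return " and ".join(parts)


expr = build_filter(
    [
        ("filetype", "==", "application/pdf"),  # hypothetical scalar field
        ("page_number", ">=", 3),  # hypothetical scalar field
    ]
)
print(expr)
```

The resulting string would then be passed as the filter argument of a search call, so that only entities matching the scalar conditions are ranked by vector similarity.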

Real-world Applications

Milvus enables businesses to extract insights from unstructured data:

Recommendation Systems: Creates personalized product or content recommendations based on user preferences and behavior similarities.
Image and Video Retrieval: Enables similarity-based image and video search for visual product search and content discovery.
Natural Language Processing: Improves text similarity, document clustering, and semantic search applications.
Biometric Identification: Accelerates facial recognition, voiceprint authentication, and fingerprint matching for access control.
Anomaly Detection: Identifies rare events or outliers in large datasets for fraud detection and predictive maintenance.

As organizations accumulate unstructured data, Milvus plays a central role in making that data searchable. Combined with a data processing solution that preprocesses unstructured data into vector embeddings, such as Unstructured.io, which streamlines the flow from data source to Milvus, businesses can access insights that were previously out of reach.

Unstructured Platform

The Unstructured Platform streamlines the preparation of unstructured data for AI applications. It provides a no-code, pay-as-you-go interface for transforming unstructured data into a format optimized for Retrieval Augmented Generation (RAG). The platform is useful for businesses preprocessing large volumes of unstructured data from various sources and loading results into vector databases.

The platform's workflow includes:

Connect: Unstructured offers source connectors to ingest data from existing locations like AWS S3, Google Drive, or Azure.
Route: The platform selects appropriate processing strategies for transforming documents into Unstructured's canonical JSON schema. Options include 'Fast' for extractable text, 'Hi Res' for PDFs and tables, and 'Auto' for automatic decision-making based on document type.
Transform: This step involves processing and partitioning documents into standardized elements. Unstructured converts documents into a JSON schema with over 20 elements, including Header, Footer, Title, NarrativeText, Table, and Image, along with extensive metadata.
Chunk: Various chunking strategies optimize data for specific use cases, such as Basic, By Title, By Page, and By Similarity.
Enrich: Unstructured can extract text from images using OCR and generate summaries for textual content, enhancing data value for downstream applications.
Embed: The platform integrates with embedding providers like OpenAI and Cohere via API to generate vector representations (embeddings) of the processed text using transformer-based models.
Persist: Unstructured offers destination connectors to store transformed data in vector databases such as Milvus, Pinecone, Weaviate, and Elasticsearch.
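
The Transform and Chunk steps above can be sketched with a few hand-written elements and a toy "by title" chunker. The elements below are trimmed to a handful of fields for readability, and the function is a simplified illustration of the idea (start a new chunk at each Title), not the platform's actual implementation, which also handles concerns such as maximum chunk size.

```python
from typing import Dict, List

# Hand-written sample elements, trimmed to a few fields.
elements = [
    {"type": "Title", "text": "Overview", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "S3 stores objects in buckets.", "metadata": {"page_number": 1}},
    {"type": "Title", "text": "Security", "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "text": "Data is encrypted at rest.", "metadata": {"page_number": 2}},
    {"type": "Table", "text": "Mode | Key owner", "metadata": {"page_number": 2}},
]


def chunk_by_title(items: List[Dict]) -> List[str]:
    """Group elements into chunks, starting a new chunk at every Title."""
    groups: List[List[Dict]] = []
    current: List[Dict] = []
    for element in items:
        if element["type"] == "Title" and current:
            groups.append(current)
            current = []
        current.append(element)
    if current:
        groups.append(current)
    # Each chunk is the text of its elements joined by newlines.
    return ["\n".join(el["text"] for el in group) for group in groups]


chunks = chunk_by_title(elements)
for chunk in chunks:
    print(repr(chunk))
```

Keeping each section's heading together with its body text gives the embedding model more context than fixed-size splitting would.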

Key Concepts

Source Connector: Ingests data from various sources and handles data extraction, facilitating connection to existing data locations.
Destination Connector: Specifies where transformed data should be written, supporting vector databases and data storage solutions for RAG systems.
Workflow: Connects sources to destinations with options for chunking, embedding, and scheduling.
Jobs: Allows users to monitor data transformation task progress.
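
One way to picture how these concepts relate is as a small data model. The classes and field names below are hypothetical and exist only to illustrate the relationships; they are not the platform's API or configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceConnector:
    """Where documents are read from (hypothetical model)."""
    name: str
    remote_url: str


@dataclass
class DestinationConnector:
    """Where processed output is written (hypothetical model)."""
    name: str
    uri: str
    collection: str


@dataclass
class Workflow:
    """Ties a source to a destination with processing options."""
    source: SourceConnector
    destination: DestinationConnector
    chunking_strategy: str = "By Title"
    embedding_provider: Optional[str] = None
    schedule: Optional[str] = None

    def describe(self) -> str:
        embed = self.embedding_provider or "none"
        return (
            f"{self.source.name} -> chunk({self.chunking_strategy}) "
            f"-> embed({embed}) -> {self.destination.name}"
        )


workflow = Workflow(
    source=SourceConnector("s3-raw-docs", "s3://example-raw-docs/reports/"),
    destination=DestinationConnector("milvus-docs", "http://localhost:19530", "documents"),
    embedding_provider="OpenAI",
    schedule="daily",
)
print(workflow.describe())
```

In this picture, a job would be a single run of a workflow, which is what a user monitors for progress.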

The Unstructured Platform is SOC 2 Type 2 compliant and supports over 50 languages. It processes over 25 different file types, ensuring compatibility with a wide range of unstructured data sources.

By using the Unstructured Platform, organizations can process unstructured data from various sources, transform it into a structured format, and load it into vector databases for advanced AI applications. This workflow helps businesses extract insights from their data in an increasingly data-driven environment.

At Unstructured, we're committed to helping you streamline your data preprocessing workflows and maximize the value of your unstructured data. Our platform is designed to make the process of preparing data for AI applications as seamless and efficient as possible. If you're ready to take the next step in your data journey, get started with Unstructured today and experience the difference for yourself.
