Ultra-performant data transformation framework for AI, with a core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready from day 0.
⭐ Drop a star to help us grow!
Deutsch | English | Español | français | 日本語 | 한국어 | Português | Русский | 中文
CocoIndex makes it effortless to transform data with AI and keep source data and target in sync. Whether you're building a vector index for RAG, creating knowledge graphs, or performing any custom data transformation, CocoIndex goes beyond SQL.
Just declare the transformation as a dataflow in ~100 lines of Python:
# import
data['content'] = flow_builder.add_source(...)

# transform
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)
CocoIndex follows the dataflow programming model. Each transformation creates a new field based solely on its input fields, with no hidden state and no value mutation. All data before and after each transformation is observable, with lineage out of the box.
In particular, developers don't explicitly mutate data by creating, updating, or deleting it. They only need to define the transformation (formula) for a set of source data.
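As a rough, library-independent illustration of this model, the sketch below derives each new field from existing fields and never overwrites an input, so every intermediate row stays observable. This is not CocoIndex's API: the `derive` helper and the field names are hypothetical and exist only for this example.

```python
from types import MappingProxyType
from typing import Any, Callable, Mapping


def derive(row: Mapping[str, Any], name: str, fn: Callable[..., Any], *inputs: str) -> Mapping[str, Any]:
    """Return a new read-only row with `name` computed from the `inputs` fields."""
    if name in row:
        raise ValueError(f"field {name!r} already exists; fields are never overwritten")
    value = fn(*(row[field] for field in inputs))
    # Build a fresh row instead of mutating the input row.
    return MappingProxyType({**row, name: value})


source = MappingProxyType({"content": "Hello dataflow world"})
with_words = derive(source, "words", str.split, "content")
with_count = derive(with_words, "word_count", len, "words")

# Each intermediate row is still available for inspection.
print(dict(source))
print(dict(with_count))
```

Because every step returns a new row, the value of any field can be traced back through the fields it was computed from.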
Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change.
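To show why a shared interface makes the switch a one-line change, here is a minimal generic sketch. The `Target` protocol, both target classes, and `run_pipeline` are hypothetical names for this example only; they are not CocoIndex classes.

```python
import json
from typing import Any, Iterable, Protocol


class Target(Protocol):
    """Hypothetical common interface that every export target implements."""

    def write(self, rows: Iterable[dict[str, Any]]) -> None: ...


class InMemoryTarget:
    """Keeps exported rows in a Python list."""

    def __init__(self) -> None:
        self.rows: list[dict[str, Any]] = []

    def write(self, rows: Iterable[dict[str, Any]]) -> None:
        self.rows.extend(rows)


class JsonLinesTarget:
    """Serializes each exported row to one JSON line."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def write(self, rows: Iterable[dict[str, Any]]) -> None:
        self.lines.extend(json.dumps(row, sort_keys=True) for row in rows)


def run_pipeline(texts: list[str], target: Target) -> None:
    """Transform each text, then hand the result to whichever target was chosen."""
    target.write({"text": t, "length": len(t)} for t in texts)


target = InMemoryTarget()  # switching to JsonLinesTarget() is a one-line change
run_pipeline(["alpha", "be"], target)
print(target.rows)
```

The pipeline code only depends on the `write` method, so nothing else has to change when the target is swapped.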
CocoIndex keeps source data and target in sync effortlessly.
It has out-of-the-box support for incremental indexing:
- Minimal recomputation on source or logic changes.
- (Re-)processing of only the necessary portions, reusing the cache when possible.
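The idea behind incremental processing can be sketched with a fingerprint per source item. The example below is a simplified illustration under stated assumptions, not how CocoIndex is implemented: `IncrementalRunner`, `logic_version`, and the in-memory dictionaries are hypothetical stand-ins for the state a real system would persist.

```python
import hashlib
from typing import Callable


class IncrementalRunner:
    """Recompute an output only when its source content or the logic version changes."""

    def __init__(self, transform: Callable[[str], str], logic_version: str) -> None:
        self.transform = transform
        self.logic_version = logic_version
        self.fingerprints: dict[str, str] = {}  # source key -> fingerprint of last processed input
        self.outputs: dict[str, str] = {}       # source key -> cached output

    def _fingerprint(self, content: str) -> str:
        # The fingerprint covers both the logic version and the source content.
        payload = f"{self.logic_version}\0{content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def sync(self, sources: dict[str, str]) -> list[str]:
        """Bring outputs in line with `sources`; return the keys that were recomputed."""
        recomputed = []
        for key, content in sources.items():
            fp = self._fingerprint(content)
            if self.fingerprints.get(key) != fp:
                self.outputs[key] = self.transform(content)
                self.fingerprints[key] = fp
                recomputed.append(key)
        # Drop outputs whose source no longer exists.
        for key in list(self.outputs):
            if key not in sources:
                del self.outputs[key]
                del self.fingerprints[key]
        return recomputed


runner = IncrementalRunner(str.upper, logic_version="v1")
first = runner.sync({"a.md": "alpha", "b.md": "beta"})
second = runner.sync({"a.md": "alpha", "b.md": "beta v2"})
print(first, second)
```

On the second call only `b.md` is reprocessed, and the cached output for `a.md` is reused. Changing `logic_version` would change every fingerprint and trigger a full recomputation.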
If you're new to CocoIndex, we recommend checking out:
- 📖 Documentation
- ⚡ Quick Start Guide
- 🎬 Quick Start Video Tutorial
- Install the CocoIndex Python library:
pip install -U cocoindex
- Install Postgres if you don't have one. CocoIndex uses it for incremental processing.
Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:
import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
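It defines an index flow that reads files from a local directory, splits each document into chunks, embeds each chunk, and exports the collected rows to a Postgres vector index.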
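The example flow indexes embeddings with the cosine similarity metric. For readers unfamiliar with it, here is a small standalone sketch of how that metric ranks rows against a query vector. It is plain Python for illustration only: in practice the ranking runs inside the target database, and the `top_k` helper and the toy 2-dimensional embeddings are assumptions made for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[dict], k: int = 2) -> list[dict]:
    """Rank collected rows by similarity of their `embedding` field to the query."""
    return sorted(rows, key=lambda r: cosine_similarity(query, r["embedding"]), reverse=True)[:k]


# Toy 2-dimensional embeddings; real sentence embeddings have hundreds of dimensions.
rows = [
    {"filename": "a.md", "text": "cats", "embedding": [1.0, 0.0]},
    {"filename": "b.md", "text": "dogs", "embedding": [0.8, 0.6]},
    {"filename": "c.md", "text": "cars", "embedding": [0.0, 1.0]},
]
best = top_k([1.0, 0.1], rows)
print([r["filename"] for r in best])
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.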
Example | Description |
---|---|
Text Embedding | Index text documents with embeddings for semantic search |
Code Embedding | Index code embeddings for semantic search |
PDF Embedding | Parse PDF and index text embeddings for semantic search |
Manuals LLM Extraction | Extract structured information from a manual using LLM |
Amazon S3 Embedding | Index text documents from Amazon S3 |
Azure Blob Storage Embedding | Index text documents from Azure Blob Storage |
Google Drive Text Embedding | Index text documents from Google Drive |
Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph |
Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search |
FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup |
Product Recommendation | Build real-time product recommendations with LLM and graph database |
Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and serve live-updating semantic search via FastAPI with a React frontend |
Face Recognition | Recognize faces in images and build embedding index |
Paper Metadata | Index papers in PDF files, and build metadata tables for each paper |
Multi Format Indexing | Build visual document index from PDFs and images with ColPali for semantic search |
Custom Output Files | Convert markdown files to HTML files and save them to a local directory, using CocoIndex Custom Targets |
Patient intake form extraction | Use LLM to extract structured data from patient intake forms with different formats |
More examples are coming, so stay tuned 👀!
For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.
We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.
Welcome with a huge coconut hug 🥥🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.
Join our community here:
- 🌟 Star us on GitHub
- 👋 Join our Discord community
- ▶️ Subscribe to our YouTube channel
- 📜 Read our blog posts
We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our GitHub repo to stay tuned and help us grow.
CocoIndex is Apache 2.0 licensed.