DocArray wrap ANN libraries #17
Jina's DocArray is a data structure that represents a list of documents with additional metadata. DocArray is designed to be compatible with popular ANN libraries like FAISS, Annoy, and Hnswlib. To wrap an ANN library around Jina's DocArray, you can follow these general steps:

1. Convert DocArray to a compatible format: Most ANN libraries require a specific format for the data, like a numpy array or a list of lists. You can use Jina's get_all_sparse_vectors method to convert the DocArray to a compatible format. For example:

    import numpy as np

    doc_array = DocArray([{'text': 'hello world'}, {'text': 'foo bar'}])
    # Convert to numpy array (FAISS expects float32).
    # get_all_sparse_vectors is the method named in this comment; check it against your DocArray version.
    data = np.stack(doc_array.get_all_sparse_vectors()).astype('float32')

2. Create an index: Next, you need to create an index using the ANN library. For example, you can use FAISS to create an index:

    import faiss

    # Create an index and add the vectors to it
    index = faiss.IndexFlatL2(data.shape[1])
    index.add(data)

3. Query the index: Finally, you can use the ANN library to query the index with a new document. For example, you can use FAISS to find the nearest neighbors of a new document:

    # Query the index
    query_vec = np.random.rand(1, data.shape[1]).astype('float32')
    distances, indices = index.search(query_vec, 2)
    # Get the DocArray for the nearest neighbors
    nearest_neighbors = doc_array[indices[0]]

By following these steps, you should be able to wrap an ANN library around Jina's DocArray and use it to perform nearest neighbor search or other ANN tasks. Of course, you will need to add more code to handle things like data preprocessing, index optimization, and query filtering, but this should give you a good starting point.
Hey everyone! DocArray v2 will have a concept called Document Index. This is an abstraction that lets a user store their Documents (on disk or in a database), and retrieve them using ANN search. As such, there can be multiple Document Indexes backed by different backends: Elastic, Qdrant, Weaviate, ..., but all following the same basic API.

The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib (feat: hnswlib document index, docarray/docarray#1124), but there is space to create similar backends using other libraries: Annoy, Faiss, ... The goal is to provide user choice.

If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate and Elastic covered, but Milvus, Redis and some others could also be interesting.

You can find a design doc for Document Index here: https://lightning-scent-57a.notion.site/Document-Stores-v2-design-doc-f11d6fe6ecee43f49ef88e0f1bf80b7f

If you have any questions please reach out!
Please mind that "Mentors | @johannes" does not refer to the GitHub user @johannes, but probably @JohannesMessner. I can encourage you to do great things, but not help with the project. Have fun!
Ah, so we meet again @johannes! Snatching a common user name is a blessing and a curse I see. Thanks for the encouragement, and who knows, maybe if we keep randomly tagging you here and there one day you will be compelled to contribute as well ;)
Hello! Thank you for sharing the details of your project and providing a clear explanation of what you are trying to achieve. It sounds like you are working on creating a flexible and scalable document indexing solution using different ANN libraries and vector databases. Providing user choice and flexibility is always a great approach when it comes to open-source projects. I appreciate that you have shared the design doc as well, which will help potential contributors understand the project's scope and requirements. If I have any questions or would like to contribute to the project, I'll make sure to reach out. Thanks again for sharing this project with us. Best regards,
Hello @JohannesMessner, I am interested in contributing to GSoC 2023, and my prior experience with machine learning algorithms and ANN search in Python drew me to this project. I would love to work on it under your mentorship. Below is my idea and implementation plan, based on my experience with various ANN libraries and the Jina architecture.
PROJECT IDEA: DocArray wrap ANN libraries
Project Description
Importance of this project
Expected Results
Project breakdown
Required Technicalities
Additional Area of development
In the project idea, you hinted at implementing a backend for a vector database as part of this project. Jina has already done this with Qdrant, but Milvus could be a step ahead because of its scalability and efficiency. I propose to integrate the pymilvus library along with the ANN search to provide a visual representation of the idea and increase the overall impact of the project. We can continue the discussion after your feedback.
Hi @JohannesMessner @philipvollet I am a Master's student studying AI at the University of Hamburg, Germany. I have knowledge of topics like statistical ML, NLP, and computer vision, and I have worked on multiple projects in Python, PyTorch, and Keras. Apart from the data science stack, I also have experience with Java, PHP, and Swift. I came across this topic and became interested in working on it. I am a bit late to apply, but I would like to contribute and gain experience from this project. Could you please help me get started with the project, and could a call be set up for a discussion?
Hi @Anirbanbhk88 @ranjan2829 @arijitghosal03 @Anirbanbhk88 Thanks for your interest in contributing to the project. The application period has just started. To ensure fairness, we do not hold 1:1 calls during the application season, from March 20 to April 4. 📅 We do have a webinar, though: mark your calendars for the GSoC x Jina AI webinar on March 23rd at 2 pm (CET). This is an excellent opportunity to learn more about the projects and ask about the requirements and expectations. Our mentors will provide an in-depth overview of the projects and answer your questions, so please don't hesitate to seek clarification on any aspect of the project. Is there anything specific you would like to learn from the webinar? Do you have any questions about the DocArray wrap ANN libraries project that you would like to see clarified during the Q&A session? Let me know, and I'll be happy to help! Looking forward to seeing you at the webinar, and thank you for your interest in the Jina AI community! 😊
Project idea 2: DocArray wrap ANN libraries
Project Description
In DocArray, we have been concentrating on developing production-ready Vector DBs for large-scale searches. However, there are many ANN libraries without scalability layers that can be integrated into DocArray, making it accessible to academia and production teams with small-to-medium amounts of data, without the need for external services.
DocArray v2 will have a concept called Document Index. This is an abstraction that lets a user store their Documents (on disk or in a database), and retrieve them using ANN search. As such, there can be multiple Document Indexes backed by different backends: Elastic, Qdrant, Weaviate, ..., but all following the same basic API.
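To make the "same basic API, different backends" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the names `Doc`, `DocumentIndex`, `index`, `find`, and `InMemoryExactIndex` are hypothetical and are not taken from DocArray or the design doc, and the toy backend uses exact brute-force search rather than a real ANN algorithm.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import math
from typing import List, Sequence, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a DocArray Document."""
    id: str
    text: str
    embedding: Sequence[float]


class DocumentIndex(ABC):
    """Hypothetical common API that every backend would implement."""

    @abstractmethod
    def index(self, docs: Sequence[Doc]) -> None:
        """Store the given documents."""

    @abstractmethod
    def find(self, query: Sequence[float], limit: int = 10) -> List[Tuple[Doc, float]]:
        """Return up to `limit` (document, distance) pairs, nearest first."""


class InMemoryExactIndex(DocumentIndex):
    """Toy backend: exact (brute-force) Euclidean search over a Python list."""

    def __init__(self) -> None:
        self._docs: List[Doc] = []

    def index(self, docs: Sequence[Doc]) -> None:
        self._docs.extend(docs)

    def find(self, query: Sequence[float], limit: int = 10) -> List[Tuple[Doc, float]]:
        scored = [(doc, math.dist(query, doc.embedding)) for doc in self._docs]
        scored.sort(key=lambda pair: pair[1])
        return scored[:limit]


# Calling code depends only on the abstract API, so the backend can be swapped.
store: DocumentIndex = InMemoryExactIndex()
store.index([
    Doc("a", "hello world", [0.0, 0.0]),
    Doc("b", "foo bar", [1.0, 1.0]),
    Doc("c", "baz", [5.0, 5.0]),
])
hits = store.find([0.9, 1.2], limit=2)
hit_ids = [doc.id for doc, _ in hits]
print(hit_ids)
```

Because the caller only sees `index` and `find`, replacing the brute-force backend with one built on an ANN library or a vector database would not change user code, which is the point of the abstraction described above.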
The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib (feat: hnswlib document index, docarray/docarray#1124), but there is space to create similar backends using other libraries: Annoy, Faiss, ... The goal is to provide user choice.
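One practical detail when wrapping an ANN library is that such libraries typically store only vectors under integer labels, so the wrapper has to keep the documents themselves in a side store and map labels back to them. The sketch below illustrates that bookkeeping under stated assumptions: `ToyVectorLibrary` is a hypothetical stand-in written for this example (it is not the API of HNSWLib, Annoy, or Faiss, and it searches exhaustively), and `LibraryBackedIndex` is an illustrative name, not a DocArray class.

```python
import math
from typing import Dict, List, Sequence, Tuple


class ToyVectorLibrary:
    """Hypothetical stand-in for an ANN library: vectors keyed by integer labels."""

    def __init__(self) -> None:
        self._vectors: Dict[int, Sequence[float]] = {}

    def add_items(self, vectors: Sequence[Sequence[float]], labels: Sequence[int]) -> None:
        for vec, label in zip(vectors, labels):
            self._vectors[label] = vec

    def knn_query(self, vector: Sequence[float], k: int) -> Tuple[List[int], List[float]]:
        # Exhaustive search; a real library would use an approximate structure.
        ranked = sorted(
            (math.dist(vector, vec), label) for label, vec in self._vectors.items()
        )[:k]
        return [label for _, label in ranked], [dist for dist, _ in ranked]


class LibraryBackedIndex:
    """Wrapper that pairs a vector-only library with a label -> document store."""

    def __init__(self, library: ToyVectorLibrary) -> None:
        self._library = library
        self._payloads: Dict[int, dict] = {}
        self._next_label = 0

    def index(self, docs: Sequence[dict]) -> None:
        # Assign fresh integer labels so later batches never collide with earlier ones.
        labels = list(range(self._next_label, self._next_label + len(docs)))
        self._next_label += len(docs)
        self._library.add_items([doc["embedding"] for doc in docs], labels)
        for label, doc in zip(labels, docs):
            self._payloads[label] = doc

    def find(self, query: Sequence[float], limit: int = 10) -> List[Tuple[dict, float]]:
        labels, distances = self._library.knn_query(query, k=limit)
        return [(self._payloads[label], dist) for label, dist in zip(labels, distances)]


wrapped = LibraryBackedIndex(ToyVectorLibrary())
wrapped.index([
    {"text": "hello world", "embedding": [0.0, 1.0]},
    {"text": "foo bar", "embedding": [1.0, 0.0]},
])
wrapped.index([{"text": "baz", "embedding": [0.9, 0.1]}])  # second batch
results = wrapped.find([1.0, 0.0], limit=2)
found_texts = [doc["text"] for doc, _ in results]
print(found_texts)
```

A backend for a different library would keep the same wrapper shape and replace only the two calls into the library, which is one way the "user choice" goal could be met without changing the public API.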
If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate, and Elastic covered, but Milvus, Redis, and some others could also be interesting. You can find a design doc for Document Index here: https://lightning-scent-57a.notion.site/Document-Stores-v2-design-doc-f11d6fe6ecee43f49ef88e0f1bf80b7f
Expected outcomes