This project implements a machine learning system that allows users to retrieve relevant images based on natural language text queries. The system aligns text and image feature embeddings using a trained neural network and retrieves the closest-matching images based on cosine similarity.
Given a text query (e.g., "a cat sitting on a couch"), the system:
- Encodes the query using a pretrained text embedding model (MiniLM-L6-v2).
- Uses a trained feedforward neural network to map the text embedding into image embedding space.
- Compares the predicted image embedding against a database of image embeddings (extracted with DINOv2).
- Returns the most semantically relevant images, ranked by cosine similarity (see the sketch after this list).
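A minimal sketch of this retrieval flow, assuming a trained `mapper` network and a precomputed `image_embeddings` matrix (both hypothetical names; the actual checkpoint and embedding database come from the training scripts below):

```python
import torch
from sentence_transformers import SentenceTransformer

# Pretrained text encoder (MiniLM-L6-v2 via sentence-transformers).
text_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(query: str, mapper: torch.nn.Module,
             image_embeddings: torch.Tensor, k: int = 5):
    """Return indices of the k images whose DINOv2 embeddings are closest
    to the mapped query embedding under cosine similarity.
    `mapper` and `image_embeddings` are hypothetical names for the trained
    network and the precomputed (num_images, dim) embedding matrix."""
    with torch.no_grad():
        text_emb = torch.from_numpy(text_encoder.encode(query)).float()
        pred = mapper(text_emb)  # map text embedding into image-embedding space
        sims = torch.nn.functional.cosine_similarity(
            pred.unsqueeze(0), image_embeddings  # (1, dim) vs (num_images, dim)
        )
    return sims.topk(k).indices.tolist()
```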
Key components (see the PyTorch sketch after this list):
- Text Embedding: MiniLM-L6-v2
- Image Embedding: DINOv2
- Neural Network: 5-layer feedforward network
- Loss Function: Cosine Similarity Loss
- Optimization: Adam with ReduceLROnPlateau
- Training Dataset: TextCaps (28k images with 140k captions)
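A rough PyTorch sketch of this setup; the layer and hidden sizes below are illustrative placeholders, since the real values come from the hyperparameter search:

```python
import torch
import torch.nn as nn

class TextToImageMapper(nn.Module):
    """5-layer feedforward network mapping a MiniLM text embedding to the
    DINOv2 image-embedding space. Dimensions are assumed, not confirmed."""
    def __init__(self, in_dim: int = 384, hidden: int = 1024, out_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = TextToImageMapper()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)

def cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Cosine similarity loss: 1 - cos(pred, target), averaged over the batch.
    return (1 - nn.functional.cosine_similarity(pred, target)).mean()
```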
Evaluation metrics:
- MRR (Mean Reciprocal Rank; sketched below)
- Precision / Recall / Jaccard Similarity (evaluated on both fine-grained and coarse-grained class sets)
📈 Achieved recall of 0.84 (coarse-grained), showing strong ability to retrieve semantically relevant images.
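MRR averages, over all queries, the reciprocal of the rank at which the first correct image appears. A minimal sketch, assuming one ground-truth image index per query (tensor names are hypothetical):

```python
import torch

def mean_reciprocal_rank(sims: torch.Tensor, targets: torch.Tensor) -> float:
    """sims: (num_queries, num_images) cosine similarities;
    targets: (num_queries,) index of each query's ground-truth image."""
    order = sims.argsort(dim=1, descending=True)  # image indices sorted by score
    # Position of the ground-truth image in each sorted row, as a 1-based rank.
    ranks = (order == targets.unsqueeze(1)).float().argmax(dim=1) + 1
    return (1.0 / ranks).mean().item()
```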
Usage:

Run the hyperparameter search to find the best layer sizes and training configuration:

`python hyperparameter_search.py`

Train using the best configuration:

`python train.py`

Evaluate retrieval quality with cosine similarity and MRR:

`python test.py`
Links:
- 📝 Final Report
- 📄 Project Proposal
- 📚 TextCaps Dataset
- 🧠 DINOv2
- 🔤 MiniLM
Contributors:
- Lucas Butler
- Boxi Chen
- Anthony Pecoraro
- Hayat White