This repository contains our team's solution for the Amazon ML Challenge 2025. The goal is to predict product prices from their catalog content and images. Our final approach is a text-only model: it does not use the images and instead relies on feature engineering and semantic text embeddings.
The repository is organized as follows:
├── dataset/ # Contains the raw and intermediate data files provided for the challenge.
├── output/ # Where the final submission file is saved.
├── src/ # All Python source code for preprocessing and model training.
├── README.md # You are here!
└── requirements.txt # A list of all necessary Python libraries.
Our methodology is a sequential pipeline that focuses on extracting the maximum value from the text data provided.
- Preprocessing & Manual Feature Engineering: We first run a script (`src/preprocess.py`) to clean the raw text data. The script also uses regular expressions to engineer two critical features:
  - `ipq` (Item Pack Quantity): the number of items in a pack (e.g., "pack of 12" becomes 12).
  - `normalized_size_oz`: the product weight or volume (e.g., 16 oz, 2 lbs, 750ml), converted to a single unit (ounces) so that sizes are comparable.
- Text Embedding: To capture the semantic meaning of the product descriptions (brand, quality, materials), we use the `thenlper/gte-large` model, loaded through the `sentence-transformers` library. It converts each product's `catalog_content` into a high-dimensional feature vector.
- Modeling: The final feature set (manual features + text embeddings) is used to train a LightGBM Regressor. To optimize for the SMAPE metric, which measures relative error, the model is trained to predict the logarithm of the price (`np.log1p(price)`), and the final predictions are converted back to prices with the inverse transform (`np.expm1`).
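The two regex-derived features from the preprocessing step might be extracted along the following lines. This is a minimal sketch, not the contents of `src/preprocess.py`: the function names, the patterns, the unit list, the default pack quantity of 1, and the choice to convert volumes to US fluid ounces are all assumptions made for illustration.

```python
import re

# Ounces per unit. Volumes are converted to US fluid ounces and share one
# column with weight ounces (an assumption of this sketch).
OUNCES_PER_UNIT = {
    "oz": 1.0,
    "lb": 16.0,
    "g": 1 / 28.3495,
    "kg": 1000 / 28.3495,
    "ml": 1 / 29.5735,
    "l": 1000 / 29.5735,
}
UNIT_ALIASES = {
    "floz": "oz", "ounce": "oz", "ounces": "oz",
    "lbs": "lb", "pound": "lb", "pounds": "lb",
}

# Matches "pack of 12" as well as "12-pack", "12 count", "12 ct".
PACK_RE = re.compile(
    r"\bpack\s+of\s+(\d+)\b|\b(\d+)\s*-?\s*(?:pack|count|ct)\b",
    re.IGNORECASE,
)
# Matches a number followed by a known unit, e.g. "16 oz", "2 lbs", "750ml".
SIZE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(fl\.?\s*oz|oz|ounces?|lbs?|pounds?|kg|g|ml|l)\b",
    re.IGNORECASE,
)


def extract_ipq(text):
    """Return the pack quantity found in the text, or 1 if none is stated."""
    match = PACK_RE.search(text)
    if not match:
        return 1
    return int(match.group(1) or match.group(2))


def extract_size_oz(text):
    """Return the first size found in the text, in ounces, or None."""
    match = SIZE_RE.search(text)
    if not match:
        return None
    value = float(match.group(1))
    unit = re.sub(r"[.\s]", "", match.group(2).lower())  # "fl. oz" -> "floz"
    unit = UNIT_ALIASES.get(unit, unit)
    return value * OUNCES_PER_UNIT[unit]
```

For example, `extract_ipq("Green tea, pack of 12")` gives 12 and `extract_size_oz("Rice 2 lbs")` gives 32.0. Only the first size in each text is used here; a fuller version would need a rule for descriptions that mention several sizes.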
The process is a simple two-step execution.
Step 1: Preprocess the Data
This script will create the train_clean.csv and test_clean.csv files in the main directory.

    python src/preprocess.py

Step 2: Train the Model and Generate Submission
This script will generate text embeddings, train the LightGBM model, run cross-validation, and create the final submission.csv file in the main directory.
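As a rough picture of the stages this script is described as running, the sketch below embeds the text, fits LightGBM on `log1p(price)`, and collects out-of-fold predictions over 5 folds. It is not the contents of `src/train_model.py`. The names `embed_texts`, `lightgbm_fit_predict`, `kfold_indices`, and `out_of_fold_prices` are hypothetical, the hyperparameters are placeholders, and the fold logic is written in plain Python so it can be read without any third-party library.

```python
import math
import random


def embed_texts(texts, model_name="thenlper/gte-large"):
    """Encode texts into dense vectors (needs `pip install sentence-transformers`)."""
    from sentence_transformers import SentenceTransformer  # heavy, so imported lazily

    model = SentenceTransformer(model_name)
    return model.encode(list(texts), batch_size=32, show_progress_bar=False)


def lightgbm_fit_predict(train_X, train_y_log, valid_X):
    """Fit LightGBM on log-prices and predict log-prices for the validation rows."""
    import numpy as np
    from lightgbm import LGBMRegressor

    # Placeholder hyperparameters, not the values the team used.
    model = LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(np.asarray(train_X), np.asarray(train_y_log))
    return model.predict(np.asarray(valid_X)).tolist()


def kfold_indices(n_rows, n_splits=5, seed=42):
    """Shuffle row indices and yield (train_idx, valid_idx) pairs."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_splits] for i in range(n_splits)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


def out_of_fold_prices(features, prices, fit_predict=lightgbm_fit_predict,
                       n_splits=5, seed=42):
    """Return one held-out price prediction per row, training on log1p(price)."""
    log_prices = [math.log1p(p) for p in prices]
    oof = [0.0] * len(prices)
    for train_idx, valid_idx in kfold_indices(len(prices), n_splits, seed):
        preds_log = fit_predict(
            [features[i] for i in train_idx],
            [log_prices[i] for i in train_idx],
            [features[i] for i in valid_idx],
        )
        for i, pred in zip(valid_idx, preds_log):
            oof[i] = math.expm1(pred)  # invert log1p to return to the price scale
    return oof
```

Passing the model as a `fit_predict` callable keeps the fold logic independent of LightGBM, so a trivial baseline (for instance, one that always predicts the mean log-price of the training fold) can be scored with the same splits.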

    python src/train_model.py

Our model achieves the following performance based on 5-fold cross-validation on the training data:
- Average Cross-Validation SMAPE: **~ 52.4**
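
For reference, SMAPE can be computed as in the sketch below. It assumes the common percentage form, with the mean of the absolute actual and predicted values in the denominator, which ranges from 0 to 200; the challenge's exact definition is not restated in this README, and the function name `smape` is our own.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        # Treat the 0/0 case (both values zero) as a perfect prediction.
        total += abs(p - a) / denom if denom else 0.0
    return 100.0 * total / len(actual)
```

The metric depends only on the ratio between prediction and truth: predicting 150 for a 100 item and 1500 for a 1000 item score the same. That is the usual motivation for training on log-prices, where equal ratios correspond to equal differences.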