Vasist10/amazon-ml-challenge


Smart Product Pricing Challenge - Amazon ML Challenge 2025

This repository contains our team's solution for the Amazon ML Challenge 2025. The goal is to predict product prices from their catalog content and images. Our final approach is text-only: it combines hand-engineered features with semantic text embeddings and does not use the product images.


📂 Folder Structure

The repository is organized as follows:

├── dataset/         # Contains the raw and intermediate data files provided for the challenge.
├── output/          # Where the final submission file is saved.
├── src/             # All Python source code for preprocessing and model training.
├── README.md        # You are here!
└── requirements.txt # A list of all necessary Python libraries.

💡 Our Approach

Our methodology is a three-stage sequential pipeline that relies only on the text data provided.

  1. Preprocessing & Manual Feature Engineering: We first run a script (src/preprocess.py) to clean the raw text data. This script also uses regular expressions to engineer two critical features:

    • ipq (Item Pack Quantity): Extracts the number of items in a pack (e.g., "pack of 12" becomes 12).
    • normalized_size_oz: Extracts product weight/volume (e.g., 16 oz, 2 lbs, 750ml) and converts them all to a standard unit (ounces) for fair comparison.
  2. Text Embedding: To capture the semantic content of the product descriptions (brand, quality, materials), we use the thenlper/gte-large model, loaded through the sentence-transformers library. This converts each product's catalog_content into a high-dimensional feature vector.

  3. Modeling: The final feature set (manual features + text embeddings) is used to train a LightGBM Regressor. Because SMAPE measures relative error, the model is trained to predict the log-transformed price (np.log1p(price)), and the final predictions are converted back to prices by inverting the transform (the inverse of np.log1p is np.expm1).
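The two manual features from step 1 can be illustrated with a minimal sketch. The regular expressions, the helper names (`extract_ipq`, `extract_size_oz`), the default pack quantity of 1, and the conversion table below are all assumptions made for illustration; the actual patterns in src/preprocess.py may differ. Note that weight and fluid ounces are treated as one unit here, which mirrors the single normalized_size_oz column described above.

```python
import re
from typing import Optional

# Hypothetical patterns, not taken from src/preprocess.py.
_PACK_RE = re.compile(r"\b(?:pack|set|case|box)\s+of\s+(\d+)\b", re.IGNORECASE)
_SIZE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(fl\.?\s*oz|oz|ounces?|lbs?|pounds?|ml|l|g|kg)\b",
    re.IGNORECASE,
)

# Approximate conversion factors to ounces (US fluid ounces for volumes).
_TO_OZ = {
    "oz": 1.0, "floz": 1.0, "ounce": 1.0,
    "lb": 16.0, "pound": 16.0,
    "ml": 0.033814, "l": 33.814,
    "g": 0.035274, "kg": 35.274,
}


def extract_ipq(text: str) -> int:
    """Return the pack quantity, or 1 when none is mentioned (assumed default)."""
    match = _PACK_RE.search(text)
    return int(match.group(1)) if match else 1


def extract_size_oz(text: str) -> Optional[float]:
    """Return the first size mentioned in the text, converted to ounces."""
    match = _SIZE_RE.search(text)
    if match is None:
        return None
    # Normalize the unit: lowercase, drop dots and spaces, drop a plural "s".
    unit = re.sub(r"[.\s]", "", match.group(2).lower())
    if unit not in _TO_OZ and unit.endswith("s"):
        unit = unit[:-1]
    return float(match.group(1)) * _TO_OZ[unit]
```

For example, `extract_ipq("Green Tea, Pack of 12")` gives 12 and `extract_size_oz("2 lbs")` gives 32.0. A real catalog will contain formats this sketch misses (for instance "12-pack" or "12 ct"), so the patterns would need extending against the actual data.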

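The log-target trick from step 3 can be sketched as a round trip. This example uses the standard-library `math.log1p` and `math.expm1` on single values so that it runs without dependencies; the pipeline itself is described as using `np.log1p` on the whole price column. The function names and the clipping of negative outputs to zero are assumptions for illustration, not something the README states.

```python
import math


def to_log_target(price: float) -> float:
    """Training target: log(1 + price), which compresses the long right tail of prices."""
    return math.log1p(price)


def from_log_prediction(log_pred: float) -> float:
    """Map a model output back to the price scale with the inverse transform.

    Clipping at zero is an assumed safeguard: a regressor can emit a negative
    log-space value, which would otherwise become a negative price.
    """
    return max(math.expm1(log_pred), 0.0)


# Round trip: the original price is recovered up to floating-point error.
recovered = from_log_prediction(to_log_target(19.99))
```

The motivation is that an error of fixed size in log space corresponds to a roughly fixed percentage error in price, which is closer to what a relative metric such as SMAPE penalizes than a squared error on raw prices.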

1. Run the Pipeline

The pipeline runs in two steps.

Step 1: Preprocess the Data

This script creates the train_clean.csv and test_clean.csv files in the main directory.

python src/preprocess.py

Step 2: Train the Model and Generate Submission

This script generates the text embeddings, trains the LightGBM model, runs cross-validation, and creates the final submission.csv file in the main directory.

python src/train_model.py

📊 Results

Our model achieves the following performance, measured with 5-fold cross-validation on the training data:

  • Average Cross-Validation SMAPE: ~52.4
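For reference, a minimal sketch of how a SMAPE score like the one above can be computed. It assumes the common symmetric definition, with the mean of |actual| and |predicted| in the denominator and the result scaled to a percentage; the challenge's exact formula and the code in src/train_model.py may differ. The function name `smape` and the convention of scoring a 0/0 pair as zero error are assumptions.

```python
from typing import Sequence


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in percent (0 is a perfect score)."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Assumed convention: when both values are zero, count the error as zero.
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(actual)
```

Under this definition the score ranges from 0 to 200, and predicting 50 for a true price of 100 scores about 66.7. A cross-validation average is then the mean of `smape` over the held-out fold of each split.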
