Smart Product Pricing Challenge - Amazon ML Challenge 2025

This repository contains our team's solution for the Amazon ML Challenge 2025. The goal is to predict product prices from their catalog content and images. Our final approach is text-only: it does not use the images and relies instead on engineered features and semantic text embeddings.

📂 Folder Structure

The repository is organized as follows:

├── dataset/         # Contains the raw and intermediate data files provided for the challenge.
├── output/          # Where the final submission file is saved.
├── src/             # All Python source code for preprocessing and model training.
├── README.md        # You are here!
└── requirements.txt # A list of all necessary Python libraries.

💡 Our Approach

Our methodology is a sequential pipeline that focuses on extracting the maximum value from the text data provided.

Preprocessing & Manual Feature Engineering: We first run a script (src/preprocess.py) to clean the raw text data. This script also uses regular expressions to engineer two critical features:
- ipq (Item Pack Quantity): Extracts the number of items in a pack (e.g., "pack of 12" becomes 12).
- normalized_size_oz: Extracts product weight/volume (e.g., 16 oz, 2 lbs, 750ml) and converts them all to a standard unit (ounces) for fair comparison.
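The two regex features above can be sketched as follows. This is a minimal illustration, not the code in src/preprocess.py: the patterns, the function names (extract_ipq, extract_size_oz), the default pack quantity of 1, and the choice to map millilitres to US fluid ounces are all assumptions made for the example.

```python
import re

# Hypothetical patterns; the real preprocessing script may use different ones.
PACK_RE = re.compile(r"\bpack\s+of\s+(\d+)\b", re.IGNORECASE)
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(oz|lbs?|kg|g|ml|l)\b", re.IGNORECASE)

# Conversion factors to ounces (weight units -> oz, volume units -> US fl oz).
OZ_PER_UNIT = {
    "oz": 1.0,
    "lb": 16.0,
    "g": 1 / 28.349523125,
    "kg": 1000 / 28.349523125,
    "ml": 1 / 29.5735295625,
    "l": 1000 / 29.5735295625,
}


def extract_ipq(text, default=1):
    """Return the pack quantity, or `default` (assumed 1) when none is found."""
    match = PACK_RE.search(text)
    return int(match.group(1)) if match else default


def extract_size_oz(text):
    """Return the first size mention converted to ounces, or None if absent."""
    match = SIZE_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower().rstrip("s")  # "lbs" -> "lb"
    return value * OZ_PER_UNIT[unit]


print(extract_ipq("Granola bars, Pack of 12"))  # 12
print(extract_size_oz("Ground coffee 2 lbs"))   # 32.0
```

Returning None for a missing size (rather than 0) lets a tree model such as LightGBM treat "no size found" as a missing value instead of a very small product.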
Text Embedding: To capture the semantic meaning of the product descriptions (brand, quality, materials), we use the thenlper/gte-large model, loaded through the sentence-transformers library. This converts each product's catalog_content into a high-dimensional feature vector.
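A minimal sketch of the embedding step is shown below. The helper names (load_encoder, embed_texts), the batch size, and the handling of missing text are assumptions for illustration; only the model name and the sentence-transformers library come from this README. The encoder is passed in as an argument so the function can be exercised without downloading the model.

```python
import numpy as np

MODEL_NAME = "thenlper/gte-large"  # model named in this README


def load_encoder(model_name=MODEL_NAME):
    """Load the model via sentence-transformers (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer  # imported lazily
    return SentenceTransformer(model_name)


def embed_texts(texts, encoder, batch_size=32):
    """Return a 2-D float32 array with one embedding row per input text."""
    # Replace missing values (None, NaN) with an empty string before encoding.
    cleaned = [t if isinstance(t, str) else "" for t in texts]
    vectors = encoder.encode(cleaned, batch_size=batch_size, show_progress_bar=False)
    return np.asarray(vectors, dtype=np.float32)
```

Typical usage would be `embeddings = embed_texts(df["catalog_content"].tolist(), load_encoder())`, where `df` is a hypothetical DataFrame holding the cleaned training data.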
Modeling: The final feature set (manual features + text embeddings) is used to train a LightGBM regressor. Because SMAPE penalizes relative rather than absolute error, the model is trained to predict the log-transformed price, np.log1p(price), and its predictions are mapped back to prices with the inverse transform (np.expm1 inverts np.log1p).
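The log-target idea can be sketched as a thin wrapper around any regressor that exposes fit and predict. The class name LogPriceRegressor and the clipping of negative predictions to zero are assumptions for this example, not details taken from src/train_model.py.

```python
import numpy as np


class LogPriceRegressor:
    """Wrap a fit/predict regressor so that it learns log1p(price)."""

    def __init__(self, base_model):
        self.base_model = base_model

    def fit(self, X, price):
        # Train on the log-transformed target.
        log_price = np.log1p(np.asarray(price, dtype=float))
        self.base_model.fit(X, log_price)
        return self

    def predict(self, X):
        log_pred = self.base_model.predict(X)
        # expm1 inverts log1p; clip at 0 because a price cannot be negative.
        return np.clip(np.expm1(log_pred), 0.0, None)
```

With LightGBM installed, the wrapper could be used as `LogPriceRegressor(lightgbm.LGBMRegressor()).fit(X_train, y_train)`, since LGBMRegressor follows the scikit-learn fit/predict interface. The names X_train and y_train are placeholders.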

1. Run the Pipeline

The process is a simple two-step execution.

Step 1: Preprocess the Data

This script creates the train_clean.csv and test_clean.csv files in the main directory.

python src/preprocess.py

Step 2: Train the Model and Generate Submission

This script generates the text embeddings, trains the LightGBM model, runs cross-validation, and creates the final submission.csv file in the main directory.

python src/train_model.py

📊 Results

Our model achieves the following performance under 5-fold cross-validation on the training data:

Average Cross-Validation SMAPE: **~ 52.4**
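For reference, a cross-validated SMAPE of this kind can be computed roughly as below. This sketch assumes the common SMAPE definition on a 0 to 200 percentage scale and a plain shuffled 5-fold split; the function names (smape, cross_val_smape), the random seed, and the fold construction are illustrative and may differ from what src/train_model.py does.

```python
import numpy as np


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    # Where both values are zero, count the error as zero instead of NaN.
    ratio = np.divide(np.abs(predicted - actual), denom,
                      out=np.zeros_like(denom), where=denom > 0)
    return 100.0 * float(ratio.mean())


def cross_val_smape(make_model, X, y, n_splits=5, seed=42):
    """Return (mean SMAPE, per-fold SMAPE list) over shuffled K folds."""
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model()  # fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(smape(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores)), scores
```

Here make_model is any zero-argument callable that returns an unfitted regressor, so each fold starts from a fresh model and no fitted state leaks between folds.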
