Repository for Parma Calcio Data Scientist assignment. Includes two tasks: building an xG model using StatsBomb open data (event data / freeze-frame) and predicting the 2015/16 Ballon d’Or winner from Big-5 leagues data. Implemented in Python with notebooks and reusable modules.

Manuele23/Parma-assignment

Parma Calcio 1913 Data Scientist Assignment

Built with: Python, Jupyter, statsbombpy, pandas, scikit-learn, matplotlib, seaborn

This repository contains my solution to the Parma Calcio 1913 Data Scientist technical assignment.
The project is divided into two main tasks, each organized in its own folder:

  • Task 1 – xG Model (task1_xg/): building and evaluating multiple expected goals (xG) models
  • Task 2 – Ballon d’Or 2015/16 (task2_ballon_dor/): ranking players from the 2015/16 season of the Big 5 European leagues to identify the best player according to the data

All data come from the StatsBomb Open Data repository and are accessed programmatically using statsbombpy, so no manual downloads are required.
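
As a rough illustration of what this programmatic access looks like, the sketch below lists the matches of a competition and keeps only the shot events of one match. The helper names list_matches and load_shots and the sb_client parameter are illustrative and are not taken from the notebooks; sb.matches and sb.events are the statsbombpy calls they wrap.

```python
def list_matches(competition_id, season_id, sb_client=None):
    """Return the matches table for one competition and season."""
    if sb_client is None:
        # Imported lazily so the helpers can be defined without statsbombpy installed.
        from statsbombpy import sb as sb_client
    return sb_client.matches(competition_id=competition_id, season_id=season_id)


def load_shots(match_id, sb_client=None):
    """Return only the shot events of one match."""
    if sb_client is None:
        from statsbombpy import sb as sb_client
    events = sb_client.events(match_id=match_id)
    # statsbombpy exposes the event type as a plain string column named "type".
    return events[events["type"] == "Shot"]


# Example (needs network access; the IDs can be looked up with sb.competitions()):
# matches = list_matches(competition_id, season_id)
# shots = load_shots(matches["match_id"].iloc[0])
```

The optional sb_client argument only exists so the helpers can be exercised with a stub; in normal use it can be left out.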

Setup & Installation

Clone the repository locally:

git clone https://github.com/Manuele23/Parma-assignment.git
cd Parma-assignment

Option 1 – Using setup.ipynb (recommended)

Open and run setup.ipynb.
It installs all dependencies listed in requirements.txt into your current environment.
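
The contents of setup.ipynb are not reproduced here. A notebook cell that installs from requirements.txt into the interpreter running the kernel could look roughly like the sketch below; pip_install_command and install_requirements are hypothetical names chosen for this example.

```python
import subprocess
import sys
from pathlib import Path


def pip_install_command(requirements="requirements.txt"):
    """Build a pip command bound to the interpreter running this code."""
    # sys.executable keeps the install inside the active kernel or virtual environment.
    return [sys.executable, "-m", "pip", "install", "-r", str(requirements)]


def install_requirements(requirements="requirements.txt"):
    """Install the listed dependencies, failing early if the file is missing."""
    path = Path(requirements)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; run this from the repository root")
    subprocess.check_call(pip_install_command(path))


# install_requirements()  # uncomment to run the installation
```

Calling pip through sys.executable avoids installing into a different Python than the one the notebook kernel uses.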

Option 2 – Manual installation with virtual environment

If you prefer an isolated environment:

# create a virtual environment
python3 -m venv venv

# activate it (run only the line for your platform)
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows

# install dependencies
pip install -r requirements.txt

Reproducibility

To reproduce the results from scratch:

  1. Clone the repository
  2. Set up the environment (via setup.ipynb or virtual env)
  3. Run the notebooks in the provided order
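
The numbered notebooks can also be executed in order without opening them one by one. The sketch below is one way to do that, assuming jupyter and nbconvert are available in the environment; ordered_notebooks and nbconvert_command are hypothetical helper names, not part of the repository.

```python
import re
from pathlib import Path


def ordered_notebooks(folder):
    """Return the numbered notebooks in a folder, sorted by their numeric prefix."""
    numbered = []
    for path in Path(folder).glob("*.ipynb"):
        match = re.match(r"(\d+)_", path.name)
        if match:  # unnumbered notebooks such as xg_demo.ipynb are skipped
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


def nbconvert_command(notebook):
    """Build a command that executes a notebook in place with nbconvert."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", str(notebook)]
```

Each command can then be passed to subprocess.run(..., check=True) for every path returned by ordered_notebooks("task1_xg") or ordered_notebooks("task2_ballon_dor").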

Git LFS (for large files)

One large file (shots_df.csv) is tracked with Git LFS.
To download it together with the repository, install Git LFS and pull the file; without Git LFS, the clone contains only a small pointer file in its place.
This step is optional, since the file can also be regenerated by running the notebooks.

git lfs install
git lfs pull
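
If a notebook fails while reading shots_df.csv, it can help to check whether the file on disk is the real dataset or only a Git LFS pointer. A minimal sketch, relying on the fact that LFS pointer files start with a fixed version line; is_lfs_pointer is a hypothetical helper and the path in the usage comment is an assumption, since the exact location of the file is not stated here.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is a Git LFS pointer stub rather than the real data."""
    with Path(path).open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Usage (assumed path, adjust to where shots_df.csv actually lives):
# if is_lfs_pointer("task1_xg/data/shots_df.csv"):
#     print("Pointer only: run `git lfs pull` or regenerate the file from the notebooks.")
```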

Project Structure

├── task1_xg/                      
│   ├── data/                        # datasets created for the task
│   ├── models/                      # trained models (excluded if large)
│   ├── outputs/                     # generated evaluation metrics for each model
│   ├── 01_data_exploration.ipynb  
│   ├── 02_shot_analysis.ipynb     
│   ├── 03_dataset_building.ipynb  
│   ├── 04_linear_regression.ipynb 
│   ├── 05_random_forest.ipynb     
│   ├── 06_xgboost.ipynb           
│   ├── 07_neural_network.ipynb    
│   ├── 08_model_comparison.ipynb  
│   ├── 09_ds_final.ipynb          
│   └── xg_demo.ipynb              
│
├── task2_ballon_dor/              
│   ├── data/                        # datasets created for the task
│   ├── 01_data_preparation.ipynb  
│   ├── 02_data_preprocessing.ipynb
│   └── 03_final_ranking.ipynb     
│
├── Assignment.pdf                 
├── requirements.txt               
├── setup.ipynb                    
├── README.md                      
└── .gitignore / .gitattributes

How to Run

Task 1 – Expected Goals (xG) Model

  • Navigate to task1_xg/
  • For a quick demo, run xg_demo.ipynb and use the last cell to try it (note: building the demo widgets may take a while)
  • Otherwise, run the notebooks in order from 01 to 09 (note: data retrieval may take a while)
  • The Random Forest model (model_rf.pkl) is not included in the repo because of its size:
    • To regenerate it, run 05_random_forest.ipynb.
  • Final evaluation is in 08_model_comparison.ipynb and 09_ds_final.ipynb
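
As background for readers new to xG, the sketch below computes two geometric features that xG models commonly use: the distance from the shot to the goal centre and the angle subtended by the goal mouth. This is a generic illustration under stated assumptions, not necessarily the feature set built in 03_dataset_building.ipynb; the function names are hypothetical. It relies on the StatsBomb pitch coordinate system (120 x 80, goal centred at (120, 40), posts at y = 36 and y = 44).

```python
import math

# StatsBomb pitch coordinates: the attacking goal is centred at (120, 40)
# and is 8 units wide (posts at y = 36 and y = 44).
GOAL_X, GOAL_Y, GOAL_WIDTH = 120.0, 40.0, 8.0


def shot_distance(x, y):
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)


def shot_angle(x, y):
    """Angle in radians subtended by the goal mouth from the shot location."""
    dx, dy = GOAL_X - x, GOAL_Y - y
    half = GOAL_WIDTH / 2
    # atan2 keeps the result in [0, pi], including shots taken from very close range.
    return math.atan2(GOAL_WIDTH * dx, dx * dx + dy * dy - half * half)
```

For a shot from the penalty spot at (108, 40), shot_distance gives 12 and shot_angle equals 2 * atan(4 / 12), the angle between the lines to the two posts.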

Task 2 – Ballon d’Or 2015/16

  • Navigate to task2_ballon_dor/
  • If you just want to see the final ranking, run only:
    • 03_final_ranking.ipynb
  • If you prefer to re-download the datasets and reprocess everything from scratch, run in order:
    1. 01_data_preparation.ipynb (note: data retrieval may take a while)
    2. 02_data_preprocessing.ipynb
    3. 03_final_ranking.ipynb

Notes

  • Only StatsBomb open data was used, as required by the assignment
  • Each notebook includes detailed Markdown cells that explain the rationale behind the methodology, the assumptions made, and the simplifications adopted step by step
  • Models and outputs too large for GitHub are excluded, but can be easily regenerated locally by running the corresponding notebooks
