This repository contains my solution to the Parma Calcio 1913 Data Scientist technical assignment.
The project is divided into two main tasks, each organized in its own folder:
- Task 1 – xG Model (`task1_xg/`): building and evaluating multiple expected goals (xG) models
- Task 2 – Ballon d'Or 2015/16 (`task2_ballon_dor/`): ranking players from the Big 5 leagues in the 2015/16 season to determine the best player according to the data
All data come from the StatsBomb Open Data repository and are accessed programmatically using `statsbombpy`, so no manual downloads are required.
Clone the repository locally:

```bash
git clone https://github.com/Manuele23/Parma-assignment.git
cd Parma-assignment
```

Then open and run `setup.ipynb`. It installs all required dependencies automatically in your environment (based on `requirements.txt`).
If you prefer an isolated environment:

```bash
# create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate   # on Linux/Mac
venv\Scripts\activate      # on Windows

# install dependencies
pip install -r requirements.txt
```

Once the environment is ready:
- Clone the repository
- Set up the environment (via `setup.ipynb` or a virtual environment)
- Run the notebooks in the provided order
One large file (`shots_df.csv`) is tracked with Git LFS. You may need to install Git LFS to download it directly from the repository, but this is not strictly required, since the file can also be regenerated by running the notebooks.

```bash
git lfs install
git lfs pull
```

```
├── task1_xg/
│   ├── data/                       # created datasets for the task
│   ├── models/                     # trained models (excluded if large)
│   ├── outputs/                    # generated evaluation metrics for each model
│   ├── 01_data_exploration.ipynb
│   ├── 02_shot_analysis.ipynb
│   ├── 03_dataset_building.ipynb
│   ├── 04_linear_regression.ipynb
│   ├── 05_random_forest.ipynb
│   ├── 06_xgboost.ipynb
│   ├── 07_neural_network.ipynb
│   ├── 08_model_comparison.ipynb
│   ├── 09_ds_final.ipynb
│   └── xg_demo.ipynb
│
├── task2_ballon_dor/
│   ├── data/                       # created datasets for the task
│   ├── 01_data_preparation.ipynb
│   ├── 02_data_preprocessing.ipynb
│   └── 03_final_ranking.ipynb
│
├── Assignment.pdf
├── requirements.txt
├── setup.ipynb
├── README.md
└── .gitignore / .gitattributes
```
- Navigate to `task1_xg/`
- For a quick demo, run `xg_demo.ipynb` and go to the last cell to try the demo (note: this step may take a while, as building the demo widgets is slow)
- Otherwise, run the notebooks in order from 01 to 09 (note: this step may take a while, as data retrieval is slow)
- Note: the Random Forest model (`model_rf.pkl`) is not included in the repo because of its size. To regenerate it, run `05_random_forest.ipynb`
- The final evaluation is in `08_model_comparison.ipynb` and `09_ds_final.ipynb`
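The modelling notebooks work from StatsBomb shot coordinates (a 120 × 80 pitch, with the attacked goal centred at (120, 40) and posts at y = 36 and y = 44). Two geometric features common to most xG models — distance to goal and the angle subtended by the goal mouth — can be sketched as follows (helper names are illustrative, not the repo's code):

```python
import math

GOAL_X, GOAL_Y = 120.0, 40.0      # goal centre in StatsBomb coordinates (120 x 80 pitch)
POST_LOW, POST_HIGH = 36.0, 44.0  # y-coordinates of the two posts

def shot_distance(x: float, y: float) -> float:
    """Euclidean distance from the shot location to the goal centre."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)

def shot_angle(x: float, y: float) -> float:
    """Angle (radians) subtended by the goal mouth as seen from the shot location."""
    a = math.hypot(GOAL_X - x, POST_LOW - y)   # distance to one post
    b = math.hypot(GOAL_X - x, POST_HIGH - y)  # distance to the other post
    goal_width = POST_HIGH - POST_LOW
    # law of cosines: cos(theta) = (a^2 + b^2 - w^2) / (2ab)
    cos_theta = (a * a + b * b - goal_width ** 2) / (2 * a * b)
    return math.acos(max(-1.0, min(1.0, cos_theta)))

# A shot from the penalty spot (108, 40): 12 units out, straight in front of goal
print(round(shot_distance(108, 40), 1))                 # → 12.0
print(round(math.degrees(shot_angle(108, 40)), 1))      # → 36.9
```

Larger angles and shorter distances both correlate with higher scoring probability, which is why this pair of features is a common baseline input for xG models.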
- Navigate to `task2_ballon_dor/`
- If you just want to see the final ranking, run only `03_final_ranking.ipynb`
- If you prefer to re-download the datasets and reprocess everything from scratch, run in order:
  1. `01_data_preparation.ipynb` (note: this step may take a while, as data retrieval is slow)
  2. `02_data_preprocessing.ipynb`
  3. `03_final_ranking.ipynb`
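The general shape of a ranking pipeline like this — normalise each per-player metric, then combine them into one score — can be sketched as below. The player names, metrics, and weights are entirely hypothetical toy values; the actual method and data live in `03_final_ranking.ipynb`:

```python
# Hypothetical sketch: combining normalised per-player metrics into one ranking score.
players = {
    "Player A": {"goals": 35, "assists": 11},
    "Player B": {"goals": 26, "assists": 16},
    "Player C": {"goals": 40, "assists": 7},
}
weights = {"goals": 0.6, "assists": 0.4}  # illustrative weights, not the repo's

def min_max(values):
    """Scale a list of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def rank(players, weights):
    """Rank players by the weighted sum of their min-max-scaled metrics."""
    names = list(players)
    score = {n: 0.0 for n in names}
    for metric, w in weights.items():
        scaled = min_max([players[n][metric] for n in names])
        for n, s in zip(names, scaled):
            score[n] += w * s
    return sorted(names, key=lambda n: score[n], reverse=True)

print(rank(players, weights))  # → ['Player C', 'Player A', 'Player B']
```

Min-max scaling keeps metrics with different magnitudes (goals vs. assists) comparable before they are weighted and summed.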
- Only StatsBomb open data was used, as required by the assignment
- Each notebook includes detailed Markdown cells that explain the rationale behind the methodology, the assumptions made, and the simplifications adopted step by step
- Models and outputs too large for GitHub are excluded, but can be easily regenerated locally by running the corresponding notebooks