This repository contains the source code for a code comprehension predictor service for computational notebooks.
To run the code, install the requirements by executing the following command:
pip install -r requirements.txt
After installing the requirements, you can run the CLI or API and start using the service.
To use the functionality provided in this repository, you will need certain CSV files containing notebook code and markdown cell data. These files are available from DistilKaggle (a distilled dataset of Kaggle Jupyter notebooks) and from "A Predictive Model to Identify Effective Metrics for the Comprehension of Computational Notebooks".
Use the download links below to get started; a short pandas loading sketch follows the list.
- notebook_metrics.csv: notebook features file, mainly used to train the models.
- code.csv: mainly used for metrics extraction.
- augmented_kernel_quality.csv: notebook scores file, mainly used to train the models.
- sample1050_labeled_by_experts.csv: used to evaluate the models.
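As a quick sanity check after downloading, the files can be loaded with pandas. This is a minimal sketch, assuming the files sit in your working directory; the column names are whatever the CSVs ship with, so inspect them before joining features with scores.

```python
# Minimal sketch: load the downloaded CSVs and inspect their contents.
# The file locations are assumptions; adjust the paths to wherever you saved them.
import pandas as pd

notebook_metrics = pd.read_csv("notebook_metrics.csv")            # features for training
scores = pd.read_csv("augmented_kernel_quality.csv")              # scores for training
expert_sample = pd.read_csv("sample1050_labeled_by_experts.csv")  # evaluation set

print(notebook_metrics.columns.tolist())
print(scores.columns.tolist())
print(expert_sample.shape)
```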
- src: Contains the main code that provides code comprehension prediction and metrics evaluation.
- src/core: includes the main Python files of the project. These classes and functions do the actual work behind the interfaces.
- src/utils: helper files used to manage the project, such as config.py, where all configuration is managed.
- src/notebooks: base notebook files that support the paper's results.
- dataframes: Contains basic data of selected Jupyter notebooks for training the models. For example, code.csv contains the source code used in each notebook, and markdown.csv holds the markdown cell data.
- metrics: Contains CSV files with metrics of selected Jupyter notebooks for training the models. For instance, code_cell_metrics.csv contains metrics of each code cell in the notebook, markdown_cell_metrics.csv contains markdown cell metrics of each notebook, and notebook_metrics.csv holds the aggregated metrics of all cells in the notebook.
- notebooks: Stores the notebooks that are fed to the predictor.
- models: Stores the trained models.
- logs: Keeps the log files.
- cache: Holds cached data.
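To confirm that the layout described above is in place, here is a minimal check. It assumes you run it from the repository root and only verifies that the listed directories exist.

```python
# Minimal sketch: verify the directory layout listed above.
# Directory names are taken directly from the structure above.
from pathlib import Path

for name in ["src", "dataframes", "metrics", "notebooks", "models", "logs", "cache"]:
    print(f"{name}: {'ok' if Path(name).is_dir() else 'missing'}")
```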
First, cd to the src directory and then execute the cli.py file to start your journey.
cd src
export PYTHONPATH="$(pwd)"
python cli.py --help
Use --help with each command to get further instructions. Some use cases are provided below, with a short illustrative sketch after each group of commands.
python cli.py
python cli.py extract-dataframe-metrics --help
python cli.py extract-dataframe-metrics --chunk-size 100 --limit-chunk-count 5
python cli.py extract-dataframe-metrics ../dataframes/markdown.csv ../metrics/markdown_cell_metrics.csv --chunk-size 100 --limit-chunk-count 5 --file-type markdown
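The --chunk-size and --limit-chunk-count flags suggest that the extractor processes the CSV in chunks. The sketch below illustrates that general pandas pattern; it is not the project's actual implementation, and the per-cell metric computation is left as a placeholder.

```python
# Rough illustration of chunked CSV processing, mirroring the CLI flags.
# This is NOT the project's implementation, just the common pandas idiom.
import pandas as pd

chunk_size = 100        # rows per chunk (mirrors --chunk-size)
limit_chunk_count = 5   # stop after this many chunks (mirrors --limit-chunk-count)

with pd.read_csv("../dataframes/markdown.csv", chunksize=chunk_size) as reader:
    for i, chunk in enumerate(reader):
        if i >= limit_chunk_count:
            break
        # A real extractor would compute per-cell metrics here.
        print(f"chunk {i}: {len(chunk)} rows")
```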
python cli.py aggregate-metrics --help
python cli.py aggregate-metrics ../metrics/code_cell_metrics.csv ../metrics/markdown_cell_metrics.csv ../metrics/notebook_metrics_lite.csv
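Conceptually, aggregation rolls the cell-level metrics up to one row per notebook. The sketch below is hypothetical: the kernel_id join key and the mean aggregation are assumptions for illustration, so check the real CSV headers and the CLI output before adapting it.

```python
# Hypothetical sketch of cell-to-notebook aggregation. The "kernel_id"
# join key and mean() aggregation are assumptions, not the project's logic.
import pandas as pd

code = pd.read_csv("../metrics/code_cell_metrics.csv")
markdown = pd.read_csv("../metrics/markdown_cell_metrics.csv")

code_agg = code.groupby("kernel_id").mean(numeric_only=True)
md_agg = markdown.groupby("kernel_id").mean(numeric_only=True)

code_agg.join(md_agg, lsuffix="_code", rsuffix="_md") \
        .to_csv("../metrics/notebook_metrics_lite.csv")
```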
python cli.py extract-notebook-metrics --help
python cli.py extract-notebook-metrics ../notebooks/file.ipynb ../notebooks/results.json
python cli.py extract-notebook-metrics ../notebooks/file.ipynb ../notebooks/results.csv
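For a sense of what the input looks like, an .ipynb file is just JSON. The sketch below pulls a few trivial counts with the standard library; the real extract-notebook-metrics command computes a much richer feature set, so treat this only as an illustration of the format.

```python
# Minimal sketch: read a notebook as JSON and compute trivial counts.
# The real command extracts far more metrics than this.
import json

with open("../notebooks/file.ipynb", encoding="utf-8") as f:
    nb = json.load(f)

code_cells = [c for c in nb["cells"] if c["cell_type"] == "code"]
md_cells = [c for c in nb["cells"] if c["cell_type"] == "markdown"]
loc = sum(len(c["source"]) for c in code_cells)  # "source" is a list of lines

print({"code_cells": len(code_cells), "markdown_cells": len(md_cells), "loc": loc})
```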
python cli.py predict ../notebooks/file.ipynb cat_boost ../models/catBoostClassifier.withOutPT.sf50.sr20.combined_score.v2.model
python cli.py predict ../notebooks/file.ipynb cat_boost ../models/catBoostClassifier.withPT.sf50.sr20.combined_score.v2.model --pt-score 10
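Under the hood, prediction amounts to loading a trained CatBoost model and scoring a notebook's metric vector. The sketch below assumes the features were first exported with extract-notebook-metrics and happen to match the model's expected columns, which may not hold; prefer the cli.py predict command for the supported path.

```python
# Hypothetical sketch of the predict step: load a CatBoost model and score
# a metrics row. Assumes ../notebooks/results.csv matches the model's
# expected feature columns; use `cli.py predict` in practice.
import pandas as pd
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.load_model("../models/catBoostClassifier.withOutPT.sf50.sr20.combined_score.v2.model")

features = pd.read_csv("../notebooks/results.csv")
print(model.predict(features))
```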
First, cd to the src directory and then execute the main.py file to start your journey.
cd src
export PYTHONPATH="$(pwd)"
python main.py
After this, you can view the API documentation at http://localhost:8000/docs.
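Once the server is running, a quick way to confirm it is up and list its routes is shown below. This assumes the service is a FastAPI app, which the /docs URL suggests; /openapi.json is FastAPI's standard schema endpoint.

```python
# Quick liveness check against the running API, assuming FastAPI.
import requests

resp = requests.get("http://localhost:8000/openapi.json", timeout=5)
resp.raise_for_status()
print("Available endpoints:", sorted(resp.json()["paths"]))
```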
Use the command below to build and run the image with Docker Compose:
docker compose up --build