Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task
This repository contains the code for the source code embedding algorithms described in the paper Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task [1]. It also includes an implementation of control flow graph-based source code embeddings, for reproducing the experiments described in our paper Source Code Embeddings Based on Control Flow Graphs and Markov Chains for Program Classification [3].
- Install Docker CE and GNU make.
- Clone the repository, then clone the submodules using `git submodule update --init --recursive`.
- Download the dataset [2] from Zenodo and extract the `task-*.csv` files into `src/data`.
- Classification targets can contain digits, so navigate to `external/code2vec/common.py` and apply the patch:
      @staticmethod
      def legal_method_names_checker(special_words, name):
    -     return name != special_words.OOV and re.match(r'^[a-zA-Z|]+$', name)
    +     return name != special_words.OOV
- Run `make notebook` from the repository root, then run the notebooks.
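
The patch to `external/code2vec/common.py` in the steps above removes a regular expression check that accepts only letters and the `|` separator. The following minimal sketch reproduces just that regular expression (not the `special_words.OOV` comparison) to show why names containing digits would be rejected without the patch. The helper name `passes_original_check` and the sample labels are hypothetical and used only for illustration; they are not taken from the dataset or from code2vec.

```python
import re

# Pattern copied from the check that the patch removes.
ORIGINAL_PATTERN = r'^[a-zA-Z|]+$'


def passes_original_check(name: str) -> bool:
    """Return True if `name` consists only of ASCII letters and '|'."""
    return re.match(ORIGINAL_PATTERN, name) is not None


# Hypothetical labels, for illustration only.
for label in ["get|name", "task|3", "7"]:
    print(label, passes_original_check(label))
```

Only the first label passes; the other two contain digits and fail the pattern, which is the behaviour the patch is meant to avoid for numeric classification targets.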
1. Gorchakov, A.V.; Demidova, L.A.; Sovietov, P.N. Analysis of Program Representations Based on Abstract Syntax Trees and Higher-Order Markov Chains for Source Code Classification Task. Future Internet 2023, 15, 314.
2. Demidova, L.A.; Andrianova, E.G.; Sovietov, P.N.; Gorchakov, A.V. Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant. Data 2023, 8 (6), 109.
3. Gorchakov, A.V.; Demidova, L.A.; Maslennikov, V.V. Source Code Embeddings Based on Control Flow Graphs and Markov Chains for Program Classification. Proceedings of the 2024 6th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA). IEEE, 2024, pp. 328-333.
If you use the code from this repository in your research work, please consider citing [1] or [3].