This repository contains the source code and data used in the article "Pattern-Based Graph Classification: Comparison of Quality Measures and Importance of Preprocessing".
Content
This repository is composed of the following elements:
- requirements.txt: list of required Python packages.
- src: folder containing the source code.
  - ClusteringComparison.py: script that reproduces the experiments of Section 5.2.1 and Section 5.2.3.
  - KendallTauHistogram.py: script that reproduces the experiments of Section 5.2.2.
  - PairwiseComparisons.py: script that reproduces the experiments of Section 5.3.
  - GoldStandardComparison.py: script that reproduces the experiments of Section 5.4.
- data: folder containing the input data. Each subfolder corresponds to a distinct dataset, cf. Section Datasets.
- results: files produced by the processing.
First, you need to install the Python language and the required packages:
- Install the Python language.
- Download this project from GitHub and unzip it.
- Execute pip install -r requirements.txt to install the required packages (see also Section Dependencies).
Second, one of the dependencies, SPMF, is not a Python package, but rather a Java program, and therefore requires a specific installation process:
- Download its source code from Philippe Fournier-Viger's website.
- Follow the installation instructions provided on the same website.
Note that we use the JAR implementation of SPMF.
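Since SPMF is used as a JAR, it can be called from Python as an external process. The following is a minimal sketch, not the code used in src: it assumes the generic SPMF command-line form (java -jar spmf.jar run, then the algorithm name, input file, output file and parameters). The algorithm name "GSPAN", the file paths, and the parameter value are placeholders for illustration and should be checked against the SPMF documentation.

```python
import subprocess
from pathlib import Path


def build_spmf_command(jar_path, algorithm, input_file, output_file, *params):
    """Build the argument list for one run of the SPMF JAR."""
    return [
        "java", "-jar", str(jar_path),
        "run", algorithm,
        str(input_file), str(output_file),
        *[str(p) for p in params],
    ]


def run_spmf(jar_path, algorithm, input_file, output_file, *params):
    """Run SPMF; raises CalledProcessError if Java exits with an error."""
    if not Path(jar_path).is_file():
        raise FileNotFoundError(f"SPMF JAR not found: {jar_path}")
    cmd = build_spmf_command(jar_path, algorithm, input_file, output_file, *params)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths and parameter value, shown only to illustrate the call.
    cmd = build_spmf_command(
        "spmf.jar", "GSPAN", "data/example/graphs.txt", "results/patterns.txt", 0.5
    )
    print(" ".join(cmd))
```

Building the command as a list (rather than a single string) avoids shell quoting issues when paths contain spaces.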
Datasets
We retrieved the following datasets from the SPMF website:
- MUTAG: MUTAG dataset, representing chemical compounds and their mutagenic properties [D'91]
- NCI1: NCI1 dataset, representing molecules classified according to carcinogenicity [W'06]
- PTC: PTC dataset, representing molecules classified according to carcinogenicity [T'03]
- DD: DD dataset, representing amino acids and their interactions [D'03]
- IMDB-Binary: IMDB-Binary dataset, representing movie collaboration graphs [Y'15]
We retrieved two datasets from the TUDataset website:
- AIDS dataset, representing chemical compounds tested for AIDS inhibition [R'08]
- FRANKENSTEIN dataset, representing chemical compounds and their mutagenic properties [O'15]
The public procurement dataset contains graphs extracted from the FOPPA database, available on Zenodo:
- FOPPA: dataset extracted from FOPPA, a database of French public procurement notices [P'23b]
We provide two scripts to reproduce the experiments:
- General.sh: reproduces all experiments described in our paper.
- OneDataset.sh (dataset): reproduces the experiments concerning the specified dataset.
Each script extracts the data and then performs the associated experiments.
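To run OneDataset.sh for several datasets in a row, a small Python wrapper can be convenient. The sketch below assumes that the script takes the dataset name as its single positional argument and that the names match the ones listed in this README; both points are assumptions to verify against the script itself.

```python
import subprocess

# Dataset names as they appear in this README. Whether OneDataset.sh expects
# exactly these spellings is an assumption.
KNOWN_DATASETS = (
    "MUTAG", "NCI1", "PTC", "DD", "IMDB-Binary",
    "AIDS", "FRANKENSTEIN", "FOPPA",
)


def one_dataset_command(dataset, script="OneDataset.sh"):
    """Build the command line for one dataset, rejecting unknown names early."""
    if dataset not in KNOWN_DATASETS:
        raise ValueError(f"Unknown dataset {dataset!r}; expected one of {KNOWN_DATASETS}")
    return ["bash", script, dataset]


def run_datasets(datasets, script="OneDataset.sh"):
    """Run the script once per dataset, stopping at the first failure."""
    for dataset in datasets:
        subprocess.run(one_dataset_command(dataset, script), check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_datasets(...) to actually execute them.
    for name in ("MUTAG", "PTC"):
        print(" ".join(one_dataset_command(name)))
```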
Dependencies
Tested with Python version 3.12.2 and the following packages:
- pandas: version 2.2.1
- numpy: version 1.26.4
- networkx: version 3.2.1
- sklearn: version 1.2.2
- matplotlib: version 3.8.0
- tqdm: version 4.66.4
- rbo: version 0.1.3
- shap: version 0.45.0
- xgboost: version 2.1.0
- scipy: version 1.11.4
Tested with SPMF version 2.62, which implements gSpan [Y'02] (to mine frequent patterns).
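To check whether a local environment matches the Python package versions listed above, something like the following sketch can be used. It relies only on the standard library (importlib.metadata). One assumption to note: the package imported as sklearn is looked up under its distribution name, scikit-learn. The helper names are illustrative and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions this project was tested with, keyed by distribution name.
TESTED_VERSIONS = {
    "pandas": "2.2.1",
    "numpy": "1.26.4",
    "networkx": "3.2.1",
    "scikit-learn": "1.2.2",  # imported as sklearn
    "matplotlib": "3.8.0",
    "tqdm": "4.66.4",
    "rbo": "0.1.3",
    "shap": "0.45.0",
    "xgboost": "2.1.0",
    "scipy": "1.11.4",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def compare_versions(tested, lookup=installed_version):
    """Return {package: (tested, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, expected in tested.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in compare_versions(TESTED_VERSIONS).items():
        print(f"{name}: tested with {expected}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean the scripts will fail; it only signals a difference from the tested configuration.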
- [D'91] A. S. Debnath, R. L. Lopez, G. Debnath, A. Shusterman, C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity, Journal of Medicinal Chemistry 34(2):786–797, 1991. DOI: 10.1021/jm00106a046
- [D'03] P. D. Dobson, A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments, Journal of Molecular Biology 330(4):771–783, 2003. DOI: 10.1016/S0022-2836(03)00628-4
- [H'14] M. Houbraken, S. Demeyer, T. Michoel, P. Audenaert, D. Colle, M. Pickavet. The Index-Based Subgraph Matching Algorithm with General Symmetries (ISMAGS): Exploiting Symmetry for Faster Subgraph Enumeration, PLoS ONE 9(5):e97896, 2014. DOI: 10.1371/journal.pone.0097896
- [O'15] F. Orsini, P. Frasconi, L. De Raedt. Graph invariant kernels, 24th International Conference on Artificial Intelligence, pp. 3756–3762, 2015. DOI: 10.5555/2832747.2832773
- [P'23b] L. Potin, V. Labatut, P. H. Morand, C. Largeron. FOPPA: An Open Database of French Public Procurement Award Notices From 2010–2020, Scientific Data 10:303, 2023. DOI: 10.1038/s41597-023-02213-z
- [T'03] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, C. Helma. Statistical evaluation of the predictive toxicology challenge 2000-2001, Bioinformatics 19(10):1183–1193, 2003. DOI: 10.1093/bioinformatics/btg130
- [W'06] N. Wale, G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification, 6th International Conference on Data Mining, pp. 678–689, 2006. DOI: 10.1007/s10115-007-0103-5
- [Y'02] X. Yan, J. Han. gSpan: Graph-based substructure pattern mining, IEEE International Conference on Data Mining, pp. 721–724, 2002. DOI: 10.1109/ICDM.2002.1184038
- [Y'15] P. Yanardag, S.V.N. Vishwanathan. Deep Graph Kernels, 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374, 2015. DOI: 10.1145/2783258.2783417