Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R

This is the repository for the paper "Assessment of projection pursuit index for classifying high dimension low sample size data in R" by Zhaoxing Wu and Chunming Zhang.

./data contains all data used in the paper.
./output contains all plots used in the paper.
./script includes all the code.
- ./script/fun.R: helper functions, including S(), plot_test_train(), acc(), and cross_validation(). Section 2 of the paper explains these functions in detail.

Keywords

Large p Small n; Linear Discriminant Analysis; Penalized Discriminant Analysis; Supervised Classification; SVM

Citation

@article{wu_zhang_2023,
    author = {Zhaoxing Wu and Chunming Zhang},
    title = {Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R},
    journal = {Journal of Data Science},
    volume = {21},
    number = {2},
    year = {2023},
    pages = {310--332},
    doi = {10.6339/23-JDS1096},
    issn = {1680-743X},
    publisher = {School of Statistics, Renmin University of China}
}

To reproduce the experiments in the paper

Simulation study (section 3: simulation evaluation): Run ./script/simulated_example.Rmd to simulate datasets under 4 different conditions. The resulting datasets include ./data/1_perc_imp_var.csv, ./data/2_ratio_dim_obs.csv, ./data/3_num_classes.csv, ./data/4_outliers_imp.csv, and ./data/4_outliers.csv. The code that plots these datasets is in ./script/simulated_example_plot.Rmd.

Microarray data analysis (section 4.1: microarray data): Run ./script/datamicroarray.Rmd to analyze the microarray datasets. The paper analyzes several microarray datasets, all of which are available in the R package datamicroarray. To reproduce the results for a given dataset, manually change the code to load that dataset (instructions are in the code comments).

Music data analysis (section 4.2: music data): Run ./script/extract_music_features.py to extract features from the music clips and generate ./data/music.csv. The music clips are in ./data/processed_music/*. Then run ./script/music.Rmd to analyze the music dataset.

Non-Gaussian distributed data analysis (section 5: discussion and conclusion): Run the last chunk of code in ./script/simulated_example.Rmd. The mean performance of each model is printed to the screen.
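
The .Rmd scripts listed above can be opened and run interactively, or rendered from an R session. The following is a minimal sketch, assuming the rmarkdown package is installed and the working directory is the repository root. The helper render_script() is hypothetical and is not part of this repository.

```r
# Hypothetical helper (not part of this repository): render one of the
# .Rmd scripts if it exists, and report whether rendering was attempted.
render_script = function(path) {
  if (!file.exists(path)) {
    message("Script not found: ", path)
    return(invisible(FALSE))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Package 'rmarkdown' is required to render ", path)
  }
  rmarkdown::render(path)  # knits the file, running every code chunk
  invisible(TRUE)
}

# Assumes the working directory is the repository root
render_script("./script/simulated_example.Rmd")
```

Rendering runs every chunk in the file, so for the non-Gaussian analysis it may be quicker to open the file and run only the last chunk.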

Examples

The following code loads the leukemia dataset and splits it into training and test sets.

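library(classPP)      # projection pursuit indices and optimizers
library(cancerclass)  # provides the GOLUB1 leukemia data
source("fun.R")       # helper functions; run from ./script or adjust the path

data("GOLUB1") # leukemia data
# Standardize the expression matrix, then transpose so that rows are samples
df = as.data.frame(t(scale(GOLUB1@assayData[["exprs"]])))
cls = GOLUB1@phenoData@data[["class"]]
ALL = GOLUB1@phenoData@data[["type"]]
# Combine class and type into one label per sample
class = trimws(paste(cls, ALL))
n = nrow(df)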

# Hold out a random quarter of the samples as the test set
ind = sample(1:n, floor(n/4), replace=FALSE)
test = df[ind,]
train = df[-ind,]
cls_test = class[ind]
cls_train = class[-ind]
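
The split above is drawn without a fixed seed and without regard to class, so it changes between runs and a small class can end up missing from the test set. The sketch below shows one way to make the split reproducible and stratified by class. It uses toy data rather than GOLUB1, and the names toy_df, toy_class, and test_ind are illustrative assumptions, not part of the repository.

```r
# Toy stand-in for the data above: 20 samples, 5 features, 2 classes
set.seed(1)  # fix the seed so the same split is drawn on every run
toy_df = as.data.frame(matrix(rnorm(100), nrow = 20))
toy_class = rep(c("A", "B"), each = 10)

# Within each class, draw about a quarter of the sample indices (at least one)
test_ind = unlist(
  lapply(split(seq_along(toy_class), toy_class), function(idx) {
    idx[sample.int(length(idx), max(1, floor(length(idx) / 4)))]
  }),
  use.names = FALSE
)

toy_test = toy_df[test_ind, ]
toy_train = toy_df[-test_ind, ]
toy_cls_test = toy_class[test_ind]
toy_cls_train = toy_class[-test_ind]

table(toy_cls_test)  # every class is represented in the test set
```

With only a few samples per class, as is typical for high dimension low sample size data, stratifying keeps every class in both the training and test sets.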

Project leukemia data into 2 dimensions using the PDA index and plot the projected training and test sets for comparison.

PP.opt = PP.optimize.anneal("PDA", 2, train, cls_train, lambda = 0.6)
proj.data.test = as.matrix(test)%*%PP.opt$proj.best
proj.data.train = as.matrix(train)%*%PP.opt$proj.best
plot_test_train(proj.data.test, proj.data.train, cls_test, cls_train,
               levels(as.factor(class)))

Get the accuracy score for the PDA index on the leukemia data.

acc(PP.opt$proj.best, t(train), t(test), cls_train, cls_test)
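
The helper acc() is defined in ./script/fun.R and described in Section 2 of the paper; its implementation is not reproduced here. Purely as an illustration of what scoring a projection can look like, the sketch below classifies projected test points by the nearest class centroid of the projected training points and reports the fraction classified correctly. The function nearest_centroid_accuracy() is hypothetical and may differ from how acc() works.

```r
# Hypothetical illustration, not the repository's acc(): nearest-centroid
# classification in the projected space, followed by the accuracy score.
nearest_centroid_accuracy = function(proj_train, proj_test, cls_train, cls_test) {
  proj_train = as.matrix(proj_train)
  proj_test = as.matrix(proj_test)
  labels = sort(unique(as.character(cls_train)))
  # One row per class: mean of that class's projected training points
  centroids = do.call(rbind, lapply(labels, function(l) {
    colMeans(proj_train[cls_train == l, , drop = FALSE])
  }))
  # Squared Euclidean distance from every test point to every centroid
  d = sapply(seq_along(labels), function(k) {
    rowSums(sweep(proj_test, 2, centroids[k, ])^2)
  })
  d = matrix(d, nrow = nrow(proj_test))
  # Predict the class whose centroid is closest
  pred = labels[max.col(-d, ties.method = "first")]
  mean(pred == as.character(cls_test))
}

# Toy projected data: two well-separated classes in 2 dimensions
toy_train = rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
toy_test = rbind(c(1, 0), c(9, 10))
nearest_centroid_accuracy(toy_train, toy_test,
                          c("A", "A", "B", "B"), c("A", "B"))
```

In the leukemia example, proj.data.train and proj.data.test with cls_train and cls_test could be passed to the same function.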

Hyperparameter selection for the PDA index:

S(as.matrix(df), class, "PDA", 2, seq(0, 0.99, 0.01))
