IBMPredictiveAnalytics/STATS_BORUTAFEATURES

STATS BORUTAFEATURES Extension Command

This procedure finds important predictive features (predictive variables) in a dataset for a scale or categorical dependent variable using the Boruta algorithm. It iteratively compares the importances of attributes with the importances of shadow attributes created by shuffling the case values of the original ones. Attributes whose importance is significantly worse than that of the shadows are successively dropped. Attributes that are significantly better than the shadows are classified as Confirmed.

This procedure does not provide coefficient estimates or significance levels. As with any empirical variable selection algorithm, the results are subject to overfitting, so checking them against a holdout sample or using other external validation would be wise.

STATS BORUTAFEATURES
DEPVAR = dependent variable*
INDVARS = candidate independent variables*

/OPTIONS
BONFERRONI= YES** or NO
PVALUE = significance level
MAXRUNS = maximum number of importance runs
PLOT = YES** or NO

/HELP
STATS BORUTAFEATURES /HELP displays this information and does nothing else.

* Required
** Default


STATS BORUTAFEATURES DEPVAR=salary INDVARS=AGE EDUC JOBTIME
/OPTIONS BONFERRONI=NO.

Details

Case weights and SPLIT FILE are not supported by this procedure.

DEPVAR specifies a scale or categorical dependent variable.

INDVARS specifies one or more scale or categorical independent variables.

Options

BONFERRONI specifies whether or not to apply a Bonferroni multiple testing correction. The default is YES.

PVALUE specifies the significance level (the p-value threshold) used for categorizing the variables as Confirmed, Rejected, or Tentative. The default value is .01.

MAXRUNS specifies the maximum number of importance runs. Increasing it may resolve variables classified as Tentative. The default value is 100.

PLOT specifies whether or not to display a boxplot of the importances of the independent variables over the runs. The default is YES.

The Algorithm

Adapted from the Miron B. Kursa reference listed under References.

In each iteration, shadows are generated by randomly shuffling the cases for each variable, and the extended dataset is fed to an importance provider (a random forest algorithm). The importance of each original feature, based on Z scores, is then compared with the highest importance among the shadows, and features that score higher are given a hit. The accumulated hit counts are then assessed: features that significantly outperform the best shadow are declared Confirmed, while those that significantly underperform the best shadow are declared Rejected and removed from the set for all subsequent iterations.

Iterating this process, the algorithm counts the number of times each feature does better than the random ones and ranks the features' importance based on this count.
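The shadow-and-hit bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the code used by this extension or by the R Boruta package: the importance measure here is the absolute Pearson correlation, swapped in for the random forest Z scores so that the example needs no external libraries, and rejected features are not removed between runs. The names abs_correlation and count_hits and the toy data are hypothetical.

```python
import random

def abs_correlation(xs, ys):
    """Stand-in importance: absolute Pearson correlation (NOT a random forest)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def count_hits(features, target, runs, rng):
    """features maps a name to a list of values; returns a name -> hit count dict."""
    hits = {name: 0 for name in features}
    for _ in range(runs):
        # A shadow is a copy of a feature with its case values shuffled,
        # which destroys any real relationship with the target.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_correlation(shadow, target))
        best_shadow = max(shadow_scores)
        # A feature earns a hit when it beats the best shadow in this run.
        for name, values in features.items():
            if abs_correlation(values, target) > best_shadow:
                hits[name] += 1
    return hits

# Toy data: one feature drives the target, the other is pure noise.
rng = random.Random(0)
n = 200
signal = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]
target = [2 * s + rng.gauss(0, 0.5) for s in signal]

hits = count_hits({"signal": signal, "noise": noise}, target, runs=50, rng=rng)
print(hits)
```

With data like this, the informative feature should collect a hit in essentially every run, while the noise feature's count depends on how it happens to compare with the shuffled copies.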

A variable may be classified as important (Confirmed), unimportant (Rejected), or Tentative. A table of the decisions is displayed.

The algorithm stops when only Confirmed attributes are left or when it reaches MAXRUNS importance runs. If the second scenario occurs, some attributes may be left without a decision. They are classified as Tentative. You can increase MAXRUNS or lower PVALUE to try to resolve them.
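As a rough illustration of how accumulated hit counts could turn into the three decisions, the sketch below applies a two-sided binomial check to each count. It rests on assumptions that are not taken from this extension's code: a feature is assumed to beat the best shadow with probability 0.5 under the null hypothesis, and the Bonferroni correction is assumed to multiply each p-value by the number of features. The function names and hit counts are hypothetical; only the variable names come from the example above.

```python
from math import comb

def binom_upper_tail(hits, runs):
    """P(X >= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(hits, runs + 1)) / 2 ** runs

def binom_lower_tail(hits, runs):
    """P(X <= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(0, hits + 1)) / 2 ** runs

def classify(hit_counts, runs, p_value=0.01, bonferroni=True):
    """Label each feature Confirmed, Rejected, or Tentative from its hit count."""
    # Assumed form of the correction: one test per feature.
    m = len(hit_counts) if bonferroni else 1
    decisions = {}
    for name, hits in hit_counts.items():
        if min(1.0, binom_upper_tail(hits, runs) * m) < p_value:
            decisions[name] = "Confirmed"   # significantly more hits than chance
        elif min(1.0, binom_lower_tail(hits, runs) * m) < p_value:
            decisions[name] = "Rejected"    # significantly fewer hits than chance
        else:
            decisions[name] = "Tentative"   # not enough evidence either way
    return decisions

# Hypothetical hit counts after 20 runs.
decisions = classify({"AGE": 20, "EDUC": 1, "JOBTIME": 11}, runs=20)
print(decisions)
# {'AGE': 'Confirmed', 'EDUC': 'Rejected', 'JOBTIME': 'Tentative'}
```

A feature with a hit count near half the number of runs stays Tentative in this sketch until more runs push its count far enough from chance in one direction.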

Warning: This algorithm can require considerable time to execute when the number of cases is large (more than about 2000) or the number of variables is large.

Note that Boruta performs a sharp classification of features rather than an ordering, in contrast to many other feature selection methods. The other substantial difference is that Boruta is an all-relevant method and hence aims to find all features connected with the decision.

References

This extension command is based on the R Boruta package by Miron Bartosz Kursa and Witold Remigiusz Rudnicki.

Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1-13. doi:10.18637/jss.v036.i11

Feature Selection in R with the Boruta R Package, https://www.datacamp.com/tutorial/feature-selection-R-boruta

In a Hurry Vignette

© Copyright Jon K Peck 2025
