jq-11/PCA_nais

PCA (Unsupervised ML) Project

This is a repo of the Principal Component Analysis (PCA) project I did under the National Agroclimate Information Service (NAIS), Agriculture and Agri-Food Canada (AAFC/AAC).

Explanation

The goal of this project was to investigate and showcase the efficacy of using an unsupervised machine learning technique (such as PCA) to help predict drought categories in Canada.

Note that these files were made to inform future applications of PCA for the CDS (Canadian Drought Signal). Hence, comments and explanations may be specific to the CDS and to those currently working on that project.

  • The PowerPoint presentation, "Principal Component Analysis (PCA).pptx", explains PCA, the underlying math, the supervised ML techniques that can take the outputs of PCA as inputs, and some statistics for evaluating the efficacy of PCA under those models. Please download it to view. Note that the image and source citations are only hyperlinks to the original sources; proper citations are not currently included.

  • dev_scripts/... contains the starter code files, where I was testing out tutorials and code leading up to demonstration.ipynb and official.py.

  • demonstration.ipynb is a Jupyter Notebook that explains the Python code and the process of implementing PCA and testing its outputs as inputs to a logistic regression model and a random forest model. The output is visible below each code snippet.

The code currently reads the Canadian Drought Signal (CDS) training data, which is not public. If you wish to run it, load any data source with at least 3 variables. (You may want to redefine the workplace variable and the related data frame calls so that your data loads properly for its file type. Additionally, comment out import cds_utilities as cds.) The training data contained roughly 2.8 million data points, each with 24 variables.

  • official.py is the script equivalent of demonstration.ipynb. It is written for those more experienced with Python and PCA, so it has less explanation and more technical wording. To reduce runtime, official.py omits the random forest model, which takes roughly 25 minutes to run.
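
For readers without access to the CDS data, the workflow described above (PCA, then a logistic regression on the resulting components) can be sketched roughly as follows. This is a minimal illustration and not the code in demonstration.ipynb or official.py: it assumes scikit-learn is installed, uses a synthetic 24-variable dataset as a stand-in for the training data, and the three classes and the 95% variance threshold are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the non-public CDS training data (assumption):
# 2,000 rows, 24 variables, 3 hypothetical classes.
X, y = make_classification(
    n_samples=2000,
    n_features=24,
    n_informative=8,
    n_redundant=8,
    n_classes=3,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize the variables, keep enough principal components to explain
# 95% of the variance (illustrative threshold), then fit the classifier.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

pca = model.named_steps["pca"]
n_kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
accuracy = model.score(X_test, y_test)

print(f"components kept: {n_kept} of {X.shape[1]}")
print(f"variance explained: {explained:.3f}")
print(f"test accuracy: {accuracy:.3f}")
```

Replacing LogisticRegression with sklearn.ensemble.RandomForestClassifier in the same pipeline gives the random forest variant, at the cost of a longer runtime on large datasets.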

About

The official demo, script and starter code for implementing Principal Component Analysis (PCA) on Canadian Drought Signal (CDS) training data.
