Complete implementation of Partial Least Squares Discriminant Analysis (PLS-DA) for tandem mass spectrometry data analysis.
This repository provides a comprehensive, production-ready implementation of PLS-DA designed specifically for analyzing MS/MS (tandem mass spectrometry) data acquired in IDA (Information Dependent Acquisition) mode. It is aimed at metabolomics, proteomics, and biomarker discovery workflows.
- Complete PLS-DA implementation using NIPALS algorithm
- Zero dependencies - uses only base R
- Cross-validation for optimal model selection
- Variable Importance (VIP) scores for biomarker discovery
- 7 publication-ready visualizations
- 4 detailed CSV exports for further analysis
- Comprehensive documentation with examples
- Example dataset included for immediate testing
# Clone the repository
git clone https://github.com/[your-username]/PLSDA-MSMS-Analysis.git
cd PLSDA-MSMS-Analysis
# In R console
source("plsda_analysis.R")
# Or from command line
Rscript plsda_analysis.R
That's it! The script will:
- Generate example MS/MS data (150 samples, 50 features, 3 classes)
- Perform PLS-DA analysis with cross-validation
- Create 7 plots and 4 data files
- Display comprehensive results summary
Runtime: ~5-10 seconds for example data
The analysis automatically generates seven plots (the first seven items below) and four CSV files (the last four):
- Scores Plot - Sample clustering and group separation
- Test Predictions - Model validation on held-out data
- VIP Scores - Top 20 most important features
- Loadings Plot - Feature contribution to separation
- CV Results - Model optimization curve
- Confusion Matrix - Classification performance heatmap
- Variance Explained - Component importance
- Predictions - Sample classifications with scores
- VIP Scores - All features ranked by importance
- Model Summary - Performance metrics (accuracy, R²X, R²Y)
- Loadings - Feature contributions to each component
- QUICK_START.md - Get running in 5 minutes
- Sections 1-5 of README_FULL.md - Basic concepts and usage
- README_FULL.md - Complete 2000+ line documentation including:
- Statistical theory and mathematical background
- Using your own data (step-by-step guide)
- Parameter tuning and optimization
- Troubleshooting common issues
- Best practices for publication
- Advanced topics (permutation tests, bootstrap, etc.)
Perfect for:
- Metabolomics - Identify metabolite biomarkers between groups
- Proteomics - Discover differentially expressed proteins
- Lipidomics - Classify samples based on lipid profiles
- Biomarker Discovery - Rank features by discriminative power
- Quality Control - Validate analytical methods
- Method Development - Optimize sample preparation protocols
| Feature | Benefit |
|---|---|
| Handles high dimensionality | Works with 1000s of m/z features |
| Manages multicollinearity | Tolerates correlated features (common in MS data) |
| Supervised classification | Uses group labels for maximum separation |
| Interpretable results | VIP scores, loadings, and scores plots |
| Robust to noise | Dimensionality reduction filters noise |
| No external dependencies | Pure R implementation |
- PCA is unsupervised (ignores group labels)
- PLS-DA maximizes separation between known groups
- PLS-DA provides biomarker rankings (VIP scores)
- PLS-DA is designed for classification tasks
- R version ≥ 4.0.0
- No additional packages required!
- Memory: Minimum 4GB RAM (8GB recommended)
- Storage: ~10MB for outputs
This implementation uses:
- NIPALS algorithm for PLS component extraction
- Stratified train-test split (70-30 default)
- K-fold cross-validation for component optimization
- VIP scores for feature importance ranking
- Dummy matrix encoding for multi-class problems
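Two of the steps listed above, dummy matrix encoding and the stratified 70-30 split, can be sketched in base R as follows. This is a minimal illustration, not the code from `plsda_analysis.R`: the function names `make_dummy` and `stratified_split`, the seed, and the class names are assumptions chosen for the example.

```r
# Hypothetical helpers for illustration; not taken from plsda_analysis.R.

# Dummy (one-hot) encoding: one column per class, 1 where the sample belongs.
make_dummy <- function(labels) {
  labels <- factor(labels)
  Y <- outer(as.character(labels), levels(labels), "==") * 1
  colnames(Y) <- levels(labels)
  Y
}

# Stratified split: sample the same fraction from every class.
stratified_split <- function(labels, train_frac = 0.7, seed = 42) {
  set.seed(seed)
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- round(length(idx) * train_frac)
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       test  = setdiff(seq_along(labels), train_idx))
}

# Example: 150 samples in 3 balanced classes (class names are made up).
labels <- factor(rep(c("Control", "TreatmentA", "TreatmentB"), each = 50))
Y <- make_dummy(labels)
split_idx <- stratified_split(labels)
table(labels[split_idx$train])  # class counts in the training set
```

Splitting within each class keeps the class proportions the same in the training and test sets, which matters most when groups are small or unbalanced.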
See README_FULL.md Section 13 for complete statistical background.
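For readers who want a feel for the NIPALS step before opening the full documentation, here is a compact sketch of the textbook NIPALS iteration for PLS with a multi-column response. It is an illustration under stated assumptions, not the script's actual code: the function name `nipals_pls`, the tolerance, the iteration cap, and the random example data are all made up for this sketch, and only mean-centering is applied.

```r
# Textbook NIPALS for PLS2 (illustrative; names and defaults are assumptions).
nipals_pls <- function(X, Y, ncomp = 2, tol = 1e-10, max_iter = 500) {
  X <- scale(X, center = TRUE, scale = FALSE)  # mean-center only
  Y <- scale(Y, center = TRUE, scale = FALSE)
  n <- nrow(X); p <- ncol(X); m <- ncol(Y)
  Tm <- matrix(0, n, ncomp); W <- matrix(0, p, ncomp)
  P  <- matrix(0, p, ncomp); Q <- matrix(0, m, ncomp)
  for (a in seq_len(ncomp)) {
    # Start from the Y column with the largest sum of squares.
    u <- Y[, which.max(colSums(Y^2)), drop = FALSE]
    t_old <- rep(0, n)
    for (iter in seq_len(max_iter)) {
      w <- crossprod(X, u)
      w <- w / sqrt(sum(w^2))                  # unit-length X weights
      t_new <- X %*% w                         # X scores
      q <- crossprod(Y, t_new) / sum(t_new^2)  # Y loadings
      u <- Y %*% q / sum(q^2)                  # Y scores
      if (sum((t_new - t_old)^2) < tol * sum(t_new^2)) break
      t_old <- t_new
    }
    p_a <- crossprod(X, t_new) / sum(t_new^2)  # X loadings
    X <- X - t_new %*% t(p_a)                  # deflate X
    Y <- Y - t_new %*% t(q)                    # deflate Y
    Tm[, a] <- t_new; W[, a] <- w; P[, a] <- p_a; Q[, a] <- q
  }
  list(scores = Tm, weights = W, x_loadings = P, y_loadings = Q)
}

# Example on random data: 30 samples, 5 features, 3 classes.
set.seed(1)
X <- matrix(rnorm(30 * 5), nrow = 30)
cls <- factor(rep(c("A", "B", "C"), each = 10))
Y <- outer(as.character(cls), levels(cls), "==") * 1
fit <- nipals_pls(X, Y, ncomp = 2)
crossprod(fit$scores)  # off-diagonal entries are numerically zero
```

Because X is deflated after every component, the score vectors come out mutually orthogonal, which is a quick sanity check for any NIPALS implementation.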
Your CSV should have this structure:
SampleID,Class,Batch,mz_200.00,mz_250.00,mz_300.00,...
S001,Control,1,1234.5,2345.6,3456.7,...
S002,Treatment,1,5678.9,6789.0,7890.1,...
See README_FULL.md Section 9 for detailed integration guide.
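Before running the analysis on your own file, it can help to confirm that it matches the layout above. The check below is a rough sketch: `check_msms_csv` is a hypothetical helper (it is not part of the script), and treating `SampleID` and `Class` as required while leaving `Batch` optional is an assumption.

```r
# Hypothetical layout check; not part of plsda_analysis.R.
check_msms_csv <- function(df) {
  required <- c("SampleID", "Class")  # assumed to be mandatory
  missing_cols <- setdiff(required, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  feature_cols <- grep("^mz_", colnames(df), value = TRUE)
  if (length(feature_cols) == 0) {
    stop("No feature columns starting with 'mz_' were found")
  }
  is_num <- vapply(df[feature_cols], is.numeric, logical(1))
  if (!all(is_num)) {
    stop("Non-numeric feature column(s): ",
         paste(feature_cols[!is_num], collapse = ", "))
  }
  if (anyDuplicated(df$SampleID) > 0) stop("Duplicate SampleID values")
  list(n_samples    = nrow(df),
       n_features   = length(feature_cols),
       class_counts = table(df$Class))
}

# Example using the two rows shown above (without the elided columns).
csv_text <- "SampleID,Class,Batch,mz_200.00,mz_250.00,mz_300.00
S001,Control,1,1234.5,2345.6,3456.7
S002,Treatment,1,5678.9,6789.0,7890.1"
example <- read.csv(text = csv_text, check.names = FALSE)
info <- check_msms_csv(example)
info$class_counts
```

For a real file, replace the `text = csv_text` argument with the path to your CSV.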
Quick integration:
# Load your data
my_data <- read.csv("your_msms_data.csv")
# Extract features and labels
feature_cols <- grep("^mz_", colnames(my_data), value = TRUE)
features <- as.matrix(my_data[, feature_cols])
labels <- factor(my_data$Class)
# Continue with script from Section 2 (Preprocessing)
Contributions are welcome! Areas for improvement:
- Additional preprocessing methods
- Support for more data formats
- Integration with pathway analysis tools
- Additional visualization options
- Performance optimizations
Please open an issue or submit a pull request.
If you use this code in your research, please cite:
This repository:
AyehBlk. (2025). PLS-DA Analysis for MS/MS IDA Data.
GitHub: https://github.com/AyehBlk/PLSDA-MSMS-Analysis
Original PLS-DA method:
Barker, M., & Rayens, W. (2003). Partial least squares for discrimination.
Journal of Chemometrics, 17(3), 166-173.
NIPALS algorithm:
Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression:
a basic tool of chemometrics. Chemometrics and Intelligent Laboratory
Systems, 58(2), 109-130.
This project is licensed under the MIT License - see the LICENSE file for details.
You are free to:
- Use for academic research
- Use for commercial projects
- Modify and distribute
- Include in your own projects
Just include the license and attribution!
Problem: Low accuracy (<70%)
- Check if classes are actually separable (biological question)
- Ensure sufficient samples (minimum 20 per class recommended)
- Try different preprocessing methods
Problem: Error "Singular matrix"
- Features are too highly correlated
- Remove features with correlation >0.95
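One way to apply the 0.95 correlation cutoff is a simple greedy filter that keeps a feature only if it is not too strongly correlated with any feature already kept. The sketch below is one possible approach, not the script's own method; `drop_correlated` is a hypothetical name, and the example data are simulated.

```r
# Hypothetical greedy correlation filter; not part of plsda_analysis.R.
# Assumes no constant (zero-variance) columns, for which cor() returns NA.
drop_correlated <- function(features, cutoff = 0.95) {
  cors <- abs(cor(features))
  keep <- integer(0)
  for (j in seq_len(ncol(features))) {
    # Keep column j only if it is below the cutoff against all kept columns.
    if (length(keep) == 0 || all(cors[j, keep] <= cutoff)) {
      keep <- c(keep, j)
    }
  }
  features[, keep, drop = FALSE]
}

# Example: the second feature is a near-copy of the first.
set.seed(1)
a <- rnorm(40)
features <- cbind(a, a + rnorm(40, sd = 0.01), rnorm(40))
colnames(features) <- c("mz_200.00", "mz_250.00", "mz_300.00")
filtered <- drop_correlated(features)
colnames(filtered)
```

The result depends on column order, since the first feature of each correlated group is the one that survives. If some features matter more than others, order the columns accordingly before filtering.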
Problem: All VIP scores <1
- May need fewer components
- Check if preprocessing is appropriate
- Verify classes are actually different
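When diagnosing VIP problems, it may help to recompute the scores independently. The sketch below uses one common VIP definition, which weights each component's normalized squared X weights by the Y variation that component explains. Whether the script uses exactly this formula is an assumption; `vip_scores` is a hypothetical name, and the input matrices are random placeholders rather than real model output.

```r
# One common VIP definition (illustrative; not taken from plsda_analysis.R).
# W:  features x components matrix of X weights
# Tm: samples  x components matrix of X scores
# Q:  classes  x components matrix of Y loadings
vip_scores <- function(W, Tm, Q) {
  p <- nrow(W)
  # Y variation explained by each component.
  ssy <- colSums(Tm^2) * colSums(Q^2)
  # Squared weights, normalized so each component's column sums to 1.
  W_norm <- sweep(W^2, 2, colSums(W^2), "/")
  sqrt(p * as.vector(W_norm %*% ssy) / sum(ssy))
}

# Example with placeholder matrices: 5 features, 20 samples, 3 classes, 2 components.
set.seed(1)
W  <- matrix(rnorm(5 * 2), nrow = 5)
Tm <- matrix(rnorm(20 * 2), nrow = 20)
Q  <- matrix(rnorm(3 * 2), nrow = 3)
vip <- vip_scores(W, Tm, Q)
mean(vip^2)  # equals 1 up to floating-point error, by construction
```

Under this definition the squared VIP scores always average to 1, so at least one feature reaches a score of 1 or more. If every score from a run is below 1, the weight normalization in the VIP calculation is worth checking first.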
See README_FULL.md Section 12 for complete troubleshooting guide.
- Check the comprehensive documentation
- Read the FAQ
- Open an issue for bugs
- Start a discussion for questions
If you find this useful, please consider giving it a star! ⭐
- MetaboAnalyst - Web-based metabolomics analysis
- KEGG - Pathway database
- mixOmics R package - Extended multivariate methods
- ropls Bioconductor - Alternative PLS-DA implementation
Ayeh Bolouki
- GitHub: @AyehBlk
- Role: Computational Biologist / Bioinformatician
Active Development - Maintained and open to contributions
Current Version: 1.0
Last Updated: October 2025
Made with ❤️ - Let's make free science for everybody around the world.
If this helped your research, consider citing it in your publications!