
EPFL_AMR

Research code for machine learning on photonic crystal optical transmission signals for antimicrobial sensing and bacterial characterization.

Project overview

This project explores a simple but important scientific question: when a bacterium is trapped on a photonic chip, the device records how the optical transmission through the chip changes over time. The core hypothesis behind this work is that this temporal optical signal contains a measurable signature of the bacterium's biological state.

More importantly, if successful, this approach would offer a new way of fighting antimicrobial resistance, which the WHO lists among its top ten global health threats: https://www.who.int/news-room/fact-sheets/detail/antimicrobial-resistance

To test that idea, we used machine learning in two complementary ways. In one setting, each signal was converted into a tabular representation using descriptive statistics of the time series, such as means, variances, quantiles, skewness, spectral summaries, and hand-engineered features. In the other setting, the signal was treated directly as a raw time series and modeled with dedicated sequence classifiers.

The repository therefore benchmarks both classical tabular methods and specialized time-series models on several related tasks:

  • predicting whether bacteria are alive or dead from optical transmission measurements,
  • studying antibiotic response across concentration levels,
  • classifying Gram type, bacterial shape, and strain identity.

More broadly, this work was part of an effort to test whether photonic transmission traces can support automated biological sensing, with the longer-term goal of enabling larger-scale optical trapping datasets and more specialized models.

What is in this repository

This repository is a research-code snapshot rather than a polished software package. It contains the main preprocessing and modeling scripts used during the project, grouped by experimental setting.

1. AMP_measurements/

This folder contains code for experiments on antibiotic-level response using engineered features extracted from segmented transmission signals.

Main pieces:

  • data_processing.py merges analyzed documents with raw measurement documents, normalizes transmission traces, derives antibiotic quantities, and assigns known labels when possible.
  • feature_extraction.py computes descriptive features from segmented signal states such as OFF, ON, trapping, and on-to-trapping.
  • main.py builds a tabular pipeline for alive/dead prediction across antibiotic levels. It performs:
    • feature preparation,
    • train/validation/test splitting stratified by antibiotic concentration,
    • imputation and standardization,
    • synthetic augmentation with a simple GAN,
    • class balancing with SMOTE,
    • a dense neural network used as a lightweight transformer-style classifier,
    • a soft-voting ensemble combining KNN, SVM, and XGBoost.

In simple terms, this is the “feature-based antibiotic response” branch of the project.
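A minimal sketch of such a tabular pipeline on synthetic data, with sklearn's GradientBoostingClassifier standing in for XGBoost and the GAN/SMOTE augmentation steps omitted to keep it self-contained (variable names here are illustrative, not the repo's actual API):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                  # stand-in for engineered signal features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic alive/dead labels
X[rng.random(X.shape) < 0.05] = np.nan          # simulate missing feature values

# soft-voting ensemble over KNN, SVM, and a gradient-boosted tree model
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svm", SVC(probability=True)),          # probability=True enables soft voting
        ("gbt", GradientBoostingClassifier()),   # stand-in for XGBoost
    ],
    voting="soft",
)
# imputation and standardization happen inside the pipeline, before the ensemble
clf = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), ensemble)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Wrapping imputation and scaling in the pipeline ensures they are fit on the training split only, which matters for the stratified evaluation described above.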

2. VIABILITY_measurements/

This folder focuses on alive/dead prediction and on transferring a model trained on viability data to AMP measurements.

Main pieces:

  • data_processing_VIAB.py extracts tabular features directly from the analyzed viability documents and creates the binary dead target.
  • data_processing_AMP.py processes the AMP dataset into a compatible tabular feature format so that the model trained on viability data can be applied to it.
  • main.py trains an XGBoost classifier on a selected feature subset, uses SMOTE-based augmentation, tunes hyperparameters with Optuna, evaluates the classifier, and then predicts dead/alive status on the AMP dataset before plotting predicted outcomes by antibiotic concentration.
  • Mrock.py is the main time-series benchmarking script in this folder. It compares raw-signal classifiers including:
    • MultiRocket + XGBoost,
    • InceptionTime,
    • HIVE-COTE 2.0.

In simple terms, this folder asks:

  1. can a model learn alive vs. dead from transmission data,
  2. and can that signal-level knowledge transfer to antibiotic-response experiments?
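Several of these scripts rebalance classes with imblearn's SMOTE before training. As an illustration of the underlying idea, here is a numpy-only sketch of SMOTE-style interpolation (the function name `smote_like` is ours, not from the repo):

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating each seed sample
    toward a random one of its k nearest minority neighbors -- the core
    idea behind SMOTE."""
    rng = np.random.default_rng(rng)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest neighbors per sample
    base = rng.integers(0, len(X_min), n_new)      # random seed samples
    neigh = nn[base, rng.integers(0, k, n_new)]    # random neighbor per seed
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

minority = np.random.default_rng(1).normal(loc=2.0, size=(20, 4))
synthetic = smote_like(minority, n_new=30, rng=1)
```

Because every synthetic point lies on a segment between two real minority points, the augmented class stays inside the original feature range rather than drifting into unseen regions.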

3. GRAM_measurements/

This is the largest and most exploratory folder. It studies Gram type, cell shape, and bacterial strain classification from transmission signals.

Main pieces:

  • data_processing.py handles normalization, stratified splitting by bacteria family, chunking of time series, and augmentation by adding noise or shifting the signal.
  • main.py is the central experiment launcher. It contains several model branches ranging from simple baselines to tree-based models and neural models.
  • model_training.py implements baseline logic and reusable training utilities for classical models.
  • 1featuremodel.py explores a simple hand-engineered approach using FFT-based cross-distance features and XGBoost for bacteria classification.
  • Transformers.py, LSTM.py, NN_CNN.py, and U_net.py contain deep-learning experiments on raw or lightly processed time series.
  • nixtla_nn_reg.py and nixpred.py contain additional experiments using forecasting/time-series style approaches.
  • optuna_objectives.py, gram_clustering.py, and related helper scripts support hyperparameter search, clustering-style analyses, and evaluation.

In simple terms, this folder is the main benchmark suite for asking whether the optical transmission trace can identify not only broad labels such as Gram type, but also finer bacterial identity.
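The chunking and noise/shift augmentation mentioned for data_processing.py can be sketched as follows (function names are illustrative, assuming a 1-D normalized transmission trace):

```python
import numpy as np

def chunk_signal(signal, chunk_len, stride):
    """Split a 1-D transmission trace into overlapping fixed-length chunks."""
    starts = range(0, len(signal) - chunk_len + 1, stride)
    return np.stack([signal[s:s + chunk_len] for s in starts])

def augment(chunks, noise_std=0.01, max_shift=5, rng=None):
    """Augment chunks two ways: add Gaussian noise, and circularly
    shift each chunk by a small random offset in time."""
    rng = np.random.default_rng(rng)
    noisy = chunks + rng.normal(0.0, noise_std, chunks.shape)
    shifts = rng.integers(-max_shift, max_shift + 1, len(chunks))
    shifted = np.stack([np.roll(c, s) for c, s in zip(chunks, shifts)])
    return np.concatenate([noisy, shifted])

trace = np.sin(np.linspace(0, 20, 1000))   # stand-in transmission trace
chunks = chunk_signal(trace, chunk_len=200, stride=100)
augmented = augment(chunks, rng=0)
```

Chunking turns one long recording into many training windows, and the label of the parent trace is inherited by each chunk.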

How the code is organized conceptually

Across the repository, the workflow follows the same general logic:

  1. Load analyzed and raw measurement documents
    The code assumes local pickle files or database-derived documents containing transmission signals and metadata.

  2. Normalize the optical signal
    Most scripts divide transmission traces by a normalization factor so that signals are comparable across samples.

  3. Represent the signal in one of two ways

    • Tabular mode: extract descriptive features such as mean, standard deviation, quantiles, skewness, maxima/minima, or FFT-derived summaries.
    • Time-series mode: keep the signal as a sequence and train a dedicated classifier on the raw trace.
  4. Split the data carefully
    Several scripts split data in a stratified way so that bacterial families or class ratios are preserved across training, validation, and test sets.

  5. Augment and rebalance when needed
    Depending on the script, the code uses methods such as noise injection, time shifting, chunking, GAN-based augmentation, and SMOTE.

  6. Train and benchmark models
    The repository compares a broad set of methods instead of committing to one model family too early.
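As a concrete illustration of step 3's tabular mode, here is a minimal feature extractor computing the kinds of descriptors named above (the exact feature set in the scripts differs; this is a hedged sketch):

```python
import numpy as np
from scipy.stats import skew

def tabular_features(signal):
    """Summarize a normalized transmission trace with descriptive
    statistics and coarse FFT-derived summaries."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    return {
        "mean": float(signal.mean()),
        "std": float(signal.std()),
        "q25": float(np.quantile(signal, 0.25)),
        "q75": float(np.quantile(signal, 0.75)),
        "min": float(signal.min()),
        "max": float(signal.max()),
        "skewness": float(skew(signal)),
        "dominant_freq_bin": int(np.argmax(spectrum)),   # index of strongest frequency
        "spectral_energy": float((spectrum ** 2).sum()),
    }

trace = np.sin(np.linspace(0, 20, 1000))   # stand-in trace, ~3 cycles
feats = tabular_features(trace)
```

Each trace collapses to one fixed-length row, so any classical tabular model from the list below can consume the result directly.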

Models explored

The repository includes experiments with several classes of models:

Classical / tabular models

  • XGBoost
  • Random Forest
  • KNN
  • SVM
  • RidgeClassifierCV
  • Voting ensembles

These are mainly used when the transmission signal is converted into tabular descriptors.

Time-series specialists

  • MultiRocket
  • HIVE-COTE 2.0
  • InceptionTime

These are used when the signal is treated as a sequence and benchmarked as a time-series classification problem.
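The scripts rely on library implementations of these classifiers (e.g. via sktime). As a rough illustration of the ROCKET family's core idea, which MultiRocket extends with additional pooling statistics, here is a numpy-only sketch: random convolutional kernels, simple pooling, then a linear classifier.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

def rocket_features(X, n_kernels=200, rng=None):
    """ROCKET-style transform: convolve each series with random
    zero-mean kernels and pool each convolution with max and
    proportion-of-positive-values (PPV)."""
    rng = np.random.default_rng(rng)
    feats = []
    for _ in range(n_kernels):
        w = rng.normal(size=rng.choice([7, 9, 11]))
        w -= w.mean()                          # zero-mean random kernel
        b = rng.uniform(-1, 1)                 # random bias
        conv = np.stack([np.convolve(x, w, mode="valid") + b for x in X])
        feats.append(conv.max(axis=1))         # max pooling
        feats.append((conv > 0).mean(axis=1))  # PPV pooling
    return np.column_stack(feats)

# toy two-class problem: 3 Hz vs 4 Hz sinusoids with mild noise
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 150)
labels = rng.integers(0, 2, 100)
X = np.stack([np.sin(2 * np.pi * (3 + c) * t) + 0.1 * rng.normal(size=t.size)
              for c in labels])

F = rocket_features(X, rng=0)
clf = RidgeClassifierCV().fit(F[:80], labels[:80])
acc = clf.score(F[80:], labels[80:])
```

The random transform is fit-free, so all the learning happens in the fast linear classifier; this is why RidgeClassifierCV appears alongside the tabular models above.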

Neural sequence models

  • CNN-based classifiers
  • LSTM / GRU-style models
  • Transformer-based models
  • U-Net-style sequence models
  • Additional Nixtla-style forecasting/representation experiments

These scripts are more exploratory and represent the “raw time-series” side of the project.

Why both tabular and raw time-series approaches?

The main scientific uncertainty in this project was not only whether the signal was informative, but also what kind of model representation was best matched to it.

A tabular representation is useful when domain knowledge suggests that a few summary descriptors of the signal may already capture the relevant biological information. A raw time-series representation is useful when the discriminative structure is distributed in time and may be better learned directly by sequence models.

This repository was built to test both hypotheses side by side.

Repository philosophy

This codebase should be read as a research exploration log:

  • it contains real experimental branches,
  • it compares several modeling paradigms,
  • it reflects iterative work rather than a single final production pipeline.

That is intentional. The purpose of the repository is to document how the problem was formulated, how the data were represented, and how different machine learning strategies were evaluated on the same underlying photonic sensing task.

Notes on reproducibility

This repository does not currently provide a fully packaged end-to-end reproduction setup. Some scripts expect:

  • local pickle files such as data.pkl, docs_analysed.pkl, or docs_meas.pkl,
  • environment-specific paths,
  • helper modules that may not be included in the public repo.

For that reason, the code is best understood as a research repository containing the main experimental logic and model implementations.

Dependencies

The scripts draw on a broad scientific Python stack, including:

  • numpy
  • pandas
  • scikit-learn
  • xgboost
  • tensorflow
  • torch
  • optuna
  • imblearn
  • scipy
  • matplotlib
  • sktime
  • tslearn

Some scripts may require additional utilities depending on the specific experiment branch.

Summary

At a high level, this repository studies whether optical transmission time series collected from bacteria trapped on a photonic chip contain enough information to support biological sensing tasks such as viability prediction, antibiotic-response analysis, and bacterial classification. It does so by systematically comparing feature-based tabular pipelines with raw time-series models across several datasets and experimental settings.