Research code for machine learning on photonic crystal optical transmission signals for antimicrobial sensing and bacterial characterization.
This project explores a simple but important scientific question: when a bacterium is trapped on a photonic chip and the device records how the transmitted light changes over time, does that recording reveal the state of the bacterium? The core hypothesis behind this work is that this temporal optical signal contains a measurable signature of the bacterium’s biological state.
More importantly, if successful, the approach could offer a new way of fighting antimicrobial resistance, which the WHO has listed among the top 10 global public health threats: https://www.who.int/news-room/fact-sheets/detail/antimicrobial-resistance
To test that idea, we used machine learning in two complementary ways. In one setting, each signal was converted into a tabular representation using descriptive statistics of the time series, such as means, variances, quantiles, skewness, spectral summaries, and hand-engineered features. In the other setting, the signal was treated directly as a raw time series and modeled with dedicated sequence classifiers.
The repository therefore benchmarks both classical tabular methods and specialized time-series models on several related tasks:
- predicting whether bacteria are alive or dead from optical transmission measurements,
- studying antibiotic response across concentration levels,
- classifying Gram type, bacterial shape, and strain identity.
More broadly, this work was part of an effort to test whether photonic transmission traces can support automated biological sensing, with the longer-term goal of enabling larger-scale optical trapping datasets and more specialized models.
This repository is a research-code snapshot rather than a polished software package. It contains the main preprocessing and modeling scripts used during the project, grouped by experimental setting.
This folder contains code for experiments on antibiotic-level response using engineered features extracted from segmented transmission signals.
Main pieces:
- `data_processing.py` merges analyzed documents with raw measurement documents, normalizes transmission traces, derives antibiotic quantities, and assigns known labels when possible.
- `feature_extraction.py` computes descriptive features from segmented signal states such as OFF, ON, trapping, and on-to-trapping.
- `main.py` builds a tabular pipeline for alive/dead prediction across antibiotic levels. It performs:
  - feature preparation,
  - train/validation/test splitting stratified by antibiotic concentration,
  - imputation and standardization,
  - synthetic augmentation with a simple GAN,
  - class balancing with SMOTE,
  - a dense neural network used as a lightweight transformer-style classifier,
  - a soft-voting ensemble combining KNN, SVM, and XGBoost.
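The splitting, imputation, and standardization steps listed above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code in `main.py`: the function names `stratified_split` and `impute_and_standardize`, the 70/15/15 fractions, the median imputation strategy, and the synthetic concentration values are all hypothetical.

```python
import numpy as np


def stratified_split(strata, fractions=(0.7, 0.15, 0.15), seed=0):
    """Return train/val/test index arrays that preserve the share of each stratum.

    `strata` holds one label per sample (here: a hypothetical antibiotic
    concentration). The fractions are illustrative, not the project's values.
    """
    rng = np.random.default_rng(seed)
    strata = np.asarray(strata)
    train, val, test = [], [], []
    for level in np.unique(strata):
        # Shuffle the indices of this stratum, then cut them into three parts.
        idx = rng.permutation(np.flatnonzero(strata == level))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return np.array(train), np.array(val), np.array(test)


def impute_and_standardize(X_train, X_other):
    """Median-impute NaNs and z-score both arrays using training statistics only."""
    median = np.nanmedian(X_train, axis=0)
    X_train = np.where(np.isnan(X_train), median, X_train)
    X_other = np.where(np.isnan(X_other), median, X_other)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X_train - mean) / std, (X_other - mean) / std


# Synthetic stand-in data: 200 samples, 5 features, 4 concentration levels.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in missing values
concentration = np.repeat([0.0, 0.5, 1.0, 2.0], 50)

train_idx, val_idx, test_idx = stratified_split(concentration)
X_train, X_val = impute_and_standardize(X[train_idx], X[val_idx])
```

Computing the imputation and scaling statistics on the training split alone keeps information from the validation and test sets out of the fitted pipeline.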
In simple terms, this is the “feature-based antibiotic response” branch of the project.
This folder focuses on alive/dead prediction and on transferring a model trained on viability data to AMP measurements.
Main pieces:
- `data_processing_VIAB.py` extracts tabular features directly from the analyzed viability documents and creates the binary `dead` target.
- `data_processing_AMP.py` processes the AMP dataset into a compatible tabular feature format so that the model trained on viability data can be applied to it.
- `main.py` trains an XGBoost classifier on a selected feature subset, uses SMOTE-based augmentation, tunes hyperparameters with Optuna, evaluates the classifier, and then predicts dead/alive status on the AMP dataset before plotting predicted outcomes by antibiotic concentration.
- `Mrock.py` is the main time-series benchmarking script in this folder. It compares raw-signal classifiers including:
  - MultiRocket + XGBoost,
  - InceptionTime,
  - HIVE-COTE 2.0.
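To make the SMOTE-based augmentation mentioned for `main.py` more concrete, the sketch below reimplements the core interpolation idea in plain NumPy. The scripts themselves presumably rely on a library implementation (`imblearn` appears in the dependency list), so the function name `smote_like_oversample`, its parameters, and the synthetic "alive"/"dead" tables are assumptions made only for illustration.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # which sample to start from
    pick = rng.integers(0, k, size=n_new)  # which neighbour to move toward
    gap = rng.random((n_new, 1))           # interpolation weight in [0, 1)
    target = X_minority[neighbours[base, pick]]
    return X_minority[base] + gap * (target - X_minority[base])


# Hypothetical imbalanced feature table: 90 "alive" rows and 10 "dead" rows.
rng = np.random.default_rng(1)
X_alive = rng.normal(loc=0.0, size=(90, 4))
X_dead = rng.normal(loc=3.0, size=(10, 4))

X_dead_synth = smote_like_oversample(X_dead, n_new=len(X_alive) - len(X_dead))
X_dead_balanced = np.vstack([X_dead, X_dead_synth])
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points never end up in the data used for evaluation.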
In simple terms, this folder asks:
- can a model learn alive vs. dead from transmission data,
- and can that signal-level knowledge transfer to antibiotic-response experiments?
This is the largest and most exploratory folder. It studies the classification of Gram type, cell shape, and bacterial strain from transmission signals.
Main pieces:
- `data_processing.py` handles normalization, stratified splitting by bacteria family, chunking of time series, and augmentation by adding noise or shifting the signal.
- `main.py` is the central experiment launcher. It contains several model branches ranging from simple baselines to tree-based models and neural models.
- `model_training.py` implements baseline logic and reusable training utilities for classical models.
- `1featuremodel.py` explores a simple hand-engineered approach using FFT-based cross-distance features and XGBoost for bacteria classification.
- `Transformers.py`, `LSTM.py`, `NN_CNN.py`, and `U_net.py` contain deep-learning experiments on raw or lightly processed time series.
- `nixtla_nn_reg.py` and `nixpred.py` contain additional experiments using forecasting/time-series style approaches.
- `optuna_objectives.py`, `gram_clustering.py`, and related helper scripts support hyperparameter search, clustering-style analyses, and evaluation.
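The chunking and augmentation steps attributed to `data_processing.py` could look roughly like the following. This is a hedged sketch, not the repository's implementation: the function names, the chunk length, the noise level, and the choice of a circular shift (rather than padding or cropping) are all assumptions.

```python
import numpy as np


def chunk_signal(signal, chunk_len, step=None):
    """Cut a 1-D trace into fixed-length windows (non-overlapping by default).

    The trace must contain at least `chunk_len` samples.
    """
    signal = np.asarray(signal, dtype=float)
    step = chunk_len if step is None else step
    starts = range(0, len(signal) - chunk_len + 1, step)
    return np.stack([signal[s:s + chunk_len] for s in starts])


def add_noise(chunks, noise_std, rng):
    """Add zero-mean Gaussian noise to every sample of every chunk."""
    return chunks + rng.normal(scale=noise_std, size=chunks.shape)


def shift_in_time(chunks, max_shift, rng):
    """Circularly shift each chunk by a random offset in [-max_shift, max_shift]."""
    shifts = rng.integers(-max_shift, max_shift + 1, size=len(chunks))
    return np.stack([np.roll(c, s) for c, s in zip(chunks, shifts)])


# Synthetic stand-in for a normalized transmission trace of 1000 samples.
rng = np.random.default_rng(0)
trace = 1.0 + 0.05 * np.sin(np.linspace(0, 20 * np.pi, 1000))

chunks = chunk_signal(trace, chunk_len=200)           # shape (5, 200)
noisy = add_noise(chunks, noise_std=0.01, rng=rng)
shifted = shift_in_time(chunks, max_shift=20, rng=rng)
augmented = np.concatenate([chunks, noisy, shifted])  # shape (15, 200)
```

One design point worth keeping in mind: if traces are split into train and test sets before chunking, all chunks and augmented copies of a given trace stay on the same side of the split, which avoids leaking near-duplicates into the test set.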
In simple terms, this folder is the main benchmark suite for asking whether the optical transmission trace can identify not only broad labels such as Gram type, but also finer bacterial identity.
Across the repository, the workflow follows the same general logic:
- Load analyzed and raw measurement documents. The code assumes local pickle files or database-derived documents containing transmission signals and metadata.
- Normalize the optical signal. Most scripts divide transmission traces by a normalization factor so that signals are comparable across samples.
- Represent the signal in one of two ways:
  - Tabular mode: extract descriptive features such as mean, standard deviation, quantiles, skewness, maxima/minima, or FFT-derived summaries.
  - Time-series mode: keep the signal as a sequence and train a dedicated classifier on the raw trace.
- Split the data carefully. Several scripts split data in a stratified way so that bacterial families or class ratios are preserved across training, validation, and test sets.
- Augment and rebalance when needed. Depending on the script, the code uses methods such as noise injection, time shifting, chunking, GAN-based augmentation, and SMOTE.
- Train and benchmark models. The repository compares a broad set of methods instead of committing to one model family too early.
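As an illustration of the normalization step and the tabular mode described above, the sketch below divides a trace by a scalar factor and computes a handful of descriptive and FFT-derived features. The names `normalize_trace` and `describe_trace`, the `reference` factor, the sampling rate, and the exact feature set are assumptions for the example; the repository's scripts may define these differently.

```python
import numpy as np


def normalize_trace(trace, reference):
    """Divide a transmission trace by a scalar normalization factor."""
    return np.asarray(trace, dtype=float) / float(reference)


def describe_trace(trace, sample_rate_hz=1.0):
    """Return a dict of summary statistics for one 1-D trace."""
    x = np.asarray(trace, dtype=float)
    mean, std = x.mean(), x.std()
    centered = x - mean
    # Population skewness; defined as 0 for a constant signal.
    skew = 0.0 if std == 0 else float(np.mean(centered ** 3) / std ** 3)
    # One-sided power spectrum of the mean-removed signal.
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate_hz)
    total = power.sum()
    centroid = float((freqs * power).sum() / total) if total > 0 else 0.0
    return {
        "mean": float(mean),
        "std": float(std),
        "min": float(x.min()),
        "max": float(x.max()),
        "q25": float(np.quantile(x, 0.25)),
        "median": float(np.quantile(x, 0.5)),
        "q75": float(np.quantile(x, 0.75)),
        "skewness": skew,
        "dominant_freq_hz": float(freqs[np.argmax(power)]),
        "spectral_centroid_hz": centroid,
    }


# Synthetic example: a 5 Hz oscillation on a baseline of 2.0, sampled at 100 Hz.
t = np.arange(200) / 100.0
raw = 2.0 * (1.0 + 0.1 * np.sin(2 * np.pi * 5.0 * t))
features = describe_trace(normalize_trace(raw, reference=2.0), sample_rate_hz=100.0)
```

Stacking one such dictionary per trace (for example into a `pandas.DataFrame`) gives the tabular representation; in time-series mode the normalized trace itself would be passed to the classifier instead.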
The repository includes experiments with several classes of models:
- XGBoost
- Random Forest
- KNN
- SVM
- RidgeClassifierCV
- Voting ensembles
These are mainly used when the transmission signal is converted into tabular descriptors.
- MultiRocket
- HIVE-COTE 2.0
- InceptionTime
These are used when the signal is treated as a sequence and benchmarked as a time-series classification problem.
- CNN-based classifiers
- LSTM / GRU-style models
- Transformer-based models
- U-Net-style sequence models
- Additional Nixtla-style forecasting/representation experiments
These scripts are more exploratory and represent the “raw time-series” side of the project.
The main scientific uncertainty in this project was not only whether the signal was informative, but also what kind of model representation was best matched to it.
A tabular representation is useful when domain knowledge suggests that a few summary descriptors of the signal may already capture the relevant biological information. A raw time-series representation is useful when the discriminative structure is distributed in time and may be better learned directly by sequence models.
This repository was built to test both hypotheses side by side.
This codebase should be read as a research exploration log:
- it contains real experimental branches,
- it compares several modeling paradigms,
- it reflects iterative work rather than a single final production pipeline.
That is intentional. The purpose of the repository is to document how the problem was formulated, how the data were represented, and how different machine learning strategies were evaluated on the same underlying photonic sensing task.
This repository does not currently provide a fully packaged end-to-end reproduction setup. Some scripts expect:
- local pickle files such as `data.pkl`, `docs_analysed.pkl`, or `docs_meas.pkl`,
- environment-specific paths,
- helper modules that may not be included in the public repo.
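Because the scripts depend on local files that are not shipped with the repository, a small loader that fails early with a readable message can save some debugging. The helper below is a hypothetical convenience, not part of the codebase; the `data/` directory in the commented usage is likewise an assumption.

```python
import pickle
from pathlib import Path


def load_documents(path):
    """Load a pickled object, failing early with a clear message if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected a pickle file at {path}; check the list of required local files."
        )
    with path.open("rb") as handle:
        return pickle.load(handle)


# Example usage (paths are hypothetical):
# docs_analysed = load_documents("data/docs_analysed.pkl")
# docs_meas = load_documents("data/docs_meas.pkl")
```

Pickle files can execute arbitrary code when loaded, so they should only be opened when they come from a trusted source.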
For that reason, the code is best understood as a research repository containing the main experimental logic and model implementations.
The scripts draw on a broad scientific Python stack, including:
- `numpy`
- `pandas`
- `scikit-learn`
- `xgboost`
- `tensorflow`
- `torch`
- `optuna`
- `imblearn`
- `scipy`
- `matplotlib`
- `sktime`
- `tslearn`
Some scripts may require additional utilities depending on the specific experiment branch.
At a high level, this repository studies whether optical transmission time series collected from bacteria trapped on a photonic chip contain enough information to support biological sensing tasks such as viability prediction, antibiotic-response analysis, and bacterial classification. It does so by systematically comparing feature-based tabular pipelines with raw time-series models across several datasets and experimental settings.