A dual-mode (Laptop/HPC) implementation of a pollution-aware stylometry pipeline combining symbolic span detection with geometric projection and constrained Transformer training.
- Phase A: Pollution Guard - GLiNER-based span detection + LEACE geometric projection
- Phase D: Constrained Transformer - Affine Guard layer for stylometric constraint enforcement
- Dual-Mode Execution - Automatic adaptation between laptop debug and HPC production environments
- Apache Arrow Backend - Memory-efficient data processing with zero-copy reads
```bash
pip install -e .
```

For development:

```bash
pip install -e ".[dev]"
```

For HPC features (cuML):

```bash
pip install -e ".[hpc]" --extra-index-url=https://pypi.nvidia.com
```

To install everything (beware: cuML is Linux-only):

```bash
pip install -e ".[all]" --extra-index-url=https://pypi.nvidia.com
```

Remark: FP8 precision requires TransformerEngine. To install it:

```bash
pip install --no-build-isolation transformer_engine[pytorch]
```
- Make sure all SOBR CSV files are in `datasets/`.
- Create the unified Arrow dataset:

  ```bash
  # Convert raw data to Arrow format
  python scripts/convert_pandas_to_arrow.py
  ```

- (Optional) Create the laptop partition:

  ```bash
  # Create laptop arrow partition from unified table
  python scripts/create_laptop_dataset.py
  ```

- Run the full pipeline:

  ```bash
  python -m neuro_stylometry run-full-pipeline \
      --dataset artifacts/data/sobr.arrow \
      --output-dir artifacts/full_pipeline_output \
      --mode laptop \
      --use-only nationality
  ```

  Use `artifacts/data/sobr_laptop.arrow` as the dataset and/or `--mode hpc` as needed.

The `convert_pandas_to_arrow` command converts the raw SOBR demographic CSV files into a unified Arrow dataset. It validates the input files, processes the data, and outputs a single Arrow file for use in the pipeline.
```bash
python scripts/convert_pandas_to_arrow.py \
    --raw-data-dir datasets/ \
    --output artifacts/data/sobr.arrow \
    --verbose
```

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `--raw-data-dir` | Path (optional) | `datasets/` | Directory containing the 8 demographic CSV files. |
| `--output` | Path (optional) | `artifacts/data/sobr.arrow` | Output path for the unified Arrow file. |
| `--verbose` | Flag | `False` | Enable verbose logging (DEBUG level). |
Ensure that `--raw-data-dir` contains all 8 required demographic CSV files before running this command; missing or invalid files will cause the conversion to fail.
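Conceptually, that validation amounts to a pre-flight check like the one below. This is a hypothetical sketch, not the script's actual code: the real checks (and the exact file names) may differ.

```python
from pathlib import Path


def validate_raw_data_dir(raw_data_dir: str, expected_count: int = 8) -> list:
    """Fail fast if the expected demographic CSV files are not all present."""
    csv_files = sorted(Path(raw_data_dir).glob("*.csv"))
    if len(csv_files) != expected_count:
        raise FileNotFoundError(
            f"expected {expected_count} CSV files in {raw_data_dir!r}, "
            f"found {len(csv_files)}"
        )
    return csv_files
```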
The run-full-pipeline command is the primary entry point for running the entire pipeline. It executes all phases (Phase A and Phase D) in sequence, including optional verification and report generation steps. This command is designed to simplify the pipeline execution process.
```bash
python -m neuro_stylometry run-full-pipeline \
    --dataset datasets/nationality.arrow \
    --output-dir artifacts/full_pipeline_output \
    --mode laptop \
    --use-only nationality
```

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `--dataset` | Path (required) | None | Path to the input dataset Arrow file. |
| `--output-dir` | Path (required) | None | Output directory for all pipeline artifacts. |
| `--mode` | Choice | `auto` | Hardware mode: `auto`, `laptop`, or `hpc`. |
| `--config` | Path (optional) | None | Optional experiment YAML config to merge on top of the base and mode configs. |
| `--use-only` | str (multiple) | None | Filter to the specified demographic labels (e.g., `--use-only gender`). |
| `--skip-phase-a` | Flag | `False` | Skip Phase A (requires existing Phase A artifacts in `output-dir/phase_a`). |
| `--skip-phase-d` | Flag | `False` | Skip Phase D (only run Phase A). |
| `--skip-verify` | Flag | `False` | Skip Phase D verification (CHG + SVS). |
| `--skip-report` | Flag | `False` | Skip Phase D report generation. |
| `--dry-run` | Flag | `False` | Print the resolved config and execution plan without running. |
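The layering implied by `--mode` and `--config` can be pictured as a recursive dictionary merge: base config first, then the mode config, then the optional experiment YAML, each overriding the last. A sketch of the idea (the key names below are made up; the project's actual config schema may differ):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` on top of `base` without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


base = {"trainer": {"epochs": 3, "precision": "fp32"}, "mode": "auto"}
mode = {"trainer": {"precision": "bf16"}, "mode": "hpc"}
experiment = {"trainer": {"epochs": 1}}

# base <- mode <- experiment, later layers win key by key
config = deep_merge(deep_merge(base, mode), experiment)
# → {'trainer': {'epochs': 1, 'precision': 'bf16'}, 'mode': 'hpc'}
```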
Using a laptop-sized partition (e.g., one created with the `create_laptop_dataset.py` script) may yield statistically insignificant findings, and the pipeline can fail when there is too little data for meaningful analysis. Use the full dataset for robust results.
The create_laptop_dataset command generates a lightweight, representative subset of the full SOBR Arrow dataset. This is particularly useful for development and debugging on systems with limited resources, as it allows for rapid iteration while preserving the structural fidelity of the dataset.
```bash
python scripts/create_laptop_dataset.py \
    --input artifacts/data/sobr.arrow \
    --output artifacts/data/sobr_laptop.arrow \
    --target-size 5000 \
    --seed 123
```

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `--input` | Path (optional) | `artifacts/data/sobr.arrow` | Path to the full SOBR Arrow file. |
| `--output` | Path (optional) | `artifacts/data/sobr_laptop.arrow` | Output path for the laptop subset. |
| `--target-size` | int (optional) | `10000` | Approximate number of posts in the subset. |
| `--seed` | int (optional) | `42` | Random seed for reproducibility. |
| `--verbose` | Flag | `False` | Enable verbose logging. |
The laptop partition is an optional feature that creates a smaller, representative subset of the full dataset for development and debugging. This is particularly useful for rapid iteration on systems with limited resources. The size of the subset is controlled with the `--target-size` flag. For example:

```bash
python scripts/create_laptop_dataset.py \
    --input artifacts/data/sobr.arrow \
    --output artifacts/data/sobr_laptop.arrow \
    --target-size 5000
```

By default, the target size is 10,000 posts. Creating a laptop partition is not required: the full pipeline can run on the full dataset directly.
Important: While the laptop partition is useful for development, its reduced size may lead to statistically insignificant findings. For production-level training, we recommend the full dataset. The pipeline has been tested and trained on a VM with an NVIDIA RTX 5090 GPU using the full dataset.
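One way such a representative subset can be drawn is stratified sampling by a demographic label, so the subset mirrors the full label distribution. The sketch below illustrates the idea only; the column name and the actual strategy used by `create_laptop_dataset.py` are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, target_size, seed=42):
    """Sample ~target_size rows while preserving per-label proportions."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    total = len(rows)
    subset = []
    for group in by_label.values():
        # Each label keeps (roughly) its share of the target size.
        k = max(1, round(target_size * len(group) / total))
        subset.extend(rng.sample(group, min(k, len(group))))
    return subset


rows = [{"nationality": "NL"}] * 800 + [{"nationality": "BE"}] * 200
subset = stratified_sample(rows, "nationality", target_size=100)
```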
For advanced users, the pipeline can also be executed in separate phases using the following commands:
This command runs only Phase A of the pipeline, which performs pollution detection and mitigation.
```bash
python -m neuro_stylometry run-phase-a \
    --dataset datasets/sobr.arrow \
    --output-dir artifacts/phase_a_output \
    --mode laptop
```

- Detects pollution spans using GLiNER.
- Applies LEACE projection to remove pollution information.
- Outputs artifacts such as `clean_dataset.arrow`, `projection_matrix.pt`, and `pollution_logs.arrow`.
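To give intuition for the geometric step: LEACE-style erasure applies an affine map that makes the protected label linearly unpredictable from the embeddings. The toy sketch below removes only the class-mean-difference direction, a simplification of the actual LEACE estimator (which additionally whitens before projecting):

```python
import numpy as np


def erase_direction(X: np.ndarray, y: np.ndarray):
    """Project out the class-mean-difference direction from embeddings X."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    d = d / np.linalg.norm(d)
    P = np.eye(X.shape[1]) - np.outer(d, d)  # rank-(n-1) orthogonal projector
    return X @ P, P


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (rng.random(200) < 0.5).astype(int)
X[y == 1] += 2.0  # inject a linear "pollution" signal for class 1

X_clean, P = erase_direction(X, y)
# After projection, the class means coincide along the erased direction.
```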
This command runs only Phase D of the pipeline, which trains a constrained Transformer model.
```bash
python -m neuro_stylometry run-phase-d \
    --dataset artifacts/phase_d/tokenized_dataset.arrow \
    --output-dir artifacts/phase_d_test \
    --artifacts-dir artifacts/phase_a \
    --mode hpc \
    --config conf/experiments/quick_test.yaml
```

- Trains a baseline Transformer model on unmasked data.
- Trains a constrained Transformer model using the LEACE projection matrix.
- Outputs comparative metrics and trained model checkpoints.
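A plausible shape for the constraint layer is a frozen affine map built from Phase A's `projection_matrix.pt`, registered as buffers so the optimizer never updates it. This is a sketch of the idea only; the real `AffineGuard` signature and its placement in the network are assumptions.

```python
import torch
import torch.nn as nn


class AffineGuard(nn.Module):
    """Applies a fixed affine constraint (projection + offset) to hidden states."""

    def __init__(self, projection: torch.Tensor, bias: torch.Tensor = None):
        super().__init__()
        # Buffers move with the module across devices but are never trained.
        self.register_buffer("P", projection)
        self.register_buffer(
            "b", bias if bias is not None else torch.zeros(projection.shape[0])
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden @ self.P.T + self.b


# Identity stands in for the learned LEACE matrix loaded from Phase A.
guard = AffineGuard(torch.eye(8))
out = guard(torch.randn(2, 5, 8))  # (batch, seq, hidden) passes through
```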
MIT License