We are excited to announce the first official stable release of DiPTox on PyPI! This milestone brings production-ready stability and performance enhancements:
- Multi-Process Acceleration:
  - Accelerate chemical preprocessing tasks by 10x or more using the `n_jobs` parameter.
  - Intelligent task distribution across CPU cores for heavy datasets.
- Cross-Platform Robustness:
  - Implemented a specialized "Guard Mechanism" for Windows multiprocessing to prevent memory explosion and recursive process spawning issues.
  - Verified stability across Windows, Linux, and macOS environments.
- Enhanced Data Loading:
  - Switched to binary stream parsing for `.sdf` and `.mol` files to resolve encoding crashes (e.g., `utf-8` vs `latin-1`).
  - Auto-parsing of molecular structures to generate SMILES even when properties are missing.
- Data Loading Fixes: Fixed and optimized the native parsing logic for `.smi` (SMILES) files, resolving previous reading issues to ensure stable ingestion of large-scale chemical databases.
- Web Request Module Overhaul: Completely refactored the network request engine for stability and transparency. This update introduces a "Capability Map" and fast-fail logic to intercept unsupported queries and Auth/404 errors (eliminating infinite retry deadlocks). It also eradicates "silent failures" by logging granular failure reasons (e.g., `Failed -> pubchem: Not Found | chemspider: Auth Error (401)`), and implements field-level data provenance that records the exact source for each molecular property, improving dataset auditability.
- Enhanced Unit Standardization: Added support for the standard math operator `^` (power) by automatically mapping it to `**`, and fixed a logic error that caused single-unit datasets to be skipped even when a different target unit was specified.
- Deduplication Logic Upgrades: Introduced a `log10` transformation mode alongside the existing `-log10` option, enabling support for both toxicity data (pIC50) and physicochemical properties like water solubility (logS) or partition coefficients.
- Robustness & Error Handling: Implemented strict numerical validation using `errors='coerce'` in standardization and deduplication modules to automatically filter out invalid strings (e.g., "N/A", ">100") with clear warning feedback in the GUI.
- Critical State Management Fix: Resolved an issue where `load_data` failed to reset the `_preprocess_key` flag, ensuring that automatic column mapping logic for Web Requests (like auto-detecting `smiles_from_web`) functions correctly after a new dataset is loaded.
- GUI State Management Fix: Resolved a `StreamlitAPIException` on the Export page that occurred when using the "Undo Last Step" feature. Implemented proper `on_click` callbacks to safely mutate the `session_state` (specifically for `export_selected_cols`) before the UI re-renders, so that undo no longer crashes.
- Refined Preprocessing Rules: Adjusted and optimized several default charge neutralization rules.
- Enhanced Unit Standardization: Custom conversion formulas now fully support molecular weight (`mw`). You can convert between molarity and mass concentrations (e.g., using formulas like `x * mw * 1000`).
- GUI Interface Optimization: The Streamlit graphical interface has been redesigned for a more professional, clean, and logical scientific layout. We've reduced visual clutter, grouped configuration panels intuitively, and improved component alignment.
- Comprehensive Audit Log (History): The processing history has been heavily upgraded. It now records granular parameters for every operation, including exactly which preprocessing rules were triggered, active deduplication conditions, web query statuses, and substructure search match counts.
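
The fast-fail behaviour described in the Web Request Module Overhaul item can be pictured roughly as follows. This is a minimal sketch, not DiPTox's actual code: the `CAPABILITY_MAP` contents, the `source_a`/`source_b` names, the helper functions, and the choice of which status codes count as permanent are all assumptions made for illustration.

```python
# Illustrative sketch only; every name and table entry here is hypothetical.

# Hypothetical capability map: which output fields each source can return.
CAPABILITY_MAP = {
    "source_a": {"smiles", "iupac", "cas", "mw"},
    "source_b": {"smiles", "iupac"},
}

# Assumed policy: auth errors and "not found" will not succeed on a retry.
PERMANENT_ERRORS = {401: "Auth Error (401)", 403: "Auth Error (403)", 404: "Not Found"}


def supported(source: str, field: str) -> bool:
    """Return True if the capability map lists the field for the source."""
    return field in CAPABILITY_MAP.get(source, set())


def classify_status(status: int) -> str:
    """Map an HTTP status code to 'ok', 'fail' (do not retry), or 'retry'."""
    if 200 <= status < 300:
        return "ok"
    if status in PERMANENT_ERRORS:
        return "fail"
    return "retry"


def failure_summary(results: dict) -> str:
    """Build a per-source failure line from a {source: status_code} mapping."""
    parts = []
    for source, status in results.items():
        reason = PERMANENT_ERRORS.get(status, f"HTTP {status}")
        parts.append(f"{source}: {reason}")
    return "Failed -> " + " | ".join(parts)


print(failure_summary({"pubchem": 404, "chemspider": 401}))
```

Checking `supported()` before sending a request and stopping on a `'fail'` classification is what removes the pointless retries; only `'retry'` outcomes would go back into a retry loop.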
To help us understand our user base and improve the software, DiPTox includes a one-time, optional survey on first use.
- Completely Optional: You can skip it with a single click.
- Privacy-Focused: The information helps us with academic impact assessment. It will not be shared.
Powered by Streamlit, the GUI allows users to perform all workflows visually without writing code.
- Visual Operation: Complete workflow control via a web browser.
- Real-time Preview: Instantly view data changes after applying rules.
- Rule Management: Add/Remove valid atoms, salts, solvents, and unit conversion formulas interactively.
- Smart Column Mapping: Intelligent detection of headers and binary file structures.
A configurable pipeline to clean and normalize chemical structures.
- Strict Inorganic Filtering: Updated SMARTS matching to accurately identify complex inorganic species (e.g., ionic cyanides) without misclassifying organic nitriles.
- Pipeline Steps:
- Remove salts & solvents
- Handle mixtures (keep largest fragment)
- Remove inorganic molecules
- Neutralize charges & Validate atomic composition
- Remove explicit hydrogens, stereochemistry, and isotopes
- Reject Radical Species: Automatically discard molecules containing free radical atoms.
- Standardize to canonical SMILES
- Filter by atom count
Normalize heterogeneous target data into a single unit effortlessly.
- Automatic Conversion: Built-in rules for Concentration, Time, Pressure, and Temperature.
- Custom Formulas: Define mathematical rules (e.g., `x * 1000` or `10**(-x)`) interactively via GUI or script.
- Unified Output: Standardize diverse units (e.g., `ug/mL`, `g/L`, `M`) to a single target (e.g., `mg/L`).
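
To make the custom formulas above concrete, here is a minimal sketch of how a formula string such as `x * mw * 1000` could be evaluated. It is not DiPTox's implementation: the `evaluate_formula` helper is hypothetical, and walking the expression with Python's `ast` module (instead of calling `eval`) is a design choice of this sketch.

```python
import ast
import operator

# Binary and unary operators the sketch allows inside a formula.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate_formula(formula: str, **variables: float) -> float:
    """Evaluate an arithmetic formula such as 'x * mw * 1000' without eval()."""
    # Treat '^' as exponentiation by rewriting it to '**' before parsing.
    tree = ast.parse(formula.replace("^", "**"), mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"Unsupported expression: {ast.dump(node)}")

    return walk(tree)


# 0.001 mol/L of a compound with mw = 180.16 g/mol, converted to mg/L.
print(evaluate_formula("x * mw * 1000", x=0.001, mw=180.16))
# The '^' operator is accepted as a synonym for '**'.
print(evaluate_formula("10^(-x)", x=3))
```

The unit bookkeeping behind `x * mw * 1000` is mol/L × g/mol = g/L, and the factor 1000 turns g/L into mg/L.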
Flexible strategies for handling duplicate entries with advanced controls.
- Data Types: Supports `continuous` (e.g., IC50) and `discrete` (e.g., Active/Inactive) targets.
- Methods: `auto`, `IQR`, `3sigma`, `vote`, or custom priority rules.
- Log Transformation: Optional `-log10` transformation (e.g., IC50 $\to$ pIC50) applied before deduplication logic to handle bioactivity data correctly.
- Flexible NaN Handling: Option to retain rows with missing conditions (treating NaN as a valid group) instead of dropping them.
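
The interplay of the log transformation and an IQR-style outlier filter can be sketched as below. This is an illustration under stated assumptions, not DiPTox's algorithm: the `deduplicate` function is hypothetical, values are assumed to be IC50 in mol/L, the fence is the conventional 1.5 × IQR, and the surviving values are simply averaged.

```python
import math
import statistics
from collections import defaultdict


def deduplicate(records, log_transform=True, k=1.5):
    """Collapse repeated (smiles, value) pairs into one value per structure.

    Assumed behaviour for this sketch: values are -log10 transformed first,
    points outside [Q1 - k*IQR, Q3 + k*IQR] are dropped, the rest are averaged.
    """
    groups = defaultdict(list)
    for smiles, value in records:
        groups[smiles].append(-math.log10(value) if log_transform else value)

    result = {}
    for smiles, values in groups.items():
        if len(values) >= 4:  # only filter when there are enough points for quartiles
            q1, _, q3 = statistics.quantiles(values, n=4)
            iqr = q3 - q1
            values = [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]
        result[smiles] = statistics.mean(values)
    return result


# Five consistent IC50 measurements (mol/L) and one gross outlier for the same structure.
data = [
    ("CCO", 1.0e-6), ("CCO", 1.2e-6), ("CCO", 0.9e-6),
    ("CCO", 1.1e-6), ("CCO", 1.0e-6), ("CCO", 1.0e-2),
    ("CCN", 1.0e-5),
]
merged = deduplicate(data)
print({smiles: round(value, 2) for smiles, value in merged.items()})
```

Transforming before filtering matters because IC50 values span orders of magnitude; on the log scale the consistent measurements cluster near pIC50 6 and the outlier at pIC50 2 falls outside the fence.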
- Records every operation (Loading, Preprocessing, Filtering, etc.) in an Audit Log.
- Tracks timestamps, operation details, and row count changes (Delta) step-by-step.
- Available via API (`get_history()`) and visualized in the GUI.
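
As a rough picture of what such a history holds, the sketch below records one entry per step with row counts and a signed delta. The `AuditLog` class is a hypothetical stand-in for illustration only; it is not the structure returned by `get_history()`.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AuditLog:
    """Hypothetical processing history; not DiPTox's own class."""
    entries: list = field(default_factory=list)

    def record(self, step: str, rows_before: int, rows_after: int, details: str = "") -> None:
        """Append one step with a timestamp and a signed row-count change."""
        self.entries.append({
            "Step": step,
            "Timestamp": datetime.now().strftime("%H:%M:%S"),
            "Rows Before": rows_before,
            "Rows After": rows_after,
            "Delta": f"{rows_after - rows_before:+d}",
            "Details": details,
        })


log = AuditLog()
log.record("Data Loading", 0, 1000, "Source: dataset.csv")
log.record("Preprocessing", 1000, 950, "Valid: 950, Invalid: 50")
for entry in log.entries:
    print(entry["Step"], entry["Rows Before"], entry["Rows After"], entry["Delta"])
```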
- Fetch and interconvert identifiers (CAS, SMILES, IUPAC, MW) from multiple sources (PubChem, ChemSpider, CompTox, Cactus, CAS Common Chemistry, ChEMBL).
- High-performance concurrent requests with automatic rate limiting and retries.
- Perform substructure searches using SMILES or SMARTS patterns.
- Customize chemical processing rules for neutralization reactions, salt/solvent lists, and valid atoms.
- Display a summary of all currently active processing rules.
You can install DiPTox using pip or via conda/mamba.

Install the official stable version from PyPI:

    pip install diptox

Installing diptox from the conda-forge channel can be achieved by adding conda-forge to your channels with:

    conda config --add channels conda-forge
    conda config --set channel_priority strict

Once the conda-forge channel has been enabled, diptox can be installed with conda:

    conda install diptox

or with mamba:

    mamba install diptox

After installation, you can launch the graphical interface directly from your terminal:

    diptox-gui

This command will automatically open the DiPTox interface in your default web browser.

    from diptox import DiptoxPipeline


    def main():
        # Initialize processor
        DP = DiptoxPipeline()

        # Load data. input_data accepts a file path, a list, or a DataFrame.
        # The file name and column names below are placeholders.
        DP.load_data(
            input_data='dataset.csv',
            smiles_col='SMILES',
            target_col='Value',
            cas_col='CAS',
            unit_col='Unit',
        )

        # Customize Processing Rules (Optional)
        print("--- Default Rules ---")
        DP.display_processing_rules()
        DP.manage_atom_rules(atoms=['Si'], add=True)         # Add 'Si' to the list of valid atoms
        DP.manage_default_salt(salts=['[Na+]'], add=False)   # Example: remove sodium from the salt list
        DP.manage_default_solvent(solvents='Cl', add=False)  # Example: remove chlorine from the solvent list
        DP.add_neutralization_rule('[$([N-]C=O)]', 'N')      # Add a custom neutralization rule
        print("\n--- Customized Rules ---")
        DP.display_processing_rules()

        # Configure preprocessing
        DP.preprocess(
            remove_salts=True,            # Remove salt fragments. Default: True.
            remove_solvents=True,         # Remove solvent fragments. Default: True.
            remove_mixtures=True,         # Handle mixtures based on fragment size. Default: False.
            hac_threshold=3,              # Heavy atom count threshold for fragment removal. Default: 3.
            keep_largest_fragment=True,   # Keep the largest fragment in a mixture. Default: True.
            remove_inorganic=False,       # Remove common inorganic molecules. Default: True.
            neutralize=True,              # Neutralize charges on the molecule. Default: True.
            reject_non_neutral=False,     # Only retain molecules whose formal charge is zero. Default: False.
            check_valid_atoms=True,       # Check if all atoms are in the valid list. Default: False.
            strict_atom_check=False,      # If True, discard molecules with invalid atoms. If False, try to remove them. Default: False.
            remove_stereo=False,          # Remove stereochemistry information. Default: False.
            remove_isotopes=True,         # Remove isotopic information. Default: True.
            remove_hs=True,               # Remove explicit hydrogen atoms. Default: True.
            reject_radical_species=True,  # Reject molecules containing free radical atoms. Default: True.
            n_jobs=4,                     # Accelerate using 4 CPU cores. Default: 1.
        )

        # Configure deduplication and unit standardization
        conversion_rules = {
            ('g/L', 'mg/L'): 'x * 1000',
            ('M', 'mg/L'): 'x * mw * 1000',
        }
        DP.config_deduplicator(
            data_type='continuous',  # 'continuous' or 'discrete'
            method='auto',           # 'auto', 'IQR', '3sigma', 'vote', or custom priority rules
            standard_unit='mg/L',
            conversion_rules=conversion_rules,
            # Also configurable: condition_cols, custom_method, priority,
            # log_transform, dropna_conditions
        )
        DP.dataset_deduplicate()

        # Configure web queries
        DP.config_web_request(
            sources=['pubchem/chemspider/comptox/cactus/cas'],
            max_workers=4,  # placeholder value
            # ...
        )
        DP.web_request(send='cas', request=['smiles', 'iupac'])

        # Substructure search (the pattern is a placeholder: an aromatic six-membered carbon ring)
        DP.substructure_search('c1ccccc1', is_smarts=True)

        # Save results
        DP.save_results(output_path='file_path')

        # View Processing History (Audit Log)
        print(DP.get_history())
        # Output Example:
        #    Step           Timestamp  Rows Before  Rows After  Delta  Details
        # 0  Data Loading   10:00:01   0            1000        +1000  Source: dataset.csv
        # 1  Preprocessing  10:00:05   1000         950         -50    Valid: 950, Invalid: 50. Order: ...
        # 2  Deduplication  10:00:08   950          800         -150   Method: auto (Log10 Transformed)


    # CRITICAL: This protection block is REQUIRED for Windows multiprocessing!
    # It prevents infinite recursive loops and memory explosion when n_jobs > 1.
    if __name__ == '__main__':
        main()

DiPTox supports the following chemical databases:
- PubChem: https://pubchem.ncbi.nlm.nih.gov/
- ChemSpider: https://www.chemspider.com/
- CompTox: https://comptox.epa.gov/dashboard/
- Cactus: https://cactus.nci.nih.gov/
- CAS: https://commonchemistry.cas.org/
- ChEMBL: https://www.ebi.ac.uk/chembl/

Note: ChemSpider, CompTox and CAS require API keys. Provide them during configuration:

    DP.config_web_request(
        sources=['chemspider/comptox/CAS'],
        chemspider_api_key='your_personal_key',
        comptox_api_key='your_personal_key',
        cas_api_key='your_personal_key'
    )

- Python>=3.8
- Core Dependencies:
  - `requests`
  - `rdkit>=2023.3`
  - `tqdm`
  - `openpyxl`
  - `scipy`
  - `streamlit>=1.0.0` (Required for GUI)
- Optional Dependencies (install as needed; if a package is not installed, the request is sent using `requests` instead):
  - `pubchempy>=1.0.5`: For PubChem integration
  - `chemspipy>=2.0.0`: For ChemSpider (requires API key)
  - `ctx-python>=0.0.1a10`: For CompTox Dashboard (requires API key)

Apache License 2.0 - See LICENSE for details
Report issues on GitHub Issues
