Skip to content

Hya0FAD/DiPTox

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

53 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DiPTox - Data Integration and Processing for Computational Toxicology

PyPI Conda Conda Platforms License Python Version Chinese PyPI Downloads Conda Downloads

DiPTox Workflow Diagram

**DiPTox** is a Python toolkit designed for the robust preprocessing, standardization, and multi-source data integration of molecular datasets, with a focus on computational toxicology workflows.

Official Release v1.0

We are excited to announce the first official stable release of DiPTox on PyPI! This milestone brings production-ready stability and performance enhancements:

  • Multi-Process Acceleration:
    • Accelerate chemical preprocessing tasks by 10x or more using the n_jobs parameter.
    • Intelligent task distribution across CPU cores for heavy datasets.
  • Cross-Platform Robustness:
    • Implemented a specialized "Guard Mechanism" for Windows multiprocessing to prevent memory explosion and recursive process spawning issues.
    • Verified stability across Windows, Linux, and macOS environments.
  • Enhanced Data Loading:
    • Switched to binary stream parsing for .sdf and .mol files to resolve encoding crashes (e.g., utf-8 vs latin-1).
    • Auto-parsing of molecular structures to generate SMILES even when properties are missing.

Version Update Log (1.0.6)

  • Data Loading Fixes: Fixed and optimized the native parsing logic for .smi (SMILES) files, resolving previous reading issues to ensure stable ingestion of large-scale chemical databases.
  • Web Request Module Overhaul: Completely refactored the network request engine for stability and transparency. This update introduces a "Capability Map" and fast-fail logic to intelligently intercept unsupported queries and Auth/404 errors (eliminating infinite retry deadlocks). Furthermore, it eradicates "silent failures" by logging highly granular failure reasons (e.g., Failed -> pubchem: Not Found | chemspider: Auth Error (401)), and implements field-level data provenance to strictly record the exact source for each molecular property, drastically improving dataset auditability.

Version Update Log (1.0.5)

  • Enhanced Unit Standardization: Added support for the standard math operator ^ (power) by automatically mapping it to **, and fixed a logic error that caused single-unit datasets to be skipped even when a different target unit was specified.
  • Deduplication Logic Upgrades: Introduced a log10 transformation mode alongside the existing -log10 option, enabling support for both toxicity data (pIC50) and physicochemical properties like water solubility (logS) or partition coefficients.
  • Robustness & Error Handling: Implemented strict numerical validation using errors='coerce' in standardization and deduplication modules to automatically filter out invalid strings (e.g., "N/A", ">100") with clear warning feedback in the GUI.
  • Critical State Management Fix: Resolved an issue where load_data failed to reset the _preprocess_key flag, ensuring that automatic column mapping logic for Web Requests (like auto-detecting smiles_from_web) functions correctly after a new dataset is loaded.

Version Update Log (1.0.4)

  • GUI State Management Fix: Resolved a StreamlitAPIException on the Export page that occurred when using the "Undo Last Step" feature. Implemented proper on_click callbacks to safely mutate the session_state (specifically for export_selected_cols) before the UI re-renders, ensuring a crash-free and seamless undo experience.
  • Refined Preprocessing Rules: Adjusted and optimized several default charge neutralization rules.

Version Update Log (1.0.3)

  • Enhanced Unit Standardization: Custom conversion formulas now fully support molecular weight (mw). You can seamlessly convert between molarity and mass concentrations (e.g., using formulas like x * mw * 1000).
  • GUI Interface Optimization: The Streamlit graphical interface has been beautifully redesigned for a more professional, clean, and logical scientific layout. We've reduced visual clutter, grouped configuration panels intuitively, and improved component alignment.
  • Comprehensive Audit Log (History): The processing history has been heavily upgraded. It now records granular parameters for every operation—including exactly which preprocessing rules were triggered, active deduplication conditions, web query statuses, and substructure search match counts.

DiPTox Community Check-in (Optional)

To help us understand our user base and improve the software, DiPTox includes a one-time, optional survey on first use.

  • Completely Optional: You can skip it with a single click.
  • Privacy-Focused: The information helps us with academic impact assessment. It will not be shared.

Core Features

1. Graphical User Interface (GUI)

Powered by Streamlit, the GUI allows users to perform all workflows visually without writing code.

  • Visual Operation: Complete workflow control via a web browser.
  • Real-time Preview: Instantly view data changes after applying rules.
  • Rule Management: Add/Remove valid atoms, salts, solvents, and unit conversion formulas interactively.
  • Smart Column Mapping: Intelligent detection of headers and binary file structures.

2. Chemical Preprocessing & Standardization

A configurable pipeline to clean and normalize chemical structures.

  • Strict Inorganic Filtering: Updated SMARTS matching to accurately identify complex inorganic species (e.g., ionic cyanides) without misclassifying organic nitriles.
  • Pipeline Steps:
    • Remove salts & solvents
    • Handle mixtures (keep largest fragment)
    • Remove inorganic molecules
    • Neutralize charges & Validate atomic composition
    • Remove explicit hydrogens, stereochemistry, and isotopes
    • Reject Radical Species: Automatically discard molecules containing free radical atoms.
    • Standardize to canonical SMILES
    • Filter by atom count

3. Unit Standardization

Normalize heterogeneous target data into a single unit effortlessly.

  • Automatic Conversion: Built-in rules for Concentration, Time, Pressure, and Temperature.
  • Custom Formulas: Define mathematical rules (e.g., x * 1000 or 10**(-x)) interactively via GUI or script.
  • Unified Output: Standardize diverse units (e.g., ug/mL, g/L, M) to a single target (e.g., mg/L).

4. Data Deduplication

Flexible strategies for handling duplicate entries with advanced controls.

  • Data Types: Supports continuous (e.g., IC50) and discrete (e.g., Active/Inactive) targets.
  • Methods: auto, IQR, 3sigma, vote, or custom priority rules.
  • Log Transformation: Optional -log10 transformation (e.g., IC50 $\to$ pIC50) applied before deduplication logic to handle bioactivity data correctly.
  • Flexible NaN Handling: Option to retain rows with missing conditions (treating NaN as a valid group) instead of dropping them.

5. Comprehensive History Tracking (Audit Log)

  • Records every operation (Loading, Preprocessing, Filtering, etc.) in an Audit Log.
  • Tracks timestamps, operation details, and row count changes (Delta) step-by-step.
  • Available via API (get_history()) and visualized in the GUI.

6. Identifier & Property Integration

  • Fetch and interconvert identifiers (CAS, SMILES, IUPAC, MW) from multiple sources (PubChem, ChemSpider, CompTox, Cactus, CAS Common Chemistry, ChEMBL).
  • High-performance concurrent requests with automatic rate limiting and retries.

7. Utility Tools

  • Perform substructure searches using SMILES or SMARTS patterns.
  • Customize chemical processing rules for neutralization reactions, salt/solvent lists, and valid atoms.
  • Display a summary of all currently active processing rules.

Installation

You can install DiPTox using pip or via conda/mamba.

Option 1: Install from PyPI

Install the official stable version from PyPI:

pip install diptox

Option 2: Install from Conda-forge

Installing diptox from the conda-forge channel can be achieved by adding conda-forge to your channels with:

conda config --add channels conda-forge
conda config --set channel_priority strict

Once the conda-forge channel has been enabled, diptox can be installed with conda:

conda install diptox

or with mamba:

mamba install diptox

GUI

After installation, you can launch the graphical interface directly from your terminal:

diptox-gui

This command will automatically open the DiPTox interface in your default web browser.

Quick Start

from diptox import DiptoxPipeline

def main():
    # Initialize processor
    DP = DiptoxPipeline()

    # Load data
    DP.load_data(input_data='file_path/list/dataframe', smiles_col, target_col, cas_col, unit_col)

    # Customize Processing Rules (Optional)
    print("--- Default Rules ---")
    DP.display_processing_rules()

    DP.manage_atom_rules(atoms=['Si'], add=True)         # Add 'Si' to the list of valid atoms
    DP.manage_default_salt(salts=['[Na+]'], add=False)   # Example: remove sodium from the salt list
    DP.manage_default_solvent(solvents='Cl', add=False)  # Example: remove chlorine from the solvent list
    DP.add_neutralization_rule('[$([N-]C=O)]', 'N')      # Add a custom neutralization rule

    print("\n--- Customized Rules ---")
    DP.display_processing_rules()

    # Configure preprocessing
    DP.preprocess(
        remove_salts=True,              # Remove salt fragments. Default: True.
        remove_solvents=True,           # Remove solvent fragments. Default: True.
        remove_mixtures=True,           # Handle mixtures based on fragment size. Default: False.
        hac_threshold=3,                # Heavy atom count threshold for fragment removal. Default: 3.
        keep_largest_fragment=True,     # Keep the largest fragment in a mixture. Default: True.
        remove_inorganic=False,         # Remove common inorganic molecules. Default: True.
        neutralize=True,                # Neutralize charges on the molecule. Default: True.
        reject_non_neutral=False,       # Only retain the molecules whose formal charge is zero. Default: False.
        check_valid_atoms=True,         # Check if all atoms are in the valid list. Default: False.
        strict_atom_check=False,        # If True, discard molecules with invalid atoms. If False, try to remove them. Default: False.
        remove_stereo=False,            # Remove stereochemistry information. Default: False.
        remove_isotopes=True,           # Remove isotopic information. Default: True.
        remove_hs=True,                 # Remove explicit hydrogen atoms. Default: True.
        reject_radical_species=True,    # Molecules containing free radical atoms are directly rejected. Default: True.
        n_jobs=4                        # Accelerate using 4 CPU cores. Default: 1 
    )

    # Configure deduplication and unit standardization
    conversion_rules = {('g/L', 'mg/L'): 'x * 1000', 
                        ('M', 'mg/L'): 'x * mw * 1000',}
    DP.config_deduplicator(condition_cols, data_type, method, custom_method, priority, standard_unit, conversion_rules, log_transform, dropna_conditions)
    DP.dataset_deduplicate()

    # Configure web queries
    DP.config_web_request(sources=['pubchem/chemspider/comptox/cactus/cas'], max_workers, ...)
    DP.web_request(send='cas', request=['smiles', 'iupac'])

    # Substructure search
    DP.substructure_search(query_pattern, is_smarts=True)

    # Save results
    DP.save_results(output_path='file_path')

    # View Processing History (Audit Log)
    print(DP.get_history())
    # Output Example:
    #               Step Timestamp  Rows Before  Rows After   Delta                               Details
    # 0     Data Loading  10:00:01            0        1000   +1000                   Source: dataset.csv
    # 1    Preprocessing  10:00:05         1000         950     -50  Valid: 950, Invalid: 50. Order: ...
    # 2    Deduplication  10:00:08          950         800    -150       Method: auto (Log10 Transformed)

# CRITICAL: This protection block is REQUIRED for Windows multiprocessing!
# It prevents infinite recursive loops and memory explosion when n_jobs > 1.
if __name__ == '__main__':
    main()

Advanced Configuration

Web Service Integration

DiPTox supports the following chemical databases:

Note: ChemSpider, CompTox and CAS require API keys. Provide them during configuration:

DP.config_web_request(
    sources=['chemspider/comptox/CAS'],
    chemspider_api_key='your_personal_key',
    comptox_api_key='your_personal_key',
    cas_api_key='your_personal_key'
)

Requirements

  • Python>=3.8
  • Core Dependencies:
    • requests
    • rdkit>=2023.3
    • tqdm
    • openpyxl
    • scipy
    • streamlit>=1.0.0 (Required for GUI)
  • Optional Dependencies (install as needed, if not installed, then send the request using requests.):
    • pubchempy>=1.0.5: For PubChem integration
    • chemspipy>=2.0.0: For ChemSpider (requires API key)
    • ctx-python>=0.0.1a10: For CompTox Dashboard (requires API key)

License

Apache License 2.0 - See LICENSE for details

Support

Report issues on GitHub Issues