diff --git a/tutorials/data_exploration_chebi.ipynb b/tutorials/data_exploration_chebi.ipynb new file mode 100644 index 00000000..81256f4a --- /dev/null +++ b/tutorials/data_exploration_chebi.ipynb @@ -0,0 +1,1099 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0bd757ea-a6a0-43f8-8701-cafb44f20f6b", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "This notebook serves as a guide for new developers using the `chebai` package. If you just want to run the experiments, you can refer to the [README.md](https://github.com/ChEB-AI/python-chebai/blob/dev/README.md) and the [wiki](https://github.com/ChEB-AI/python-chebai/wiki) for the basic commands. This notebook explains what happens under the hood for the ChEBI dataset. It covers\n", + "- how to instantiate a data class and generate data\n", + "- how the data is processed and stored\n", + "- and how to work with different molecule encodings.\n", + "\n", + "The chebai package simplifies the handling of these datasets by **automatically creating** them as needed. This means that you do not have to input any data manually; the package will generate and organize the data files based on the parameters and encodings selected. This feature ensures that the right data is available and formatted properly. You can however provide your own data files, for instance if you want to replicate a specific experiment.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "id": "4550d01fc7af5ae4", + "metadata": {}, + "source": [ + "# 1. Instantiation of a Data Class\n", + "\n", + "To start working with `chebai`, you first need to instantiate a ChEBI data class. This class is responsible for managing, interacting with, and preprocessing the ChEBI chemical data." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "990cc6f2-6b4a-4fa7-905f-dda183c3ec4c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Already in the project root directory: G:\\github-aditya0by0\\python-chebai\n" + ] + } + ], + "source": [ + "# To run this notebook, you need to change the working directory of the jupyter notebook to root dir of the project.\n", + "import os\n", + "\n", + "# Root directory name of the project\n", + "expected_root_dir = \"python-chebai\"\n", + "\n", + "# Check if the current directory ends with the expected root directory name\n", + "if not os.getcwd().endswith(expected_root_dir):\n", + " os.chdir(\"..\") # Move up one directory level\n", + " if os.getcwd().endswith(expected_root_dir):\n", + " print(\"Changed to project root directory:\", os.getcwd())\n", + " else:\n", + " print(\"Warning: Directory change unsuccessful. 
Current directory:\", os.getcwd())\n", + "else:\n", + " print(\"Already in the project root directory:\", os.getcwd())" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "f3a66e07-edc9-4aa2-9cd0-d4ea58914d22", + "metadata": {}, + "outputs": [], + "source": [ + "from chebai.preprocessing.datasets.chebi import ChEBIOver50" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "a71b7301-6195-4155-a439-f5eb3183d0f3", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:07:26.371796Z", + "start_time": "2024-10-05T21:07:26.058728Z" + } + }, + "outputs": [], + "source": [ + "chebi_class = ChEBIOver50(chebi_version=231)" + ] + }, + { + "cell_type": "markdown", + "id": "b810d7c9-4f7f-4725-9bc2-452ff2c3a89d", + "metadata": {}, + "source": [ + "\n", + "### Inheritance Hierarchy\n", + "\n", + "ChEBI data classes inherit from [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L598), which in turn inherits from [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L23). Specifically:\n", + "\n", + "- **`_DynamicDataset`**: This class serves as an intermediate base class that provides additional functionality or customization for datasets that require dynamic behavior. It inherits from `XYBaseDataModule`, which provides the core methods for data loading and processing.\n", + "\n", + "- **`XYBaseDataModule`**: This is the base class for data modules, providing foundational properties and methods for handling and processing datasets, including data splitting, loading, and preprocessing.\n", + "\n", + "In summary, ChEBI data classes are designed to manage and preprocess chemical data effectively by leveraging the capabilities provided by `XYBaseDataModule` through the `_DynamicDataset` intermediary.\n", + "\n", + "\n", + "### Input parameters\n", + "A ChEBI data class can be configured with a range of parameters, including:\n", + "\n", + "- **chebi_version (int)**: Specifies the version of the ChEBI database to be used. The default is `200`. Specifying a version ensures the reproducibility of your experiments by using a consistent dataset.\n", + "\n", + "- **chebi_version_train (int, optional)**: The version of ChEBI to use specifically for training and validation. If not set, the `chebi_version` specified will be used for all data splits, including training, validation, and test. Defaults to `None`.\n", + "\n", + "- **splits_file_path (str, optional)**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. 
Defaults to `None`.\n", + "\n", + "### Additional Input Parameters\n", + "\n", + "To get more control over various aspects of data loading, processing, and splitting, you can refer to the documentation of additional parameters in the docstrings of the respective classes: [`_ChEBIDataExtractor`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/chebi.py#L108), [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22), [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), etc.\n" + ] + }, + { + "cell_type": "markdown", + "id": "8578b7aa-1bd9-4e50-9eee-01bfc6d5464a", + "metadata": {}, + "source": [ + "# Available ChEBI Data Classes\n", + "\n", + "__Note__: Check the code implementation of the classes [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/chebi.py).\n", + "\n", + "There is a range of available dataset classes for ChEBI. Usually, you want to use `ChEBIOver100` or `ChEBIOver50`. The number indicates the threshold for selecting label classes: ChEBI classes which have at least 100 / 50 SMILES-annotated subclasses will be used as labels.\n", + "\n", + "Both inherit from `ChEBIOverX`. If you need a different threshold, you can create your own subclass. By default, `ChEBIOverX` uses the SMILES encoding (see Section 5). The other implemented encodings are SELFIES and DeepSMILES, used by the classes `ChEBIOverXSELFIES` and `ChEBIOverXDeepSMILES`, respectively. \n", + "They also have subclasses for different thresholds (`ChEBIOver50SELFIES`, `ChEBIOver100SELFIES`, `ChEBIOver100DeepSMILES`).\n", + "\n", + "Finally, `ChEBIOver50Partial` extracts a part of ChEBI based on a given top class, with a threshold of 50 for selecting labels.\n", + "This class inherits from `ChEBIOverXPartial` and `ChEBIOver50`.\n" + ] + }, + { + "cell_type": "markdown", + "id": "8456b545-88c5-401d-baa5-47e8ae710f04", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "ed973fb59df11849", + "metadata": {}, + "source": [ + "# 2. Preparation / Setup Methods\n", + "\n", + "Now we have a ChEBI data class with all the relevant parameters. Next, we need to generate the actual dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "d0a58e2bd9c0e6d9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checking for processed data in data\chebi_v231\ChEBI50\processed\n", + "Missing processed data file (`data.pkl` file)\n", + "Missing raw chebi data related to version: v_231, Downloading...\n", + "Compute transitive closure\n", + "Process graph\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Check for processed data in data\chebi_v231\ChEBI50\processed\smiles_token\n", + "Cross-validation enabled: False\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Missing transformed data (`data.pt` file). Transforming data.... 
\n", + "Processing 185007 lines...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████████████████████████████████████████████████████████████████| 185007/185007 [05:43<00:00, 539.23it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "saving 771 tokens to G:\\github-aditya0by0\\python-chebai\\chebai\\preprocessing\\bin\\smiles_token\\tokens.txt...\n", + "first 10 tokens: ['[*-]', '[Al-]', '[F-]', '.', '[H]', '[N]', '(', ')', '[Ag+]', 'C']\n" + ] + } + ], + "source": [ + "chebi_class.prepare_data()\n", + "chebi_class.setup()" + ] + }, + { + "cell_type": "markdown", + "id": "1655d489-25fe-46de-9feb-eeca5d36936f", + "metadata": {}, + "source": [ + "\n", + "### Automatic Execution: \n", + "These methods are executed automatically when using the training command `chebai fit`. Users do not need to call them explicitly, as the code internally manages the preparation and setup of data, ensuring that it is ready for subsequent use in training and validation processes.\n", + "\n", + "### Why is Preparation Needed?\n", + "\n", + "- **Data Availability**: The preparation step ensures that the required ChEBI data files are downloaded or loaded, which are essential for analysis.\n", + "- **Data Integrity**: It ensures that the data files are transformed into a compatible format required for model input.\n", + "\n", + "### Main Methods for Data Preprocessing\n", + "\n", + "The data preprocessing in a data class involves two main methods:\n", + "\n", + "1. **`prepare_data` Method**:\n", + " - **Purpose**: This method checks for the presence of raw data in the specified directory. If the raw data is missing, it fetches the ontology, creates a dataframe, and saves it to a file (`data.pkl`). The dataframe includes columns such as IDs, data representations, and labels. This step is independent of input encodings and all chemicals are stored as SMILES strings.\n", + " - **Documentation**: [PyTorch Lightning - `prepare_data`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#prepare-data)\n", + "\n", + "2. **`setup` Method**:\n", + " - **Purpose**: This method sets up the data module for training, validation, and testing. It checks for the processed data and, if necessary, performs additional setup to ensure the data is ready for model input. It also handles cross-validation settings if enabled.\n", + " - **Description**: Transforms `data.pkl` into a model input data format (`data.pt`), tokenizing the input according to the specified encoding. The transformed data contains the following keys: `ident`, `features`, `labels`, and `group`. This method uses a subclass of Data Reader to perform the tokenization.\n", + " - **Documentation**: [PyTorch Lightning - `setup`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#setup)\n", + "\n", + "These methods ensure that the data is correctly prepared and set up for subsequent use in training and validation processes." + ] + }, + { + "cell_type": "markdown", + "id": "f5aaa12d-5f01-4b74-8b59-72562af953bf", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "bb6e9a81554368f7", + "metadata": {}, + "source": [ + "# 3. Overview of the 3 preprocessing stages\n", + "\n", + "The `chebai` library follows a three-stage preprocessing pipeline, which is reflected in its file structure:\n", + "\n", + "1. 
**Raw Data Stage**:\n", + " - **File**: `chebi.obo`\n", + " - **Description**: This stage contains the raw ChEBI ontology data, serving as the initial input for further processing.\n", + " - **File Path**: `data/${chebi_version}/${dataset_name}/raw/${filename}.obo`\n", + "\n", + "2. **Processed Data Stage 1**:\n", + " - **File**: `data.pkl`\n", + " - **Description**: This stage includes the data after initial processing. It contains SMILES strings, class columns, and metadata but lacks data splits.\n", + " - **File Path**: `data/${chebi_version}/${dataset_name}/processed/data.pkl`\n", + " - **Additional File**: `classes.txt` - A file listing the relevant ChEBI classes.\n", + "\n", + "3. **Processed Data Stage 2**:\n", + " - **File**: `data.pt`\n", + " - **Description**: This final stage includes the encoded data in a format compatible with PyTorch, ready for model input. This stage also references data splits when available.\n", + " - **File Path**: `data/${chebi_version}/${dataset_name}/processed/${reader_name}/data.pt`\n", + " - **Additional File**: `splits.csv` - Contains saved splits for reproducibility.\n", + "\n", + "### Summary of File Paths\n", + "\n", + "- **Raw Data**: `data/${chebi_version}/${dataset_name}/raw`\n", + "- **Processed Data 1**: `data/${chebi_version}/${dataset_name}/processed`\n", + "- **Processed Data 2**: `data/${chebi_version}/${dataset_name}/processed/${reader_name}`\n", + "\n", + "This structured approach to data management ensures that each stage of data processing is well-organized and documented, from raw data acquisition to the preparation of model-ready inputs. It also facilitates reproducibility and traceability across different experiments.\n", + "\n", + "### Data Splits\n", + "\n", + "- **Creation**: Data splits are generated dynamically \"on the fly\" during training and evaluation to ensure flexibility and adaptability to different tasks.\n", + "- **Reproducibility**: To maintain consistency across different runs, splits can be reproduced by comparing hashes with a fixed seed value.\n" + ] + }, + { + "cell_type": "markdown", + "id": "7e172c0d1e8bb93f", + "metadata": {}, + "source": [ + "# 4. Data Files and their structure\n", + "\n", + "`chebai` creates and manages several data files during its operation. These files store various chemical data and metadata essential for different tasks. Let’s explore these files and their content.\n" + ] + }, + { + "cell_type": "markdown", + "id": "43329709-5134-4ce5-88e7-edd2176bf84d", + "metadata": {}, + "source": [ + "## chebi.obo File\n", + "\n", + "**Description**: Contains the raw ChEBI ontology data, downloaded directly from the ChEBI website. This file serves as the foundation for data processing.\n", + " \n", + "\n", + "#### Example of a Term Document\n", + "\n", + "```plaintext\n", + "[Term]\n", + "id: CHEBI:24867\n", + "name: monoatomic ion\n", + "subset: 3_STAR\n", + "synonym: \"monoatomic ions\" RELATED [ChEBI]\n", + "is_a: CHEBI:24870\n", + "is_a: CHEBI:33238\n", + "```\n", + "\n", + "**File Path**: `data/${chebi_version}/${dataset_name}/raw/${filename}.obo`\n", + "\n", + "\n", + "### Structure of `chebi.obo`\n", + "\n", + "The `chebi.obo` file is organized into blocks of text known as \"term documents.\" Each block starts with a `[Term]` header and contains various attributes that describe a specific chemical entity within the ChEBI ontology. 
These attributes include identifiers, names, relationships to other entities, and more.\n", + "\n", + "\n", + "### Breakdown of Attributes\n", + "\n", + "Each term document in the `chebi.obo` file consists of the following key attributes:\n", + "\n", + "- **`[Term]`**: \n", + "  - **Description**: Indicates the beginning of a new term in the ontology. Each term represents a distinct chemical entity.\n", + "\n", + "- **`id: CHEBI:24867`**: \n", + "  - **Description**: A unique identifier for the chemical entity within the ChEBI database.\n", + "  - **Example**: `CHEBI:24867` refers to the entity \"monoatomic ion.\"\n", + "\n", + "- **`name: monoatomic ion`**: \n", + "  - **Description**: The common name of the chemical entity. This is the main descriptor used to identify the term.\n", + "  - **Example**: \"monoatomic ion\" is the name of the entity `CHEBI:24867`.\n", + "\n", + "- **`synonym: \"monoatomic ions\" RELATED [ChEBI]`**: \n", + "  - **Description**: An alternative name for the entity; the qualifier (here `RELATED`) indicates a related term within the ChEBI ontology.\n", + "\n", + "- **`is_a: CHEBI:24870`** and **`is_a: CHEBI:33238`**: \n", + "  - **Description**: Defines hierarchical relationships to other terms within the ontology. The `is_a` attribute indicates that the current entity is a subclass or specific instance of the referenced term.\n", + "  - **Example**: The entity `CHEBI:24867` (\"monoatomic ion\") is a subclass of both `CHEBI:24870` and `CHEBI:33238`, meaning it inherits from both parent classes.\n", + "\n", + "The files described in the following sections represent the different stages of preprocessing, from raw input files to processed, model-ready formats." + ] + }, + { + "cell_type": "markdown", + "id": "558295e5a7ded456", + "metadata": {}, + "source": [ + "## data.pkl File\n", + "\n", + "**Description**: Generated by the `prepare_data` method, this file contains processed data in a dataframe format. It includes the ChEBI IDs, chemical representations (SMILES strings), and columns for each label with boolean values." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "fd490270-59b8-4c1c-8b09-204defddf592", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:09:01.622317Z", + "start_time": "2024-10-05T21:09:01.606698Z" + } + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import os" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "d7d16247-092c-4e8d-96c2-ab23931cf766", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:11:51.296162Z", + "start_time": "2024-10-05T21:11:44.559304Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Size of the data (rows x columns): (185007, 1514)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idnameSMILES1722244024682571258026343098...176910177333183508183509189832189840192499194321197504229684
033429monoatomic monoanion[*-]FalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
130151aluminide(1-)[Al-]FalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
216042halide anion[*-]FalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
317051fluoride[F-]FalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
428741sodium fluoride[F-].[Na+]FalseFalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", + "

5 rows × 1514 columns

\n", + "
" + ], + "text/plain": [ + " id name SMILES 1722 2440 2468 2571 2580 \\\n", + "0 33429 monoatomic monoanion [*-] False False False False False \n", + "1 30151 aluminide(1-) [Al-] False False False False False \n", + "2 16042 halide anion [*-] False False False False False \n", + "3 17051 fluoride [F-] False False False False False \n", + "4 28741 sodium fluoride [F-].[Na+] False False False False False \n", + "\n", + " 2634 3098 ... 176910 177333 183508 183509 189832 189840 192499 \\\n", + "0 False False ... False False False False False False False \n", + "1 False False ... False False False False False False False \n", + "2 False False ... False False False False False False False \n", + "3 False False ... False False False False False False False \n", + "4 False False ... False False False False False False False \n", + "\n", + " 194321 197504 229684 \n", + "0 False False False \n", + "1 False False False \n", + "2 False False False \n", + "3 False False False \n", + "4 False False False \n", + "\n", + "[5 rows x 1514 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pkl_df = pd.DataFrame(\n", + " pd.read_pickle(\n", + " os.path.join(\n", + " chebi_class.processed_dir_main,\n", + " chebi_class.processed_dir_main_file_names_dict[\"data\"],\n", + " )\n", + " )\n", + ")\n", + "print(\"Size of the data (rows x columns): \", pkl_df.shape)\n", + "pkl_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "322bc926-69ff-4b93-9e95-5e8b85869c38", + "metadata": {}, + "source": [ + "**File Path**: `data/${chebi_version}/${dataset_name}/processed/data.pkl`\n", + "\n", + "\n", + "### Structure of `data.pkl`\n", + "`data.pkl` as following structure: \n", + "- **Column 0**: Contains the ID of each ChEBI data instance.\n", + "- **Column 1**: Contains the name of each ChEBI data instance.\n", + "- **Column 2**: Contains the SMILES representation of the chemical.\n", + "- **Column 3 and onwards**: Contains the labels, starting from column 3.\n", + "\n", + "This structure ensures that the data is organized and ready for further processing, such as further encoding.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ba019d2d4324bd0b", + "metadata": {}, + "source": [ + "## data.pt File\n", + "\n", + "\n", + "**Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library, specifically as a list of dictionaries. Each dictionary in this list includes keys such as `ident`, `features`, `labels`, and `group`, ready for model input." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "977ddd83-b469-4b58-ab1a-8574fb8769b4", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:12:49.338943Z", + "start_time": "2024-10-05T21:12:49.323319Z" + } + }, + "outputs": [], + "source": [ + "import torch" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "3266ade9-efdc-49fe-ae07-ed52b2eb52d0", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:14:12.892845Z", + "start_time": "2024-10-05T21:13:59.859953Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Type of loaded data: <class 'list'>\n" + ] + } + ], + "source": [ + "data_pt = torch.load(\n", + "    os.path.join(\n", + "        chebi_class.processed_dir, chebi_class.processed_file_names_dict[\"data\"]\n", + "    ),\n", + "    weights_only=False,\n", + ")\n", + "print(\"Type of loaded data:\", type(data_pt))" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "84cfa3e6-f60d-47c0-9f82-db3d5673d1e7", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:14:21.185027Z", + "start_time": "2024-10-05T21:14:21.169358Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'features': [10], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 33429, 'group': None}\n", + "{'features': [11], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 30151, 'group': None}\n", + "{'features': [10], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 16042, 'group': None}\n", + "{'features': [12], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 17051, 'group': None}\n", + "{'features': [12, 13, 32], 'labels': array([False, False, False, ..., False, False, False]), 'ident': 28741, 'group': None}\n" + ] + } + ], + "source": [ + "for i in range(5):\n", + "    print(data_pt[i])" + ] + }, + { + "cell_type": "markdown", + "id": "0d80ffbb-5f1e-4489-9bc8-d688c9be1d07", + "metadata": {}, + "source": [ + "**File Path**: `data/${chebi_version}/${dataset_name}/processed/${reader_name}/data.pt`\n", + "\n", + "\n", + "### Structure of `data.pt`\n", + "\n", + "The `data.pt` file is a list where each element is a dictionary with the following keys:\n", + "\n", + "- **`features`**: \n", + "  - **Description**: This key holds the input features for the model. The features are typically stored as tensors and represent the attributes used by the model for training and evaluation.\n", + "\n", + "- **`labels`**: \n", + "  - **Description**: This key contains the labels or target values associated with each instance. Labels are also stored as tensors and are used by the model to learn and make predictions.\n", + "\n", + "- **`ident`**: \n", + "  - **Description**: This key holds identifiers for each data instance. These identifiers help track and reference the individual samples in the dataset.\n", + "\n", + "- **`group`**: \n", + "  - **Description**: This key holds optional grouping information for each instance; for the ChEBI datasets it is `None`, as seen in the records above.\n" + ] + }, + { + "cell_type": "markdown", + "id": "186ec6f0eed6ecf7", + "metadata": {}, + "source": [ + "## classes.txt File\n", + "\n", + "**Description**: A file containing the list of selected ChEBI classes based on the specified threshold. This file is crucial for ensuring that only relevant classes are included in the dataset."
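, + "\n", + "Since each label column corresponds to one line of `classes.txt`, a label vector from `data.pt` can be mapped back to ChEBI IDs. A small illustrative sketch (not part of the chebai API; it assumes the label order matches `classes.txt` and that `data_pt` from above is still in memory):\n", + "\n", + "```python\n", + "import os\n", + "\n", + "# Ordered list of selected ChEBI class IDs, one per line.\n", + "with open(os.path.join(chebi_class.processed_dir_main, \"classes.txt\")) as f:\n", + "    class_ids = [int(line) for line in f]\n", + "\n", + "# Recover the positive classes of the first encoded record.\n", + "sample = data_pt[0]\n", + "positives = [cid for cid, flag in zip(class_ids, sample[\"labels\"]) if flag]\n", + "print(f\"CHEBI:{sample['ident']} is annotated with {len(positives)} label classes\")\n", + "```"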
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "8d1fbe6c-beb8-4038-93d4-c56bc7628716", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:15:19.146285Z", + "start_time": "2024-10-05T21:15:18.503284Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1722\n", + "2440\n", + "2468\n", + "2571\n", + "2580\n" + ] + } + ], + "source": [ + "with open(os.path.join(chebi_class.processed_dir_main, \"classes.txt\"), \"r\") as file:\n", + " for i in range(5):\n", + " line = file.readline()\n", + " print(line.strip())" + ] + }, + { + "cell_type": "markdown", + "id": "861da1c3-0401-49f0-a22f-109814ed95d5", + "metadata": {}, + "source": [ + "\n", + "**File Path**: `data/${chebi_version}/${dataset_name}/processed/classes.txt`\n", + "\n", + "The `classes.txt` file lists selected ChEBI (Chemical Entities of Biological Interest) classes. These classes are chosen based on a specified threshold, which is typically used for filtering or categorizing the dataset. Each line in the file corresponds to a unique ChEBI class ID, identifying specific chemical entities within the ChEBI ontology.\n", + "\n", + "This file is essential for organizing the data and ensuring that only relevant classes, as defined by the threshold, are included in subsequent processing and analysis tasks.\n" + ] + }, + { + "cell_type": "markdown", + "id": "fb72be449e52b63f", + "metadata": {}, + "source": [ + "## splits.csv File\n", + "\n", + "**Description**: Contains saved data splits from previous runs. During subsequent runs, this file is used to reconstruct the train, validation, and test splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "3ebdcae4-4344-46bd-8fc0-a82ef5d40da5", + "metadata": { + "ExecuteTime": { + "end_time": "2024-10-05T21:15:54.575116Z", + "start_time": "2024-10-05T21:15:53.945139Z" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idsplit
033429train
130151train
217051train
332129train
430340train
\n", + "
" + ], + "text/plain": [ + " id split\n", + "0 33429 train\n", + "1 30151 train\n", + "2 17051 train\n", + "3 32129 train\n", + "4 30340 train" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "csv_df = pd.read_csv(os.path.join(chebi_class.processed_dir_main, \"splits.csv\"))\n", + "csv_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "b058714f-e434-4367-89b9-74c129ac727f", + "metadata": {}, + "source": [ + "\n", + "\n", + "**File Path**: `data/${chebi_version}/${dataset_name}/processed/splits.csv`\n", + "\n", + "The `splits.csv` file contains the saved data splits from previous runs, including the train, validation, and test sets. During subsequent runs, this file is used to reconstruct these splits by filtering the encoded data (`data.pt`) based on the IDs stored in `splits.csv`. This ensures consistency and reproducibility in data splitting, allowing for reliable evaluation and comparison of model performance across different run.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "6dc3fd6c-7cf6-47ef-812f-54319a0cdeb9", + "metadata": {}, + "outputs": [], + "source": [ + "# You can specify a literal path for the `splits_file_path`, or if another `chebi_class` instance is already defined,\n", + "# you can use its existing `splits_file_path` attribute for consistency.\n", + "chebi_class_with_splits = ChEBIOver50(\n", + " chebi_version=231,\n", + " # splits_file_path=\"data/chebi_v231/ChEBI50/processed/splits.csv\", # Literal path option\n", + " splits_file_path=chebi_class.splits_file_path, # Use path from an existing `chebi_class` instance\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a5eb482c-ce5b-4efc-b2ec-85ac7b1a78ee", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "ab110764-216d-4d52-a9d1-4412c8ac8c9d", + "metadata": {}, + "source": [ + "# 5. Example Molecule: Different Encodings\n", + "\n", + "The `chebai` library supports various encodings for molecules, such as SMILES and SELFIES. In this section, we'll take the example of **benzene** (C₆H₆) and explore its different encodings.\n", + "\n", + "### Overview of Chemical Encodings:\n", + "- **SMILES (Simplified Molecular Input Line Entry System)**: A linear notation for representing molecular structures.\n", + "- **SELFIES (SELF-referencIng Embedded Strings)**: A robust encoding capable of representing a broader range of chemical structures.\n", + "\n", + "### Tokenization and Encoding\n", + "\n", + "To tokenize and numerically encode this chemical encodings, we use specific reader classes, mainly:\n", + "- **ChemDataReader**: For SMILES encoding.\n", + "- **SelfiesReader**: For SELFIES encoding.\n", + "\n", + "There are other implementations too for different variants, you can check out more in the below link.
\n", + "You can explore the implementation of these readers in the source code [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/reader.py).\n", + "\n", + "> **Note**: The library uses an `EMBEDDING_OFFSET` of 10 for encoding purposes." + ] + }, + { + "cell_type": "markdown", + "id": "2fa606c5-4d8f-4ca0-89a6-d60f15afe297", + "metadata": {}, + "source": [ + "### 1. **SMILES (Simplified Molecular Input Line Entry System)**\n", + " - **Benzene SMILES**: `c1ccccc1`\n", + " - **Explanation**: \n", + " - The string `c1ccccc1` represents a six-membered aromatic ring, where lowercase `c` indicates aromatic carbon atoms.\n", + " - This encoding provides a compact, human-readable format for molecular structures.\n", + "\n", + "The `ChemDataReader` class is used for SMILES encoding. SMILES tokenization is performed using the `_tokenize` function from the [`pysmiles.read_smiles`](https://github.com/pckroon/pysmiles/blob/master/pysmiles/read_smiles.py) module." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "da47d47e-4560-46af-b246-235596f27d82", + "metadata": {}, + "outputs": [], + "source": [ + "from chebai.preprocessing.reader import ChemDataReader" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "8bdbf309-29ec-4aab-a6dc-9e09bc6961a2", + "metadata": {}, + "outputs": [], + "source": [ + "chem_dr = ChemDataReader()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "68e5c87c-79c3-4d5f-91e6-635399a84d3d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[41, 42, 41, 41, 41, 41, 41, 42]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chem_dr._read_data(\"c1ccccc1\")" + ] + }, + { + "cell_type": "markdown", + "id": "5b7211ee-2ccc-46d3-8e8f-790f344726ba", + "metadata": {}, + "source": [ + "The numbers mentioned above refer to the index of each individual token from the [`tokens.txt`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/bin/smiles_token/tokens.txt) file, which is used by the `ChemDataReader` class. \n", + "\n", + "Each token in the `tokens.txt` file corresponds to a specific symbol or structure in the SMILES encoding, and these tokens are referenced by their index. Additionally, the index values are offset by the `EMBEDDING_OFFSET`, ensuring that the token embeddings are adjusted appropriately during processing." + ] + }, + { + "cell_type": "markdown", + "id": "6f79f0ee-a5d7-427b-b4ac-4a848307917b", + "metadata": {}, + "source": [ + "### 2. **SELFIES (SELF-referencIng Embedded Strings)**\n", + " - **Benzene SELFIES**: `[C][=C][C][=C][C][=C]`\n", + " - **Explanation**: \n", + " - Each `[C]` represents a carbon atom, and `[=C]` represents a carbon atom with a double bond.\n", + " - SELFIES encodes the alternating single and double bonds in benzene's aromatic ring.\n", + "\n", + "The `SelfiesReader` class is used for SELFIES encoding. SELFIES encoding and tokenization are performed using the `encoder` and `split_selfies` functions from the [`selfies`](https://github.com/aspuru-guzik-group/selfies) library.\n", + "\n", + "In the `_read_data` method of `SelfiesReader`, the following steps are carried out:\n", + " 1. The `encoder` function converts the SMILES notation into the SELFIES format.\n", + " 2. The `split_selfies` function then tokenizes the SELFIES string into individual tokens for further processing." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "b23a423e-9447-46e1-a08c-ba164c6877d2", + "metadata": {}, + "outputs": [], + "source": [ + "from chebai.preprocessing.reader import SelfiesReader" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "7408f7c9-0204-444c-b51e-79dc1fcbf497", + "metadata": {}, + "outputs": [], + "source": [ + "selfies_dr = SelfiesReader()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "b337cef0-f93e-43f8-81ed-def1f5cdeb38", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[25, 29, 25, 29, 25, 29, 30, 32]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "selfies_dr._read_data(\"c1ccccc1\")" + ] + }, + { + "cell_type": "markdown", + "id": "850f4557-7a2e-4c86-a81e-3a41f7a57c12", + "metadata": {}, + "source": [ + "The numbers mentioned above refer to the index of each individual token from the [`tokens.txt`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/bin/selfies/tokens.txt) file, which is used by the `SelfiesReader` class. \n", + "\n", + "Each token in the `tokens.txt` file corresponds to a specific symbol or structure in the SELFIES encoding, and these tokens are referenced by their index. Additionally, the index values are offset by the `EMBEDDING_OFFSET`, ensuring that the token embeddings are adjusted appropriately during processing." + ] + }, + { + "cell_type": "markdown", + "id": "93e328cf-09f9-4694-b175-28320590937d", + "metadata": {}, + "source": [ + "---" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python (env_chebai)", + "language": "python", + "name": "env_chebai" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tutorials/data_exploration_go.ipynb b/tutorials/data_exploration_go.ipynb new file mode 100644 index 00000000..6f67c82b --- /dev/null +++ b/tutorials/data_exploration_go.ipynb @@ -0,0 +1,1341 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "da687d32ba48b188", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "This notebook serves as a guide for new developers using the `chebai` package. If you just want to run the experiments, you can refer to the [README.md](https://github.com/ChEB-AI/python-chebai/blob/dev/README.md) and the [wiki](https://github.com/ChEB-AI/python-chebai/wiki) for the basic commands. This notebook explains what happens under the hood for the GO-UniProt dataset. It covers\n", + "- how to instantiate a data class and generate data\n", + "- how the data is processed and stored\n", + "- and how to work with different protein encodings.\n", + "\n", + "The chebai package simplifies the handling of these datasets by **automatically creating** them as needed. This means that you do not have to input any data manually; the package will generate and organize the data files based on the parameters and encodings selected. This feature ensures that the right data is available and formatted properly. You can however provide your own data files, for instance if you want to replicate a specific experiment.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "id": "0bd07c91-bb02-48d4-b759-aa35ecb224bd", + "metadata": {}, + "source": [ + "# 1. 
Instantiation of a Data Class\n", + "\n", + "To start working with `chebai`, you first need to instantiate a GO-UniProt data class. This class is responsible for managing, interacting with, and preprocessing the GO and UniProt data" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "a4d590fb-9a83-456e-9cb4-303caa8203e8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Already in the project root directory: G:\\github-aditya0by0\\python-chebai\n" + ] + } + ], + "source": [ + "# To run this notebook, you need to change the working directory of the jupyter notebook to root dir of the project.\n", + "import os\n", + "\n", + "# Root directory name of the project\n", + "expected_root_dir = \"python-chebai\"\n", + "\n", + "# Check if the current directory ends with the expected root directory name\n", + "if not os.getcwd().endswith(expected_root_dir):\n", + " os.chdir(\"..\") # Move up one directory level\n", + " if os.getcwd().endswith(expected_root_dir):\n", + " print(\"Changed to project root directory:\", os.getcwd())\n", + " else:\n", + " print(\"Warning: Directory change unsuccessful. Current directory:\", os.getcwd())\n", + "else:\n", + " print(\"Already in the project root directory:\", os.getcwd())" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "440f203ceaf7e4b7", + "metadata": { + "ExecuteTime": { + "end_time": "2024-09-30T21:25:03.920610Z", + "start_time": "2024-09-30T21:25:03.622407Z" + } + }, + "outputs": [], + "source": "from chebai.preprocessing.datasets.go_uniprot import GOUniProtOver250" + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "a648346d81d0dc5e", + "metadata": { + "ExecuteTime": { + "end_time": "2024-09-30T21:25:08.863132Z", + "start_time": "2024-09-30T21:25:08.387739Z" + } + }, + "outputs": [], + "source": [ + "go_class = GOUniProtOver250(go_branch=\"BP\")" + ] + }, + { + "cell_type": "markdown", + "id": "64585012b0d7f66f", + "metadata": {}, + "source": [ + "### Inheritance Hierarchy\n", + "\n", + "GO_UniProt data classes inherit from [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), which in turn inherits from [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22). Specifically:\n", + "\n", + "- **`_DynamicDataset`**: This class serves as an intermediate base class that provides additional functionality or customization for datasets that require dynamic behavior. It inherits from `XYBaseDataModule`, which provides the core methods for data loading and processing.\n", + "\n", + "- **`XYBaseDataModule`**: This is the base class for data modules, providing foundational properties and methods for handling and processing datasets, including data splitting, loading, and preprocessing.\n", + "\n", + "In summary, GO_UniProt data classes are designed to manage and preprocess chemical data effectively by leveraging the capabilities provided by `XYBaseDataModule` through the `_DynamicDataset` intermediary.\n", + "\n", + "\n", + "### Configuration Parameters\n", + "\n", + "Data classes related to proteins can be configured using the following main parameters:\n", + "\n", + "- **`go_branch (str)`**: The Gene Ontology (GO) branch. 
The default value is `\"all\"`, which includes all branches of GO in the dataset.\n", + "  - **`\"BP\"`**: Biological Process branch.\n", + "  - **`\"MF\"`**: Molecular Function branch.\n", + "  - **`\"CC\"`**: Cellular Component branch.\n", + "\n", + "  Restricting the dataset to a single branch allows for more specific datasets focused on a particular aspect of gene function.\n", + "\n", + "- **`max_sequence_length (int)`**: Specifies the maximum allowed sequence length for a protein, with a default of `1002`. During data preprocessing, any proteins exceeding this length will be excluded from further processing.\n", + "\n", + "- **`splits_file_path (str, optional)`**: Path to a CSV file containing data splits. If not provided, the class will handle splits internally. The default is `None`.\n", + "\n", + "### Additional Input Parameters\n", + "\n", + "To get more control over various aspects of data loading, processing, and splitting, you can refer to the documentation of additional parameters in the docstrings of the respective classes: [`_GOUniProtDataExtractor`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/go_uniprot.py#L33), [`XYBaseDataModule`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L22), [`_DynamicDataset`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/base.py#L597), etc.\n", + "\n", + "\n", + "# Available Data Classes\n", + "\n", + "__Note__: Check the code implementation of the classes [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/datasets/go_uniprot.py).\n", + "\n", + "There is a range of available GO-UniProt dataset classes. Usually, you want to use `GOUniProtOver250` or `GOUniProtOver50`. Both inherit from `_GOUniProtOverX`. The number indicates the threshold for selecting label classes. The selection process is based on the annotations of the GO terms with their ancestors across the dataset. For instance, `GOUniProtOver50` will only select labels which have at least 50 samples in the dataset.\n", + "\n", + "Refer to the `select_classes` method of `_GOUniProtOverX` for more details on the selection process.\n", + "\n", + "If you need a different threshold, you can create your own subclass (a sketch follows below)." + ] + }, + { + "cell_type": "markdown", + "id": "651ab5c39833bd2c", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "a52b4363-7398-44aa-a4cc-8bba14bdd966", + "metadata": {}, + "source": [ + "# 2. Preparation / Setup Methods\n", + "\n", + "Once a GOUniProt data class instance is created, it typically requires preparation before use. This step generates the actual dataset."
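, + "\n", + "As noted above, a custom threshold only needs a small subclass of `_GOUniProtOverX`, by analogy with `GOUniProtOver250` and `GOUniProtOver50`. A minimal sketch (the attribute name `THRESHOLD` is an assumption based on that naming pattern; check `_GOUniProtOverX` in the linked source for the actual interface):\n", + "\n", + "```python\n", + "from chebai.preprocessing.datasets.go_uniprot import _GOUniProtOverX\n", + "\n", + "\n", + "class GOUniProtOver100(_GOUniProtOverX):\n", + "    # Assumed class attribute, mirroring the Over250 / Over50 pattern.\n", + "    THRESHOLD: int = 100\n", + "\n", + "\n", + "go_100 = GOUniProtOver100(go_branch=\"MF\")\n", + "```"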
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "9f77351090560bc4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checking for processed data in data\\GO_UniProt\\GO250_BP_1002\\processed\n", + "Missing processed data file (`data.pkl` file)\n", + "Downloading Swiss UniProt data....\n", + "Downloading to temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmp7pp677ik\n", + "Downloaded to C:\\Users\\HP\\AppData\\Local\\Temp\\tmp7pp677ik\n", + "Unzipping the file....\n", + "Unpacked and saved to data\\GO_UniProt\\raw\\uniprot_sprot.dat\n", + "Removed temporary file C:\\Users\\HP\\AppData\\Local\\Temp\\tmp7pp677ik\n", + "Missing Gene Ontology raw data\n", + "Downloading Gene Ontology data....\n", + "Extracting class hierarchy...\n", + "Compute transitive closure\n", + "Processing graph\n", + "Parsing swiss uniprot raw data....\n", + "Selecting GO terms based on given threshold: 250 ...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Check for processed data in data\\GO_UniProt\\GO250_BP_1002\\processed\\protein_token\n", + "Cross-validation enabled: False\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Missing transformed data (`data.pt` file). Transforming data.... \n", + "Processing 53604 lines...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|███████████████████████████████████████████████████████████████████████████| 53604/53604 [01:18<00:00, 678.84it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saving 20 tokens to G:\\github-aditya0by0\\python-chebai\\chebai\\preprocessing\\bin\\protein_token\\tokens.txt...\n", + "First 10 tokens: ['M', 'S', 'I', 'G', 'A', 'T', 'R', 'L', 'Q', 'N']\n" + ] + } + ], + "source": [ + "go_class.prepare_data()\n", + "go_class.setup()" + ] + }, + { + "cell_type": "markdown", + "id": "2328e824c4dafb2d", + "metadata": {}, + "source": [ + "### Automatic Execution: \n", + "These methods are executed automatically within the data class instance. Users do not need to call them explicitly, as the code internally manages the preparation and setup of data, ensuring that it is ready for subsequent use in training and validation processes.\n", + "\n", + "\n", + "### Why is Preparation Needed?\n", + "\n", + "- **Data Availability**: The preparation step ensures that the required GOUniProt data files are downloaded or loaded, which are essential for analysis.\n", + "- **Data Integrity**: It ensures that the data files are transformed into a compatible format required for model input.\n", + "\n", + "### Main Methods for Data Preprocessing\n", + "\n", + "The data preprocessing in a data class involves two main methods:\n", + "\n", + "1. **`prepare_data` Method**:\n", + " - **Purpose**: This method checks for the presence of raw data in the specified directory. If the raw data is missing, it fetches the ontology, creates a dataframe, and saves it to a file (`data.pkl`). The dataframe includes columns such as IDs, data representations, and labels.\n", + " - **Documentation**: [PyTorch Lightning - `prepare_data`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#prepare-data)\n", + "\n", + "2. **`setup` Method**:\n", + " - **Purpose**: This method sets up the data module for training, validation, and testing. It checks for the processed data and, if necessary, performs additional setup to ensure the data is ready for model input. 
It also handles cross-validation settings if enabled.\n", + "   - **Description**: Transforms `data.pkl` into a model input data format (`data.pt`), ensuring that the data is in a format compatible with model input. The transformed data contains the following keys: `ident`, `features`, `labels`, and `group`. This method uses a subclass of Data Reader to perform the transformation.\n", + "   - **Documentation**: [PyTorch Lightning - `setup`](https://lightning.ai/docs/pytorch/stable/data/datamodule.html#setup)\n", + "\n", + "These methods ensure that the data is correctly prepared and set up for subsequent use in training and validation processes." + ] + }, + { + "cell_type": "markdown", + "id": "db5b58f2d96823fc", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "ee174b61b36c71aa", + "metadata": {}, + "source": [ + "# 3. Overview of the 3 preprocessing stages\n", + "\n", + "The `chebai` library follows a three-stage preprocessing pipeline, which is reflected in its file structure:\n", + "\n", + "1. **Raw Data Stage**:\n", + "   - **File**: `go-basic.obo` and `uniprot_sprot.dat`\n", + "   - **Description**: This stage contains the raw GO ontology data and raw Swiss-UniProt data, serving as the initial input for further processing.\n", + "   - **File Paths**:\n", + "     - `data/GO_UniProt/raw/go-basic.obo`\n", + "     - `data/GO_UniProt/raw/uniprot_sprot.dat`\n", + "\n", + "2. **Processed Data Stage 1**:\n", + "   - **File**: `data.pkl`\n", + "   - **Description**: This stage includes the data after initial processing. It contains sequence strings, class columns, and metadata but lacks data splits.\n", + "   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/data.pkl`\n", + "   - **Additional File**: `classes.txt` - A file listing the relevant GO classes.\n", + "\n", + "3. **Processed Data Stage 2**:\n", + "   - **File**: `data.pt`\n", + "   - **Description**: This final stage includes the encoded data in a format compatible with PyTorch, ready for model input. This stage also references data splits when available.\n", + "   - **File Path**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}/data.pt`\n", + "   - **Additional File**: `splits.csv` - Contains saved splits for reproducibility.\n", + "\n", + "**Note**: If `go_branch` is specified, the `dataset_name` will include the branch name in the format `${dataset_name}_${go_branch}`. Otherwise, it will just be `${dataset_name}`.\n", + "\n", + "### Summary of File Paths\n", + "\n", + "- **Raw Data**: `data/GO_UniProt/raw`\n", + "- **Processed Data 1**: `data/GO_UniProt/${dataset_name}/processed`\n", + "- **Processed Data 2**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}`\n", + "\n", + "This structured approach to data management ensures that each stage of data processing is well-organized and documented, from raw data acquisition to the preparation of model-ready inputs. 
It also facilitates reproducibility and traceability across different experiments.\n", + "\n", + "### Data Splits\n", + "\n", + "- **Creation**: Data splits are generated dynamically \"on the fly\" during training and evaluation to ensure flexibility and adaptability to different tasks.\n", + "- **Reproducibility**: To maintain consistency across different runs, splits can be reproduced by comparing hashes with a fixed seed value.\n" + ] + }, + { + "cell_type": "markdown", + "id": "a927ad484c930960", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "3f92b58e460c08fd", + "metadata": {}, + "source": [ + "# 4. Data Files and their structure\n", + "\n", + "`chebai` creates and manages several data files during its operation. These files store various chemical data and metadata essential for different tasks. Let’s explore these files and their content.\n" + ] + }, + { + "cell_type": "markdown", + "id": "cca75d881cb8bade", + "metadata": {}, + "source": [ + "## go-basic.obo File\n", + "\n", + "**Description**: The `go-basic.obo` file is a key resource in the Gene Ontology (GO) dataset, containing the ontology data that defines various biological processes, molecular functions, and cellular components, as well as their relationships. This file is downloaded directly from the Gene Ontology Consortium and serves as the foundational raw data for further processing in GO-based applications.\n", + "\n", + "#### Example of a Term Document\n", + "\n", + "```plaintext\n", + "[Term]\n", + "id: GO:0000032\n", + "name: cell wall mannoprotein biosynthetic process\n", + "namespace: biological_process\n", + "def: \"The chemical reactions and pathways resulting in the formation of cell wall mannoproteins, any cell wall protein that contains covalently bound mannose residues.\" [GOC:ai]\n", + "synonym: \"cell wall mannoprotein anabolism\" EXACT []\n", + "is_a: GO:0006057 ! mannoprotein biosynthetic process\n", + "is_a: GO:0031506 ! cell wall glycoprotein biosynthetic process\n", + "```\n", + "\n", + "**File Path**: `data/GO_UniProt/raw/go-basic.obo`\n", + "\n", + "### Structure of `go-basic.obo`\n", + "\n", + "The `go-basic.obo` file is organized into blocks of text known as \"term documents.\" Each block starts with a `[Term]` header and contains various attributes that describe a specific biological process, molecular function, or cellular component within the GO ontology. These attributes include identifiers, names, relationships to other terms, and more.\n", + "\n", + "\n", + "\n", + "### Breakdown of Attributes\n", + "\n", + "Each term document in the `go-basic.obo` file consists of the following key attributes:\n", + "\n", + "- **`[Term]`**: \n", + " - **Description**: Indicates the beginning of a new term in the ontology. 
Each term represents a distinct biological process, molecular function, or cellular component.\n", + "\n", + "- **`id: GO:0000032`**: \n", + "  - **Description**: A unique identifier for the biological term within the GO ontology.\n", + "  - **Example**: `GO:0000032` refers to the term \"cell wall mannoprotein biosynthetic process.\"\n", + "\n", + "- **`name: cell wall mannoprotein biosynthetic process`**: \n", + "  - **Description**: The name of the biological process, molecular function, or cellular component being described.\n", + "  - **Example**: The name \"cell wall mannoprotein biosynthetic process\" is a descriptive label for the GO term with the identifier `GO:0000032`.\n", + "\n", + "- **`namespace: biological_process`**: \n", + "  - **Description**: Specifies which ontology the term belongs to. The main namespaces are `biological_process`, `molecular_function`, and `cellular_component`.\n", + "\n", + "- **`is_a: GO:0006057`**: \n", + "  - **Description**: Defines hierarchical relationships to other terms within the ontology. The `is_a` attribute indicates that the current term is a subclass or specific instance of the referenced term.\n", + "  - **Example**: The term `GO:0000032` (\"cell wall mannoprotein biosynthetic process\") is a subclass of both `GO:0006057` and `GO:0031506`.\n" + ] + }, + { + "cell_type": "markdown", + "id": "87c841de7d80beef", + "metadata": {}, + "source": [ + "## uniprot_sprot.dat File\n", + "\n", + "**Description**: The `uniprot_sprot.dat` file is a key component of the UniProtKB/Swiss-Prot dataset. It contains curated protein sequences with detailed annotations. Each entry in the file corresponds to a reviewed protein sequence, complete with metadata about its biological function, taxonomy, gene name, cross-references to other databases, and more. 
Below is a breakdown of the structure and key attributes in the file, using the provided example.\n", + "\n", + "\n", + "### Example of a Protein Entry\n", + "\n", + "```plaintext\n", + "ID 002L_FRG3G Reviewed; 320 AA.\n", + "AC Q6GZX3;\n", + "DT 28-JUN-2011, integrated into UniProtKB/Swiss-Prot.\n", + "DT 19-JUL-2004, sequence version 1.\n", + "DT 08-NOV-2023, entry version 46.\n", + "DE RecName: Full=Uncharacterized protein 002L;\n", + "GN ORFNames=FV3-002L;\n", + "OS Frog virus 3 (isolate Goorha) (FV-3).\n", + "OC Viruses; Varidnaviria; Bamfordvirae; Nucleocytoviricota; Megaviricetes;\n", + "OX NCBI_TaxID=654924;\n", + "OH NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).\n", + "RN [1]\n", + "RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].\n", + "RX PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;\n", + "RA Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;\n", + "RT \"Comparative genomic analyses of frog virus 3, type species of the genus\n", + "RT Ranavirus (family Iridoviridae).\";\n", + "RL Virology 323:70-84(2004).\n", + "CC -!- SUBCELLULAR LOCATION: Host membrane {ECO:0000305}; Single-pass membrane\n", + "CC protein {ECO:0000305}.\n", + "DR EMBL; AY548484; AAT09661.1; -; Genomic_DNA.\n", + "DR RefSeq; YP_031580.1; NC_005946.1.\n", + "DR GeneID; 2947774; -.\n", + "DR KEGG; vg:2947774; -.\n", + "DR Proteomes; UP000008770; Segment.\n", + "DR GO; GO:0033644; C:host cell membrane; IEA:UniProtKB-SubCell.\n", + "DR GO; GO:0016020; C:membrane; IEA:UniProtKB-KW.\n", + "PE 4: Predicted;\n", + "KW Host membrane; Membrane; Reference proteome; Transmembrane;\n", + "KW Transmembrane helix.\n", + "FT CHAIN 1..320\n", + "FT /note=\"Uncharacterized protein 002L\"\n", + "FT /id=\"PRO_0000410509\"\n", + "SQ SEQUENCE 320 AA; 34642 MW; 9E110808B6E328E0 CRC64;\n", + " MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR\n", + " IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL\n", + " AERYCMRGVK NTAGELVSRV SSDADPAGGW CRKWYSAHRG PDQDAALGSF CIKNPGAADC\n", + " KCINRASDPV YQKVKTLHAY PDQCWYVPCA ADVGELKMGT QRDTPTNCPT QVCQIVFNML\n", + " DDGSVTMDDV KNTINCDFSK YVPPPPPPKP TPPTPPTPPT PPTPPTPPTP PTPRPVHNRK\n", + " VMFFVAGAVL VAILISTVRW\n", + "//\n", + "```\n", + "\n", + "**File Path**: `data/GO_UniProt/raw/uniprot_sprot.dat`\n", + "\n", + "\n", + "## Structure of `uniprot_sprot.dat`\n", + "\n", + "The `uniprot_sprot.dat` file is organized into blocks of text, each representing a single protein entry. These blocks contain specific tags and fields that describe different aspects of the protein, including its sequence, function, taxonomy, and cross-references to external databases.\n", + "\n", + "### Breakdown of Attributes\n", + "\n", + "Each protein entry in the `uniprot_sprot.dat` file is structured with specific tags and sections that describe the protein in detail. 
Here's a breakdown of the key attributes:\n",
+    "\n",
+    "- **`ID`**: \n",
+    "  - **Description**: Contains the unique identifier for the protein and its status (e.g., `Reviewed` indicates the sequence has been manually curated).\n",
+    "  - **Example**: `002L_FRG3G` is the identifier for the protein from Frog virus 3.\n",
+    "\n",
+    "- **`AC`**: \n",
+    "  - **Description**: Accession number, a unique identifier for the protein sequence.\n",
+    "  - **Example**: `Q6GZX3` is the accession number for this entry.\n",
+    "\n",
+    "- **`DR`**: \n",
+    "  - **Description**: Cross-references to other databases like EMBL, RefSeq, KEGG, and GeneID.\n",
+    "  - **Example**: This entry is cross-referenced with the EMBL database, RefSeq, GO, etc.\n",
+    "\n",
+    "- **`GO`**: \n",
+    "  - **Description**: Gene Ontology annotations that describe the cellular component, biological process, or molecular function associated with the protein.\n",
+    "  - **Example**: The protein is associated with the GO terms `GO:0033644` (host cell membrane) and `GO:0016020` (membrane).\n",
+    "\n",
+    "- **`SQ`**: \n",
+    "  - **Description**: The amino acid sequence of the protein.\n",
+    "  - **Example**: The sequence consists of 320 amino acids.\n",
+    "\n",
+    "__Note__: For more detailed information, refer to the [UniProt documentation](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/keywlist.txt).\n",
+    "\n",
+    "Consider the following line from the example above: \n",
+    "```plaintext\n",
+    "DR GO; GO:0033644; C:host cell membrane; IEA:UniProtKB-SubCell.\n",
+    "```\n",
+    "\n",
+    "The line contains a **Gene Ontology (GO) annotation** describing the protein's subcellular location. Here's a detailed breakdown:\n",
+    "\n",
+    "- **`GO:0033644`**: This is the specific **GO term** identifier for \"host cell membrane,\" which indicates that the protein is associated with or located at the membrane of the host cell.\n",
+    "\n",
+    "- **`C:`**: The prefix in front of the term name indicates the GO namespace: `C` stands for cellular component, `F` for molecular function, and `P` for biological process.\n",
+    "\n",
+    "- **`IEA`**: This stands for **Inferred from Electronic Annotation**, which is part of the **GO Evidence Codes**. **IEA** indicates that the annotation was automatically generated based on computational methods rather than direct experimental evidence. While **IEA** annotations are useful, they are generally considered less reliable than manually curated or experimentally verified evidence codes.\n",
+    "\n",
+    "__Note__: For more details on evidence codes, see Section 5.2."
+   ]
+  },
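+  {
+   "cell_type": "markdown",
+   "id": "5b8d41aa-uniprot-parse-md",
+   "metadata": {},
+   "source": [
+    "To make the flat-file layout concrete, the next cell is a minimal, self-contained parsing sketch (illustrative only; it is not the preprocessing that chebai performs internally). It pulls the protein ID, accession, GO cross-references, and sequence out of an abridged copy of the entry shown above:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5b8d41aa-uniprot-parse-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal parsing sketch -- illustrative only, not chebai's internal preprocessing.\n",
+    "# The entry below is an abridged copy of the example above.\n",
+    "entry = \"\"\"ID   002L_FRG3G   Reviewed; 320 AA.\n",
+    "AC   Q6GZX3;\n",
+    "DR   GO; GO:0033644; C:host cell membrane; IEA:UniProtKB-SubCell.\n",
+    "DR   GO; GO:0016020; C:membrane; IEA:UniProtKB-KW.\n",
+    "SQ   SEQUENCE 320 AA; 34642 MW; 9E110808B6E328E0 CRC64;\n",
+    "     MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR\n",
+    "//\n",
+    "\"\"\"\n",
+    "\n",
+    "protein = {\"go_ids\": [], \"sequence\": \"\"}\n",
+    "in_sequence = False\n",
+    "for line in entry.splitlines():\n",
+    "    code, value = line[:2], line[5:]  # two-letter line code + content\n",
+    "    if code == \"ID\":\n",
+    "        protein[\"id\"] = value.split()[0]\n",
+    "    elif code == \"AC\":\n",
+    "        protein[\"accession\"] = value.rstrip(\";\")  # real entries may list several accessions\n",
+    "    elif code == \"DR\" and value.startswith(\"GO;\"):\n",
+    "        protein[\"go_ids\"].append(value.split(\";\")[1].strip())\n",
+    "    elif code == \"SQ\":\n",
+    "        in_sequence = True\n",
+    "    elif code == \"//\":\n",
+    "        in_sequence = False\n",
+    "    elif in_sequence and code == \"  \":\n",
+    "        protein[\"sequence\"] += value.replace(\" \", \"\")\n",
+    "\n",
+    "print(protein)"
+   ]
+  },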
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
swiss_idaccessiongo_idssequence4175122165209226...1990778200002620001452000146200014720002412000243200114120012332001234
111S1_CARILB5KVH4[3006, 8150, 9791, 10431, 21700, 22414, 32501,...MAKPILLSIYLCLIIVALFNGCLAQSGGRQQHKFGQCQLNRLDALE...FalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
311S2_SESINQ9XHP0[3006, 8150, 10431, 21700, 22414, 32502, 48609]MVAFKFLLALSLSLLVSAAIAQTREPRLTQGQQCRFQRISGAQPSL...FalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
614310_ARATHP48347,Q9LME5[7165, 8150, 9742, 9755, 9987, 43401, 50789, 5...MENEREKQVYLAKLSEQTERYDEMVEAMKKVAQLDVELTVEERNLV...FalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
814331_ARATHP42643,Q945M2,Q9M0S7[8150, 19222, 50789, 65007]MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT...FalseFalseFalseFalseFalseFalse...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
914331_CAEELP41932,Q21537[132, 226, 1708, 6611, 6810, 6886, 6913, 6950,...MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL...FalseFalseFalseFalseFalseTrue...FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", + "

5 rows × 902 columns

\n", + "
" + ], + "text/plain": [ + " swiss_id accession \\\n", + "1 11S1_CARIL B5KVH4 \n", + "3 11S2_SESIN Q9XHP0 \n", + "6 14310_ARATH P48347,Q9LME5 \n", + "8 14331_ARATH P42643,Q945M2,Q9M0S7 \n", + "9 14331_CAEEL P41932,Q21537 \n", + "\n", + " go_ids \\\n", + "1 [3006, 8150, 9791, 10431, 21700, 22414, 32501,... \n", + "3 [3006, 8150, 10431, 21700, 22414, 32502, 48609] \n", + "6 [7165, 8150, 9742, 9755, 9987, 43401, 50789, 5... \n", + "8 [8150, 19222, 50789, 65007] \n", + "9 [132, 226, 1708, 6611, 6810, 6886, 6913, 6950,... \n", + "\n", + " sequence 41 75 122 \\\n", + "1 MAKPILLSIYLCLIIVALFNGCLAQSGGRQQHKFGQCQLNRLDALE... False False False \n", + "3 MVAFKFLLALSLSLLVSAAIAQTREPRLTQGQQCRFQRISGAQPSL... False False False \n", + "6 MENEREKQVYLAKLSEQTERYDEMVEAMKKVAQLDVELTVEERNLV... False False False \n", + "8 MATPGASSARDEFVYMAKLAEQAERYEEMVEFMEKVAKAVDKDELT... False False False \n", + "9 MSDTVEELVQRAKLAEQAERYDDMAAAMKKVTEQGQELSNEERNLL... False False False \n", + "\n", + " 165 209 226 ... 1990778 2000026 2000145 2000146 2000147 \\\n", + "1 False False False ... False False False False False \n", + "3 False False False ... False False False False False \n", + "6 False False False ... False False False False False \n", + "8 False False False ... False False False False False \n", + "9 False False True ... False False False False False \n", + "\n", + " 2000241 2000243 2001141 2001233 2001234 \n", + "1 False False False False False \n", + "3 False False False False False \n", + "6 False False False False False \n", + "8 False False False False False \n", + "9 False False False False False \n", + "\n", + "[5 rows x 902 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pkl_df = pd.DataFrame(\n", + " pd.read_pickle(\n", + " os.path.join(\n", + " go_class.processed_dir_main,\n", + " go_class.processed_dir_main_file_names_dict[\"data\"],\n", + " )\n", + " )\n", + ")\n", + "print(\"Size of the data (rows x columns): \", pkl_df.shape)\n", + "pkl_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "735844f0b2474ad6", + "metadata": {}, + "source": [ + "**File Path**: `data/GO_UniProt/${dataset_name}/processed/data.pkl`\n", + "\n", + "\n", + "### Structure of `data.pkl`\n", + "`data.pkl` as following structure: \n", + "- **Column 0**: Contains the Identifier from Swiss-UniProt Dataset for each Swiss Protein data instance.\n", + "- **Column 1**: Contains the accession of each Protein data instance.\n", + "- **Column 2**: Contains the list of GO-IDs (Identifiers from Gene Ontology) which maps each Swiss Protein to the Gene Ontology instance.\n", + "- **Column 3**: Contains the sequence representation for the Swiss Protein using Amino Acid notation.\n", + "- **Column 4 and onwards**: Contains the labels, starting from column 4.\n", + "\n", + "This structure ensures that the data is organized and ready for further processing, such as further encoding.\n" + ] + }, + { + "cell_type": "markdown", + "id": "2c9b17f6-93bd-4cc3-8967-7ab1d2e06e51", + "metadata": {}, + "source": [ + "## data.pt File\n", + "\n", + "**Description**: Generated by the `setup` method, this file contains encoded data in a format compatible with the PyTorch library. It includes keys such as `ident`, `features`, `labels`, and `group`, making it ready for model input." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "85b097601fb242d6", + "metadata": { + "ExecuteTime": { + "end_time": "2024-09-30T14:10:35.034002Z", + "start_time": "2024-09-30T14:10:35.018342Z" + } + }, + "outputs": [], + "source": [ + "import torch" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "289a54a71dec20fb", + "metadata": { + "ExecuteTime": { + "end_time": "2024-09-30T14:11:36.443693Z", + "start_time": "2024-09-30T14:11:34.199285Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Type of loaded data: \n", + "Content of the data file: \n", + " {'features': [10, 14, 21, 23, 12, 17, 17, 11, 12, 22, 17, 24, 17, 12, 12, 28, 14, 17, 25, 19, 13, 24, 17, 14, 18, 11, 13, 13, 16, 18, 18, 29, 21, 25, 13, 18, 24, 18, 17, 19, 16, 17, 20, 14, 17, 27, 23, 15, 19, 16, 12, 27, 14, 27, 14, 13, 28, 12, 27, 11, 26, 20, 23, 19, 29, 18, 18, 17, 18, 24, 14, 13, 28, 14, 28, 28, 16, 16, 15, 12, 27, 23, 19, 13, 17, 17, 17, 23, 29, 22, 11, 19, 14, 23, 18, 17, 28, 22, 12, 14, 16, 13, 16, 13, 12, 15, 13, 28, 17, 25, 23, 13, 24, 23, 27, 15, 25, 27, 27, 11, 18, 16, 18, 11, 18, 18, 13, 18, 16, 16, 27, 25, 18, 18, 20, 16, 29, 18, 21, 12, 16, 29, 25, 16, 27, 13, 20, 12, 12, 14, 25, 23, 14, 13, 28, 14, 29, 26, 24, 22, 19, 20, 13, 11, 11, 23, 28, 28, 14, 12, 25, 17, 17, 20, 15, 29, 19, 19, 14, 19, 18, 17, 20, 18, 19, 23, 16, 19, 25, 22, 17, 14, 13, 19, 23, 20, 20, 27, 25, 16, 23, 18, 13, 18, 18, 27, 22, 27, 18, 29, 16, 16, 18, 18, 18, 29, 18, 18, 16, 16, 13, 27, 29, 13, 27, 18, 18, 16, 20, 17, 13, 19, 19, 28, 25, 11, 13, 25, 20, 14, 27, 25, 17, 14, 20, 14, 25, 19, 28, 20, 15, 27, 15, 14, 16, 16, 17, 18, 11, 27, 19, 20, 29, 16, 13, 11, 12, 28, 16, 28, 27, 13, 16, 18, 17, 18, 28, 12, 16, 23, 16, 26, 11, 16, 27, 27, 18, 27, 29, 27, 27, 16, 21, 27, 16, 27, 16, 27, 16, 27, 11, 27, 11, 27, 16, 16, 18, 11, 16, 16, 13, 13, 16, 20, 20, 19, 13, 17, 27, 27, 15, 12, 24, 15, 17, 11, 17, 16, 27, 19, 12, 13, 20, 23, 11, 16, 14, 20, 12, 22, 15, 27, 27, 14, 13, 16, 12, 11, 15, 28, 19, 11, 29, 19, 17, 23, 12, 17, 16, 26, 17, 18, 17, 11, 14, 27, 16, 13, 14, 17, 22, 11, 20, 14, 17, 22, 28, 23, 29, 26, 19, 17, 19, 14, 29, 11, 28, 28, 22, 14, 17, 16, 13, 16, 14, 27, 28, 18, 28, 28, 20, 19, 25, 13, 18, 15, 28, 25, 20, 20, 27, 17, 16, 27, 13, 18, 17, 17, 15, 12, 23, 18, 19, 25, 14, 28, 28, 21, 16, 14, 16, 20, 27, 13, 25, 27, 26, 28, 11, 25, 21, 15, 19, 27, 19, 14, 10, 28, 11, 23, 17, 14, 13, 16, 15, 11, 14, 12, 16, 14, 17, 23, 27, 27, 28, 17, 28, 19, 14, 25, 18, 12, 23, 16, 27, 20, 14, 16, 16, 17, 21, 25, 19, 16, 18, 27, 11, 15, 17, 28, 16, 11, 16, 11, 16, 11, 11, 16, 11, 27, 16, 16, 14, 27, 28], 'labels': array([False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, True, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, 
False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, True, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, True, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, True, False, False, False, False, False, True,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, True, True, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, 
False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, True,\n", + " True, False, False, False, False, False, False, False, False,\n", + " True, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False]), 'ident': '11S1_CARIL', 'group': None}\n" + ] + } + ], + "source": [ + "data_pt = torch.load(\n", + " os.path.join(go_class.processed_dir, go_class.processed_file_names_dict[\"data\"]),\n", + " weights_only=False,\n", + ")\n", + "print(\"Type of loaded data:\", type(data_pt))\n", + "print(\"Content of the data file: \\n\", data_pt[0])" + ] + }, + { + "cell_type": "markdown", + "id": "2c9f23883c66b48d", + "metadata": {}, + "source": [ + "**File Path**: `data/GO_UniProt/${dataset_name}/processed/${reader_name}/data.pt`\n", + "\n", + "The `data.pt` file is a list where each element is a dictionary with the following keys:\n", + "\n", + "- **`features`**: \n", + " - **Description**: This key holds the input features for the model. 
The features are typically stored as tensors and represent the attributes used by the model for training and evaluation.\n",
+    "\n",
+    "- **`labels`**: \n",
+    "  - **Description**: This key contains the labels or target values associated with each instance. Labels are also stored as tensors and are used by the model to learn and make predictions.\n",
+    "\n",
+    "- **`ident`**: \n",
+    "  - **Description**: This key holds identifiers for each data instance. These identifiers help track and reference the individual samples in the dataset.\n",
+    "\n",
+    "- **`group`**: \n",
+    "  - **Description**: An optional grouping attribute for each data instance; for this dataset it is `None`, as the output above shows.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "36aed0b8-ab05-428d-8833-2a24deebacc3",
+   "metadata": {},
+   "source": [
+    "## classes.txt File\n",
+    "\n",
+    "**Description**: This file lists the GO classes that are used as labels. It can be used to match labels in `data.pt` with GO classes: for position `i` in the label tensor, the GO ID is in line `i` of `classes.txt`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "19200f7ff9a6ebba",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-09-30T21:30:34.344202Z",
+     "start_time": "2024-09-30T21:30:34.328318Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "41\n",
+      "75\n",
+      "122\n",
+      "165\n",
+      "209\n"
+     ]
+    }
+   ],
+   "source": [
+    "with open(os.path.join(go_class.processed_dir_main, \"classes.txt\"), \"r\") as file:\n",
+    "    for i in range(5):\n",
+    "        line = file.readline()\n",
+    "        print(line.strip())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f69012b3540fd1b6",
+   "metadata": {},
+   "source": [
+    "**File Path**: `data/GO_UniProt/${dataset_name}/processed/classes.txt`\n",
+    "\n",
+    "The `classes.txt` file lists the selected GO classes. These classes are chosen based on a specified threshold (e.g., a minimum number of annotated proteins). Each line in the file corresponds to a unique GO class ID, and the line number determines the position of that class in the label vector, as the sketch below demonstrates."
+   ]
+  },
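+  {
+   "cell_type": "markdown",
+   "id": "7c1d5e9a-classes-map-md",
+   "metadata": {},
+   "source": [
+    "As a minimal illustrative sketch (reusing the `go_class` and `data_pt` objects from the cells above), the positive labels of an encoded sample can be mapped back to GO class IDs via `classes.txt`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7c1d5e9a-classes-map-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative sketch: recover the GO class IDs behind the positive labels of\n",
+    "# the first encoded sample, using classes.txt as the position-to-ID mapping.\n",
+    "with open(os.path.join(go_class.processed_dir_main, \"classes.txt\"), \"r\") as f:\n",
+    "    class_ids = [line.strip() for line in f]\n",
+    "\n",
+    "sample = data_pt[0]\n",
+    "positive_ids = [class_ids[i] for i, is_pos in enumerate(sample[\"labels\"]) if is_pos]\n",
+    "print(sample[\"ident\"], \"has positive labels for the GO classes:\", positive_ids)"
+   ]
+  },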
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idsplit
014331_ARATHtrain
114331_CAEELtrain
214331_MAIZEtrain
314332_MAIZEtrain
414333_ARATHtrain
\n", + "
" + ], + "text/plain": [ + " id split\n", + "0 14331_ARATH train\n", + "1 14331_CAEEL train\n", + "2 14331_MAIZE train\n", + "3 14332_MAIZE train\n", + "4 14333_ARATH train" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "csv_df = pd.read_csv(os.path.join(go_class.processed_dir_main, \"splits.csv\"))\n", + "csv_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "6661dc11247e9753", + "metadata": {}, + "source": [ + "**File Path**: `data/GO_UniProt/${dataset_name}/processed/splits.csv`\n", + "\n", + "To reuse an existing split, you can use the `splits_file_path` argument. This way, you can reuse the same datasplit across several runs." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "2b02d8b4-c2de-4b8e-b680-ec67b40d9a30", + "metadata": {}, + "outputs": [], + "source": [ + "# You can specify a literal path for the `splits_file_path`, or if another `go_class` instance is already defined,\n", + "# you can use its existing `splits_file_path` attribute for consistency.\n", + "go_class_with_splits = GOUniProtOver250(\n", + " go_branch=\"BP\",\n", + " # splits_file_path=\"data/GO_UniProt/GO250_BP_1002/processed/splits.csv\", # Literal path option\n", + " splits_file_path=go_class.splits_file_path, # Use path from an existing `go_class` instance\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e6b1f184a5091b83", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "481b8c0271ec9636", + "metadata": {}, + "source": [ + "## 5.1 Protein Representation Using Amino Acid Sequence Notation\n", + "\n", + "Proteins are composed of chains of amino acids, and these sequences can be represented using a one-letter notation for each amino acid. This notation provides a concise way to describe the primary structure of a protein.\n", + "\n", + "### Example Protein Sequence\n", + "\n", + "Protein: **Lysozyme C** from **Gallus gallus** (Chicken). \n", + "[Lysozyme C - UniProtKB P00698](https://www.uniprot.org/uniprotkb/P00698/entry#function)\n", + "\n", + "- **Sequence**: `MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL`\n", + "- **Sequence Length**: 147\n", + "\n", + "In this sequence, each letter corresponds to a specific amino acid. This notation is widely used in bioinformatics and molecular biology to represent protein sequences.\n", + "\n", + "### Tokenization and Encoding\n", + "\n", + "To tokenize and numerically encode this protein sequence, the `ProteinDataReader` class is used. This class allows for n-gram tokenization, where the `n_gram` parameter defines the size of the tokenized units. If `n_gram` is not provided (default is `None`), each amino acid letter is treated as a single token.\n", + "\n", + "For more details, you can explore the implementation of the `ProteinDataReader` class in the source code [here](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/reader.py)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "e0cf4fb6-2ca4-4b85-a4e7-0cfbac5cd6c1", + "metadata": {}, + "outputs": [], + "source": [ + "from chebai.preprocessing.reader import ProteinDataReader" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "e8343d83-0be3-44df-9224-bba8d5c32336", + "metadata": {}, + "outputs": [], + "source": [ + "protein_dr_3gram = ProteinDataReader(n_gram=3)\n", + "protein_dr = ProteinDataReader()" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "8a18dc27-f308-4dde-b1ae-b03a20fb0d45", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[10, 16, 11, 17, 17, 12, 17, 28, 17, 24, 25, 17, 23, 17, 14, 14, 17, 13, 21]\n", + "[30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]\n" + ] + } + ], + "source": [ + "protein = \"MRSLLILVLCFLPLAALGK\"\n", + "print(protein_dr._read_data(protein))\n", + "print(protein_dr_3gram._read_data(protein))" + ] + }, + { + "cell_type": "markdown", + "id": "7e95738a-0b2d-4c56-ac97-f3b24c1de18f", + "metadata": {}, + "source": [ + "The numbers mentioned above refer to the index of each individual token from the [`tokens.txt`](https://github.com/ChEB-AI/python-chebai/blob/dev/chebai/preprocessing/bin/protein_token/tokens.txt) file, which is used by the `ProteinDataReader` class. \n", + "\n", + "Each token in the `tokens.txt` file corresponds to a specific amino-acid letter, and these tokens are referenced by their index. Additionally, the index values are offset by the `EMBEDDING_OFFSET`, ensuring that the token embeddings are adjusted appropriately during processing." + ] + }, + { + "cell_type": "markdown", + "id": "fd54ca4a-743c-496e-9e89-cff2d8226eb2", + "metadata": {}, + "source": [ + "### The 20 Amino Acids and Their One-Letter Notations\n", + "\n", + "Here is a list of the 20 standard amino acids, along with their one-letter notations and descriptions:\n", + "\n", + "| One-Letter Notation | Amino Acid Name | Description |\n", + "|---------------------|----------------------|---------------------------------------------------------|\n", + "| **A** | Alanine | Non-polar, aliphatic amino acid. |\n", + "| **C** | Cysteine | Polar, contains a thiol group, forms disulfide bonds. |\n", + "| **D** | Aspartic Acid | Acidic, negatively charged at physiological pH. |\n", + "| **E** | Glutamic Acid | Acidic, negatively charged at physiological pH. |\n", + "| **F** | Phenylalanine | Aromatic, non-polar. |\n", + "| **G** | Glycine | Smallest amino acid, non-polar. |\n", + "| **H** | Histidine | Polar, positively charged, can participate in enzyme active sites. |\n", + "| **I** | Isoleucine | Non-polar, aliphatic. |\n", + "| **K** | Lysine | Basic, positively charged at physiological pH. |\n", + "| **L** | Leucine | Non-polar, aliphatic. |\n", + "| **M** | Methionine | Non-polar, contains sulfur, start codon in mRNA translation. |\n", + "| **N** | Asparagine | Polar, uncharged. |\n", + "| **P** | Proline | Non-polar, introduces kinks in protein chains. |\n", + "| **Q** | Glutamine | Polar, uncharged. |\n", + "| **R** | Arginine | Basic, positively charged, involved in binding phosphate groups. |\n", + "| **S** | Serine | Polar, can be phosphorylated. |\n", + "| **T** | Threonine | Polar, can be phosphorylated. |\n", + "| **V** | Valine | Non-polar, aliphatic. |\n", + "| **W** | Tryptophan | Aromatic, non-polar, largest amino acid. |\n", + "| **Y** | Tyrosine | Aromatic, polar, can be phosphorylated. 
|\n", + "\n", + "### Understanding Protein Sequences\n", + "\n", + "In the example sequence, each letter represents one of the above amino acids. The sequence reflects the specific order of amino acids in the protein, which is critical for its structure and function.\n", + "\n", + "This notation is used extensively in various bioinformatics tools and databases to study protein structure, function, and interactions.\n", + "\n", + "\n", + "_Note_: Refer for amino acid sequence: https://en.wikipedia.org/wiki/Protein_primary_structure" + ] + }, + { + "cell_type": "markdown", + "id": "db6d7f2cc446e6f9", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "7f42b928364e5cd1", + "metadata": {}, + "source": [ + "## 5.2 More on GO Evidence Codes\n", + "\n", + "The **Gene Ontology (GO) Evidence Codes** provide a way to indicate the level of evidence supporting a GO annotation. Here's a list of the GO evidence codes with brief descriptions:\n", + "\n", + "| **Evidence Code** | **Description** |\n", + "|-----------------------|-----------------|\n", + "| **EXP** | [Inferred from Experiment (EXP)](http://wiki.geneontology.org/index.php/Inferred_from_Experiment_(EXP)) |\n", + "| **IDA** | [Inferred from Direct Assay (IDA)](http://wiki.geneontology.org/index.php/Inferred_from_Direct_Assay_(IDA)) |\n", + "| **IPI** | [Inferred from Physical Interaction (IPI)](http://wiki.geneontology.org/index.php/Inferred_from_Physical_Interaction_(IPI)) |\n", + "| **IMP** | [Inferred from Mutant Phenotype (IMP)](http://wiki.geneontology.org/index.php/Inferred_from_Mutant_Phenotype_(IMP)) |\n", + "| **IGI** | [Inferred from Genetic Interaction (IGI)](http://wiki.geneontology.org/index.php/Inferred_from_Genetic_Interaction_(IGI)) |\n", + "| **IEP** | [Inferred from Expression Pattern (IEP)](http://wiki.geneontology.org/index.php/Inferred_from_Expression_Pattern_(IEP)) |\n", + "| **HTP** | [Inferred from High Throughput Experiment (HTP)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Experiment_(HTP) ) |\n", + "| **HDA** | [Inferred from High Throughput Direct Assay (HDA)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Direct_Assay_(HDA)) |\n", + "| **HMP** | [Inferred from High Throughput Mutant Phenotype (HMP)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Mutant_Phenotype_(HMP)) |\n", + "| **HGI** | [Inferred from High Throughput Genetic Interaction (HGI)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Genetic_Interaction_(HGI)) |\n", + "| **HEP** | [Inferred from High Throughput Expression Pattern (HEP)](http://wiki.geneontology.org/index.php/Inferred_from_High_Throughput_Expression_Pattern_(HEP)) |\n", + "| **IBA** | [Inferred from Biological aspect of Ancestor (IBA)](http://wiki.geneontology.org/index.php/Inferred_from_Biological_aspect_of_Ancestor_(IBA)) |\n", + "| **IBD** | [Inferred from Biological aspect of Descendant (IBD)](http://wiki.geneontology.org/index.php/Inferred_from_Biological_aspect_of_Descendant_(IBD)) |\n", + "| **IKR** | [Inferred from Key Residues (IKR)](http://wiki.geneontology.org/index.php/Inferred_from_Key_Residues_(IKR)) |\n", + "| **IRD** | [Inferred from Rapid Divergence (IRD)](http://wiki.geneontology.org/index.php/Inferred_from_Rapid_Divergence(IRD)) |\n", + "| **ISS** | [Inferred from Sequence or Structural Similarity (ISS)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_or_structural_Similarity_(ISS)) |\n", + "| **ISO** | [Inferred from Sequence 
Orthology (ISO)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_Orthology_(ISO)) |\n",
+    "| **ISA** | [Inferred from Sequence Alignment (ISA)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_Alignment_(ISA)) |\n",
+    "| **ISM** | [Inferred from Sequence Model (ISM)](http://wiki.geneontology.org/index.php/Inferred_from_Sequence_Model_(ISM)) |\n",
+    "| **RCA** | [Inferred from Reviewed Computational Analysis (RCA)](http://wiki.geneontology.org/index.php/Inferred_from_Reviewed_Computational_Analysis_(RCA)) |\n",
+    "| **IEA** | [Inferred from Electronic Annotation (IEA)](http://wiki.geneontology.org/index.php/Inferred_from_Electronic_Annotation_(IEA)) |\n",
+    "| **TAS** | [Traceable Author Statement (TAS)](http://wiki.geneontology.org/index.php/Traceable_Author_Statement_(TAS)) |\n",
+    "| **NAS** | [Non-traceable Author Statement (NAS)](http://wiki.geneontology.org/index.php/Non-traceable_Author_Statement_(NAS)) |\n",
+    "| **IC** | [Inferred by Curator (IC)](http://wiki.geneontology.org/index.php/Inferred_by_Curator_(IC)) |\n",
+    "| **ND** | [No Biological Data Available (ND)](http://wiki.geneontology.org/index.php/No_biological_Data_available_(ND)_evidence_code) |\n",
+    "| **NR** | Not Recorded |\n",
+    "\n",
+    "### **Grouping of Codes**:\n",
+    "\n",
+    "- **Experimental Evidence Codes**:\n",
+    "  - **EXP**, **IDA**, **IPI**, **IMP**, **IGI**, **IEP**\n",
+    "\n",
+    "- **High-Throughput Experimental Codes**:\n",
+    "  - **HTP**, **HDA**, **HMP**, **HGI**, **HEP**\n",
+    "\n",
+    "- **Phylogenetically-Inferred Codes**:\n",
+    "  - **IBA**, **IBD**, **IKR**, **IRD**\n",
+    "\n",
+    "- **Author/Curator Inferred Codes**:\n",
+    "  - **TAS**, **IC**, **NAS**\n",
+    "\n",
+    "- **Computational Evidence Codes**:\n",
+    "  - **IEA**, **ISS**, **ISA**, **ISM**, **ISO**, **RCA**\n",
+    "\n",
+    "- **Others**:\n",
+    "  - **ND** (No Biological Data Available), **NR** (Not Recorded)\n",
+    "\n",
+    "These evidence codes ensure transparency and give researchers an understanding of how confident they can be in a particular GO annotation.\n",
+    "\n",
+    "__Note__: For more information on GO evidence codes, see [here](https://geneontology.org/docs/guide-go-evidence-codes/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c11d6f520b02434",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}