Description:
Currently, in ChemDataReader, the tokens from the token.txt file are stored in memory as a list. Each token's index is retrieved by a linear search through that list, i.e. O(n) per lookup. To be specific, we first check whether the token exists in the list and then retrieve its index, which costs two O(n) scans per token.
    with open(self.token_path, "r") as pk:
        self.cache = [x.strip() for x in pk]  # tokens stored in a list
    ...
    if not str(token) in self.cache:  # O(n) membership scan
        self.cache.append(str(token))
    return self.cache.index(str(token)) + EMBEDDING_OFFSET  # second O(n) scan

Problem:
This design makes the overall complexity of the preprocessing stage that encodes token indices (data.pkl to data.pt):

O(Dataset Size × Tokens per Input × Vocab Size)

This becomes especially inefficient for datasets with large vocabularies, such as protein sequences tokenized as trigrams (vocab sizes > 8000).
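To make the cost concrete, a rough micro-benchmark (hypothetical, not part of the codebase) along these lines shows the gap at the vocab sizes mentioned above:

    import timeit

    vocab = [f"tok{i}" for i in range(8000)]          # list, as used today
    vocab_index = {t: i for i, t in enumerate(vocab)}  # proposed dict

    token = "tok7999"  # worst case: the last token in the list

    # list: membership test + index() -> two O(n) scans per lookup
    list_time = timeit.timeit(
        lambda: vocab.index(token) if token in vocab else None, number=1000)

    # dict: a single O(1) hash lookup
    dict_time = timeit.timeit(lambda: vocab_index.get(token), number=1000)

    print(f"list lookup: {list_time:.4f}s   dict lookup: {dict_time:.4f}s")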
Proposal:
Refactor the preprocessing code to store tokens in a dict, with tokens as keys and their corresponding indices as values. This reduces each token lookup to O(1) on average and significantly improves preprocessing performance.
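A minimal sketch of the proposed change, reusing the self.token_path, self.cache, and EMBEDDING_OFFSET names from the current code (the surrounding control flow is illustrative):

    with open(self.token_path, "r") as pk:
        # token -> index, built in one pass; the line number becomes the index
        self.cache = {x.strip(): i for i, x in enumerate(pk)}
    ...
    index = self.cache.get(str(token))  # single O(1) hash lookup
    if index is None:
        # unseen token: assign the next free index, as append() did before
        index = len(self.cache)
        self.cache[str(token)] = index
    return index + EMBEDDING_OFFSET

With this shape, the membership check and the index retrieval collapse into a single hash lookup.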
Benefits:
- Improved time complexity: from O(Dataset Size × Tokens per Input × Vocab Size) to O(Dataset Size × Tokens per Input).
- Memory-efficient and order-preserving: from Python 3.7+, dict preserves insertion order (PEP 468, What's New in Python 3.7), which is helpful if the order of tokens matters when saving/loading; see the sketch after this list.
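As a quick illustration of the ordering guarantee (toy tokens, not the real vocabulary): iterating the dict yields tokens in insertion order, so the vocabulary can be written back to token.txt in exactly the layout the list gave us:

    tokens = {"C": 0, "N": 1, "O": 2}   # built in insertion order
    tokens["[nH]"] = len(tokens)        # new token gets the next index, 3

    with open("token.txt", "w") as f:
        for token in tokens:            # iteration follows insertion order
            f.write(token + "\n")       # line position == index, as with the list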