Optimize Token Indexing in Reader class by Using Dictionary for Constant-Time Lookup #81

@aditya0by0

Description:

Currently, in ChemDataReader, we use a list to store the tokens from the token.txt file in memory. Each token's index is retrieved via a linear search through the list, i.e. O(n) per lookup.
To be precise, we first check whether the token exists in the list and then retrieve its index, which gives us 2 × O(n) per lookup:

with open(self.token_path, "r") as pk:
    self.cache = [x.strip() for x in pk]  # tokens held in a plain list
...
if str(token) not in self.cache:  # O(n) membership scan
    self.cache.append(str(token))
return self.cache.index(str(token)) + EMBEDDING_OFFSET  # second O(n) scan

Problem:

This design causes the overall complexity of encoding token indices during the data.pkl → data.pt preprocessing stage to be:

O(Dataset Size × Tokens per Input × Vocab Size)

This becomes especially inefficient for datasets with large vocabularies, such as protein sequences using trigram tokens (vocab sizes > 8000).
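
For scale (illustrative numbers, not measurements): with 100,000 inputs of ~300 trigram tokens each and a vocabulary of 8,000, the list-based lookup performs on the order of 100,000 × 300 × 8,000 ≈ 2.4 × 10¹¹ comparisons, while a dict-based lookup needs only 100,000 × 300 = 3 × 10⁷ hash lookups.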

Proposal:

Refactor to use a dict for token storage during preprocessing: tokens as keys, their corresponding indices as values (see the sketch below). This reduces token lookup to O(1), improving preprocessing performance significantly.
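
A minimal sketch of the proposed change, assuming the token_path / EMBEDDING_OFFSET names from the snippet above; the class and method names here are illustrative, not the final chebai API:

EMBEDDING_OFFSET = 10  # illustrative value, not the real constant


class DictTokenReader:
    def __init__(self, token_path):
        self.token_path = token_path
        # token -> index; dict preserves insertion order (Python 3.7+),
        # so index i still corresponds to line i of token.txt
        with open(self.token_path, "r") as pk:
            self.cache = {x.strip(): i for i, x in enumerate(pk)}

    def _get_token_index(self, token):
        # setdefault performs a single O(1) hash lookup: it returns the
        # existing index, or inserts the token with the next free index
        return self.cache.setdefault(str(token), len(self.cache)) + EMBEDDING_OFFSET

Unknown tokens are still appended at the end of the vocabulary (their index is the dict's current size), so the produced indices are identical to the list version.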

Benefits:

  • Improved time complexity:
    From
    O(Dataset Size × Tokens per Input × Vocab Size)
    to
    O(Dataset Size × Tokens per Input)

  • Order-preserving (with modest memory overhead):
    From Python 3.7 onward, dict preserves insertion order as a language guarantee (see What's New in Python 3.7)
    — helpful if the order of tokens matters when saving/loading; a short round-trip sketch follows this list.
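
For illustration, writing the cache back out reproduces the original file order, since dict iteration follows insertion order (save_tokens is a hypothetical helper, not part of the current codebase):

def save_tokens(cache, token_path):
    # iterating a dict yields keys in insertion order (Python 3.7+),
    # so tokens are written back in the same order they were indexed
    with open(token_path, "w") as f:
        for token in cache:
            f.write(token + "\n")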
