Description:
Currently, in ChemDataReader, the tokens from the token.txt file are stored in memory as a list. Each token's index is retrieved by a linear search through that list, i.e. O(n) per lookup. To be specific, we first check whether the token exists in the list and then retrieve its index, which costs two O(n) scans per token.
    with open(self.token_path, "r") as pk:
        self.cache = [x.strip() for x in pk]  # tokens stored in a list
    ...
    if not str(token) in self.cache:  # O(n) membership scan
        self.cache.append(str(token))
    return self.cache.index(str(token)) + EMBEDDING_OFFSET  # second O(n) scan

Problem:
This design makes the overall complexity of the preprocessing stage that encodes token indices (data.pkl to data.pt):

O(Dataset Size × Tokens per Input × Vocab Size)

This becomes especially inefficient for datasets with large vocabularies, such as protein sequences tokenized as trigrams (vocab sizes > 8000).
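To make the cost concrete, a rough micro-benchmark (hypothetical, not part of the codebase) along these lines shows the gap at the vocab sizes mentioned above:

    import timeit

    vocab = [f"tok{i}" for i in range(8000)]          # list, as used today
    vocab_index = {t: i for i, t in enumerate(vocab)}  # proposed dict

    token = "tok7999"  # worst case: the last token in the list

    # list: membership test + index() -> two O(n) scans per lookup
    list_time = timeit.timeit(
        lambda: vocab.index(token) if token in vocab else None, number=1000)

    # dict: a single O(1) hash lookup
    dict_time = timeit.timeit(lambda: vocab_index.get(token), number=1000)

    print(f"list lookup: {list_time:.4f}s   dict lookup: {dict_time:.4f}s")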
Proposal:
Refactor the preprocessing code to store tokens in a dict, with tokens as keys and their corresponding indices as values. This reduces each token lookup to O(1) on average and significantly improves preprocessing performance.
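A minimal sketch of the proposed change, reusing the self.token_path, self.cache, and EMBEDDING_OFFSET names from the current code (the surrounding control flow is illustrative):

    with open(self.token_path, "r") as pk:
        # token -> index, built in one pass; the line number becomes the index
        self.cache = {x.strip(): i for i, x in enumerate(pk)}
    ...
    index = self.cache.get(str(token))  # single O(1) hash lookup
    if index is None:
        # unseen token: assign the next free index, as append() did before
        index = len(self.cache)
        self.cache[str(token)] = index
    return index + EMBEDDING_OFFSET

With this shape, the membership check and the index retrieval collapse into a single hash lookup.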
Benefits:
- Improved time complexity: from O(Dataset Size × Tokens per Input × Vocab Size) to O(Dataset Size × Tokens per Input).
- Memory-efficient and order-preserving: from Python 3.7+, dict preserves insertion order (PEP 468, What's New in Python 3.7), which is helpful if the order of tokens matters when saving/loading; see the sketch after this list.
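As a quick illustration of the ordering guarantee (toy tokens, not the real vocabulary): iterating the dict yields tokens in insertion order, so the vocabulary can be written back to token.txt in exactly the layout the list gave us:

    tokens = {"C": 0, "N": 1, "O": 2}   # built in insertion order
    tokens["[nH]"] = len(tokens)        # new token gets the next index, 3

    with open("token.txt", "w") as f:
        for token in tokens:            # iteration follows insertion order
            f.write(token + "\n")       # line position == index, as with the list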