On Vocab Factory functions behavior #1305

Open
@parmeet

Related discussion #1016
Related PRs #1304, #1302


torchtext provides several factory functions for constructing a Vocab object. The primary ways to construct a vocabulary are:

  1. Reading raw text from a file, followed by tokenization to obtain token entries
  2. Reading token entries directly from a file
  3. Through an iterator that yields iterators or lists of tokens
  4. Through a user-supplied ordered dictionary that maps tokens to their occurrence frequencies
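
To make the fourth route concrete, here is a minimal sketch of how such an ordered dictionary could be prepared with the standard library alone. The toy corpus and the tie-breaking rule are assumptions made for illustration; the factory function itself is not called here.

```python
from collections import Counter, OrderedDict

# Hypothetical toy corpus that is already tokenized (illustration only).
token_lists = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]

# Count how often each token occurs across all token lists.
counter = Counter(token for tokens in token_lists for token in tokens)

# Sort by descending frequency and break ties alphabetically so that the
# resulting order is deterministic. The tie-breaking rule is an assumption.
sorted_by_freq = sorted(counter.items(), key=lambda item: (-item[1], item[0]))

# Ordered mapping from token to occurrence frequency.
ordered_dict = OrderedDict(sorted_by_freq)
```

The insertion order of the dictionary is what would determine token indices, which is why the sort is made explicit.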

Typically, a vocabulary not only numericalizes the supplied tokens but also provides indices for special cases, for example when the queried token is out of vocabulary (OOV) or when indices are needed for special positions such as padding, masking, and sentence beginning and end.

As NLP is evolving fast, the research and applied communities alike will find novel and creative ways to push the frontiers of the field. Hence, as a platform provider for NLP research and applications, it is best not to make assumptions about special symbols, including the unknown token. We will provide the aforementioned factory functions with minimal API requirements, and we expect the user to set the special symbols and the fallback index through the low-level APIs of the Vocab class.

Below are examples of a few scenarios and use cases.

Note that querying an OOV token through a Vocab object without setting a default index raises a RuntimeError. Hence it is necessary to set the default index explicitly through the API, unless the user wants to handle the runtime error as and when it happens. In the examples below, we set the default index to the index of the <unk> token.
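
The fallback behavior described in the note can be illustrated with a small stand-in class. ToyVocab is hypothetical and is not the torchtext Vocab class; it only mimics the lookup semantics stated above, namely that an OOV lookup raises RuntimeError until a default index has been set.

```python
class ToyVocab:
    """Hypothetical stand-in for illustration only; not the torchtext Vocab class."""

    def __init__(self, tokens):
        # Map each token to its position in the supplied list.
        self._stoi = {token: index for index, token in enumerate(tokens)}
        self._default_index = None  # no fallback until the user sets one

    def set_default_index(self, index):
        self._default_index = index

    def __contains__(self, token):
        return token in self._stoi

    def __getitem__(self, token):
        if token in self._stoi:
            return self._stoi[token]
        if self._default_index is None:
            # Mirrors the behavior described in the note: no silent fallback.
            raise RuntimeError(f"Token {token!r} not found and default index is not set")
        return self._default_index


toy = ToyVocab(["<unk>", "hello", "world"])

# Before a default index is set, an OOV lookup raises RuntimeError.
try:
    toy["unseen"]
    raised = False
except RuntimeError:
    raised = True

# After the default index is set to the <unk> index, OOV lookups fall back to it.
toy.set_default_index(toy["<unk>"])
oov_index = toy["unseen"]
```

In-vocabulary lookups are unaffected by the default index; only tokens missing from the mapping use the fallback.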

Example 1: Creating a Vocab from a text file and explicitly handling special symbols and the fallback scenario

from torchtext.vocab import build_vocab_from_text_file
# The factory function makes no assumptions about special symbols.
vocab = build_vocab_from_text_file("path/to/raw_text.txt", min_freq=1)
special_symbols = {'<unk>': 0, '<pad>': 1, '<s>': 2, '</s>': 3}
default_index = special_symbols['<unk>']
for token, index in special_symbols.items():
    if token in vocab:
        # The symbol already occurs in the text: move it to the desired index.
        vocab.reassign_token(token, index)
    else:
        vocab.insert_token(token, index)
# Fall back to the <unk> index for OOV tokens.
vocab.set_default_index(default_index)

Example 2: Reading a vocab directly from a file that contains all the special symbols and setting the fallback index to the unknown token

from torchtext.vocab import build_vocab_from_file
unk_token = '<unk>'
vocab = build_vocab_from_file("path/to/tokens.txt", min_freq=1)
# The token file is expected to already contain the unknown token.
assert unk_token in vocab
vocab.set_default_index(vocab[unk_token])

Example 3: Building a Vocab from an iterator and explicitly adding special symbols and the fallback index

from torchtext.vocab import build_vocab_from_iterator
unk_token = '<unk>'
special_symbols = {'<unk>': 0, '<pad>': 1, '<s>': 2, '</s>': 3}
# iter_obj is an iterator that yields iterators or lists of tokens.
vocab = build_vocab_from_iterator(iter_obj, min_freq=1)
for token, index in special_symbols.items():
    if token in vocab:
        vocab.reassign_token(token, index)
    else:
        vocab.insert_token(token, index)
vocab.set_default_index(vocab[unk_token])

Example 4: Creating a vocab from a user-supplied ordered dictionary that also contains all the special symbols

from torchtext.vocab import vocab as vocab_factory
unk_token = '<unk>'
# ordered_dict maps tokens to their occurrence frequencies.
vocab = vocab_factory(ordered_dict, min_freq=1)
assert unk_token in vocab
vocab.set_default_index(vocab[unk_token])

Furthermore, the legacy Vocab class constructor provides additional arguments for building a Vocab from Counters. It supports adding special symbols directly through input arguments rather than through calls to low-level APIs.

We would love to hear from our users and community whether the factory functions above strike a good trade-off between flexibility and abstraction, or whether users would prefer to handle special symbols and the default index through API arguments instead of explicitly calling the low-level APIs of the Vocab class.

with @cpuhrsch

cc: @hudeven, @snisarg, @dongreenberg
