A couple vocab / field features:
- Would love for Fields to be pickle'able. I understand there are other issues involved, but we really can't be creating the vocab on the fly; it's much too slow.
I think something like this would work?
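from collections import defaultdict

import torchtext

def _getstate(self):
    # Store stoi as a plain dict; a defaultdict with a lambda factory cannot be pickled.
    return dict(self.__dict__, stoi=dict(self.stoi))

def _setstate(self, state):
    self.__dict__.update(state)
    # Rebuild the defaultdict so unknown tokens map to index 0 again.
    self.stoi = defaultdict(lambda: 0, self.stoi)

torchtext.vocab.Vocab.__getstate__ = _getstate
torchtext.vocab.Vocab.__setstate__ = _setstate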
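To check the idea without touching torchtext, here is a minimal stdlib-only sketch of the same round trip. `TinyVocab` is a hypothetical stand-in class, not part of torchtext; it only assumes the vocab keeps `stoi` as a `defaultdict` with a lambda factory, which is the part pickle cannot serialize.

```python
import pickle
from collections import defaultdict


class TinyVocab:
    """Hypothetical stand-in for a vocab whose stoi is a lambda-backed defaultdict."""

    def __init__(self, tokens):
        self.itos = list(tokens)
        # Unknown tokens map to index 0; the lambda factory is what pickle rejects.
        self.stoi = defaultdict(lambda: 0, {t: i for i, t in enumerate(self.itos)})

    def __getstate__(self):
        # Swap the defaultdict for a plain dict so pickle never sees the lambda.
        return dict(self.__dict__, stoi=dict(self.stoi))

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Restore the default-to-0 lookup after loading.
        self.stoi = defaultdict(lambda: 0, self.stoi)


vocab = TinyVocab(["<unk>", "<pad>", "hello", "world"])
restored = pickle.loads(pickle.dumps(vocab))
```

After the round trip, `restored.stoi` is a `defaultdict` again, so known tokens keep their indices and unseen tokens still fall back to 0.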
- Would be neat to support two fields sharing a vocab. For instance, we might want separate input and output fields in a language model that share a weight matrix, or a single vocab shared between source and target. We currently do this ourselves, but it breaks a lot when things like the specials change. It would be great to have it in the library.
from collections import Counter

import torchtext

# UNK_WORD, PAD_WORD, BOS_WORD and EOS_WORD are assumed to be defined elsewhere.
def merge_vocabs(vocabs, vocab_size=None):
    """
    Merge individual vocabularies (assumed to be generated from disjoint
    documents) into a larger vocabulary.

    Args:
        vocabs: `torchtext.vocab.Vocab` vocabularies to be merged
        vocab_size: `int` the final vocabulary size. `None` for no limit.
    Returns:
        `torchtext.vocab.Vocab`
    """
    # Sum the per-vocab frequency counters into a single Counter.
    merged = sum([vocab.freqs for vocab in vocabs], Counter())
    return torchtext.vocab.Vocab(merged,
                                 specials=[UNK_WORD, PAD_WORD,
                                           BOS_WORD, EOS_WORD],
                                 max_size=vocab_size)
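For illustration, the merge step itself can be sketched with only the standard library. Everything here is an assumption made for the example: `merge_counts` is a hypothetical helper, the special tokens are placeholders, and having `max_size` count only non-special tokens is this sketch's own choice rather than a statement about how torchtext behaves.

```python
from collections import Counter


def merge_counts(counters, specials=("<unk>", "<pad>"), max_size=None):
    """Sum token counts, then build one itos/stoi pair that both fields can share."""
    merged = sum(counters, Counter())
    # Specials are placed first, so drop them from the counted tokens.
    for tok in specials:
        merged.pop(tok, None)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_size is not None:
        ranked = ranked[:max_size]
    itos = list(specials) + [tok for tok, _ in ranked]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


src_freqs = Counter("the cat sat on the mat".split())
tgt_freqs = Counter("the dog sat".split())
itos, stoi = merge_counts([src_freqs, tgt_freqs])
```

Because the specials always take the first indices, both fields see the same index for every token, and changing the specials list in one place changes it for both.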