Hi!
I want to build a vocabulary over a dataset containing more than 10M words.
To keep the vocabulary manageable, I want to cap it at, say, 1000 words, so I pass max_size when calling build_vocab:
TEXT.build_vocab(train, max_size=args.vocab_size)
With this, the vocabulary itself is indeed capped at 1000 words (plus unk and pad), but TEXT.vocab.freqs still has ~10M entries!
Even though the vocabulary is only 1000 words, the frequency table still covers every unique token in the original dataset. How can I handle this?
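For context, here is a minimal self-contained sketch of the behavior, together with the workaround I'm considering (manually pruning the Counter after build_vocab). This assumes the torchtext.data Field/Vocab API; the tiny corpus and max_size=2 are just stand-ins for my real train set and vocab size:

from collections import Counter
from torchtext.data import Field, Example, Dataset

TEXT = Field()  # default whitespace tokenization

# Tiny stand-in corpus; the real `train` has >10M words.
fields = [("text", TEXT)]
examples = [Example.fromlist([s], fields)
            for s in ["the cat sat", "the dog ran", "a cat ran"]]
train = Dataset(examples, fields)

TEXT.build_vocab(train, max_size=2)

print(TEXT.vocab.itos)        # ['<unk>', '<pad>'] plus the 2 most frequent words
print(len(TEXT.vocab.freqs))  # one Counter entry per unique token, not just the kept words

# Possible workaround: prune the Counter down to the retained words.
TEXT.vocab.freqs = Counter({w: TEXT.vocab.freqs[w]
                            for w in TEXT.vocab.itos if w in TEXT.vocab.freqs})
print(len(TEXT.vocab.freqs))  # now only the kept words

Is pruning freqs like this safe, or is there a built-in way to avoid keeping the full Counter around?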
Thanks