Hi!
I want to build a vocabulary over a dataset containing more than 10M words.
To keep the vocabulary manageable, I want to cap it at, say, 1000 words, so I pass max_size when calling build_vocab:
TEXT.build_vocab(train, max_size=args.vocab_size)
With this, the vocabulary itself is indeed capped at 1000 words (plus unk and pad), but TEXT.vocab.freqs still has ~10M entries!
Even though the vocabulary is only 1000 words, the frequency table still covers every unique token in the original dataset. How can I handle this?
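For context, here is a minimal self-contained sketch of the behavior, together with the workaround I'm considering (manually pruning the Counter after build_vocab). This assumes the torchtext.data Field/Vocab API; the tiny corpus and max_size=2 are just stand-ins for my real train set and vocab size:

from collections import Counter
from torchtext.data import Field, Example, Dataset

TEXT = Field()  # default whitespace tokenization

# Tiny stand-in corpus; the real `train` has >10M words.
fields = [("text", TEXT)]
examples = [Example.fromlist([s], fields)
            for s in ["the cat sat", "the dog ran", "a cat ran"]]
train = Dataset(examples, fields)

TEXT.build_vocab(train, max_size=2)

print(TEXT.vocab.itos)        # ['<unk>', '<pad>'] plus the 2 most frequent words
print(len(TEXT.vocab.freqs))  # one Counter entry per unique token, not just the kept words

# Possible workaround: prune the Counter down to the retained words.
TEXT.vocab.freqs = Counter({w: TEXT.vocab.freqs[w]
                            for w in TEXT.vocab.itos if w in TEXT.vocab.freqs})
print(len(TEXT.vocab.freqs))  # now only the kept words

Is pruning freqs like this safe, or is there a built-in way to avoid keeping the full Counter around?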
Thanks