max_size for build_vocab  #343

Open
@simon555

Hi!

I want to build a vocabulary over a dataset containing more than 10M words.
To do this, I first want to limit the vocabulary to, say, 1000 words, so when I call the build_vocab function, I specify:
TEXT.build_vocab(train, max_size=args.vocab_size)

However, with this method, the vocab does contain 1000 words (plus unk and pad), but TEXT.vocab.freqs still has about 10M entries!
Even though the vocabulary is limited to 1000 words, the freqs table is still as large as it would be for the full dataset. How should I handle this?

Thanks
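
For reference, one possible workaround is sketched below. It assumes that TEXT.vocab.freqs is a plain collections.Counter and that TEXT.vocab.itos lists the tokens kept in the vocabulary. The helper name prune_freqs is hypothetical and not part of torchtext, and the toy freqs and itos values only stand in for the real attributes.

```python
from collections import Counter


def prune_freqs(freqs, kept_tokens):
    """Return a new Counter holding counts only for tokens in kept_tokens.

    Hypothetical helper, not a torchtext function.
    """
    kept = set(kept_tokens)
    return Counter({tok: n for tok, n in freqs.items() if tok in kept})


# Toy stand-ins for TEXT.vocab.freqs and TEXT.vocab.itos (assumed shapes).
freqs = Counter("the cat the cat the dog sat".split())
itos = ["<unk>", "<pad>"] + [tok for tok, _ in freqs.most_common(2)]

# Keep counts only for tokens that survived the max_size cut.
small = prune_freqs(freqs, itos)
print(len(freqs), len(small))
```

If those assumptions hold, assigning the result back (TEXT.vocab.freqs = prune_freqs(TEXT.vocab.freqs, TEXT.vocab.itos)) would shrink the table, at the cost of losing the counts for the dropped tokens.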
