Commit 26f6e18

Mention need for whitespace tokenization
1 parent df70b05 commit 26f6e18

1 file changed, 1 addition and 1 deletion

README

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ http://nlp.stanford.edu/projects/glove/
 
 This package includes four main tools:
 1) vocab_count
-Constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.
+Constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count. This file should already consist of whitespace-separated tokens. Use something like the Stanford Tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) first on raw text.
 2) cooccur
 Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by 'vocab_count', and may specify a variety of parameters, as described by running './cooccur'.
 3) shuffle