We can divide text tokenization into three stages:
- Stage 1: the original subword-based algorithms (e.g., BPE, WordPiece, Unigram, SentencePiece; note: the earlier BPE, as used in GPT-2, was already byte-level); see the sketch below.
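As a refresher on stage 1, here is a minimal sketch of the core byte-level BPE training loop (my own toy code in the spirit of the algorithm, not any library's API):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict:
    """Learn BPE merges over raw UTF-8 bytes (minimal sketch)."""
    ids = list(text.encode("utf-8"))        # start from byte ids 0..255
    merges = {}                             # (id, id) -> new token id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)    # most frequent adjacent pair
        new_id = 256 + i                    # new ids start after the 256 raw bytes
        merges[pair] = new_id
        out, j = [], 0                      # replace every occurrence of `pair`
        while j < len(ids):
            if j + 1 < len(ids) and (ids[j], ids[j + 1]) == pair:
                out.append(new_id); j += 2
            else:
                out.append(ids[j]); j += 1
        ids = out
    return merges

print(train_bpe("aaabdaaabac", 3))  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```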
- Stage 2: a more polished take on the subword-based approach (mainly minbpe, which Karpathy, who left OpenAI, released in February 2024: https://github.com/karpathy/minbpe . From a quick look, minbpe is more of a tutorial: a minimal and clean implementation of byte-level BPE rather than a new algorithm);
- Let's build the GPT Tokenizer (Video by Karpathy): https://www.youtube.com/watch?v=zduSFxRajkE
- larger vocabulary, fewer tokens (note: the vocabulary size is a hyperparameter, roughly 100k in GPT-4's cl100k_base), i.e., a higher compression ratio; see the sketch below
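A quick way to see the vocabulary-size/compression tradeoff is to count tokens for the same string under both encodings (a sketch using OpenAI's tiktoken package; the sample text is arbitrary):

```python
import tiktoken

text = "Tokenization turns raw text into the integer ids a model consumes."

for name in ("gpt2", "cl100k_base"):  # ~50k vs ~100k vocabulary
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    # compression ratio here = UTF-8 bytes per token (higher = fewer tokens)
    ratio = len(text.encode("utf-8")) / len(ids)
    print(f"{name}: {len(ids)} tokens, {ratio:.2f} bytes/token")
```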
- Forced splits using a regex (as in GPT-2), to keep merges from crossing category boundaries and thus avoid vocabulary entries such as `dog.`, `dog?`, `dog!`; see the sketch below
- tiktoken, officially provided by OpenAI: you can choose `gpt2` (does not merge spaces) or `cl100k_base` (GPT-4, merges consecutive spaces)
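The split pattern below is the one from OpenAI's gpt-2 encoder.py; it needs the third-party regex module for the \p{...} classes (the demo string is my own):

```python
import regex as re

# GPT-2's pre-tokenization pattern (openai/gpt-2, encoder.py)
gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(re.findall(gpt2_pat, "dog. dog! dog?"))
# ['dog', '.', ' dog', '!', ' dog', '?']
# punctuation is split off before BPE runs, so no merge can ever
# produce vocabulary entries like 'dog.' or 'dog!'
```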
- There are special tokens. For example, GPT-2 reserves `<|endoftext|>` (id 50256).
- If you like, you can also add special tokens such as FIM_PREFIX, FIM_SUFFIX, ENDOFPROMPT... (cl100k_base defines all three); see the sketch below
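tiktoken refuses to encode special tokens unless you opt in explicitly, which is worth knowing when experimenting (a small sketch):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# plain encode() raises on special tokens; opt in via allowed_special
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)              # [50256]
print(enc.decode(ids))  # '<|endoftext|>'
```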
- minbpe: you can follow the repo's exercise.md for the 5 steps required to build a tokenizer; a usage sketch follows
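For reference, the minbpe README shows usage roughly along these lines (a sketch; clone the repo first, and note that argument names may differ by version):

```python
from minbpe import BasicTokenizer  # from the cloned karpathy/minbpe repo

tokenizer = BasicTokenizer()
text = open("some_text.txt").read()    # placeholder corpus
tokenizer.train(text, vocab_size=512)  # 256 raw bytes + 256 learned merges

ids = tokenizer.encode("hello world")
assert tokenizer.decode(ids) == "hello world"
```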
- SentencePiece: it can both train and run inference for BPE tokenizers (it is used in the Llama and Mistral series). Note: it has a lot of training arguments
- e.g., `byte_fallback=True` (characters missing from the vocabulary decompose into raw UTF-8 byte pieces instead of `<unk>`) and `add_dummy_prefix=True` (a dummy whitespace is prepended so a word encodes the same at the start of text as elsewhere); see the training sketch below
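A minimal training call showing those two options (a sketch with the sentencepiece Python package; corpus.txt and the model prefix are placeholders):

```python
import sentencepiece as spm

# train a byte-fallback BPE model, Llama-style
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # one sentence per line
    model_prefix="tok",       # writes tok.model and tok.vocab
    vocab_size=8000,
    model_type="bpe",
    byte_fallback=True,       # unknown characters decompose into byte pieces
    add_dummy_prefix=True,    # prepend a dummy whitespace before encoding
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
print(sp.encode("hello world", out_type=str))  # e.g. ['▁hello', '▁world']
```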
- About vocab_size: a very important hyperparameter. Too small hurts compression; too large bloats the embedding and softmax layers and leaves rare tokens undertrained.
- Learning to Compress Prompts with Gist Tokens (Stanford, NeurIPS 2023): introduces new "gist" tokens that compress a long prompt into a much smaller set of tokens while keeping model performance on par with using the full prompt.
- Other modalities also benefit from this kind of tokenization (so you can reuse the same Transformer architecture on them)! e.g., (1) VQGAN, which gives both hard tokens (integers) and soft tokens (they do not have to be discrete, but pass through a bottleneck as in autoencoders); a sketch follows the next item
- (Sora) you can process either discrete tokens with an autoregressive model or soft tokens with diffusion models.
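To make "hard tokens" concrete, here is a tiny nearest-neighbor vector-quantization step of the kind used in VQ-VAE/VQGAN-style models (a numpy sketch; shapes and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))  # 512 learned code vectors = the "vocabulary"
z = rng.normal(size=(16, 64))          # 16 encoder outputs (soft, continuous)

# hard tokens: index of the nearest codebook vector per encoder output
dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (16, 512)
tokens = dists.argmin(axis=1)   # integers in [0, 512), just like text token ids
quantized = codebook[tokens]    # the embeddings the decoder actually sees

print(tokens[:8])  # an image becomes a sequence of integer ids
```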
- Stage 3: a further-advanced take on text tokenization, represented by MEGABYTE, https://github.com/lucidrains/MEGABYTE-pytorch (a new byte-level method mainly from Meta; in their own words: MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling)
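For a feel of the interface, the lucidrains README demonstrates a two-stage (global/local) byte model roughly as follows (a sketch from my reading of the README; exact arguments may differ by version):

```python
import torch
from MEGABYTE_pytorch import MEGABYTE

model = MEGABYTE(
    num_tokens = 256,         # raw bytes: no subword vocabulary at all
    dim = (768, 512),         # model dims for the (global, local) stages
    max_seq_len = (1024, 4),  # 1024 patches of 4 bytes = 4096-byte context
    depth = (6, 4),           # layers in the (global, local) stages
    dim_head = 64,
    heads = 8,
)

x = torch.randint(0, 256, (1, 1024, 4))  # batch of byte ids, already patched
loss = model(x, return_loss = True)      # autoregressive byte-level loss
loss.backward()
```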