We can divide text tokenization into three stages:
- Stage 1: the original subword-based algorithms (e.g., BPE, WordPiece, Unigram, SentencePiece; note: the earlier BPE, as used in GPT-2, was already byte-level); see the sketch below.
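As a refresher on stage 1, here is a minimal sketch of the core byte-level BPE training loop (my own toy code in the spirit of the algorithm, not any library's API):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict:
    """Learn BPE merges over raw UTF-8 bytes (minimal sketch)."""
    ids = list(text.encode("utf-8"))        # start from byte ids 0..255
    merges = {}                             # (id, id) -> new token id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)    # most frequent adjacent pair
        new_id = 256 + i                    # new ids start after the 256 raw bytes
        merges[pair] = new_id
        out, j = [], 0                      # replace every occurrence of `pair`
        while j < len(ids):
            if j + 1 < len(ids) and (ids[j], ids[j + 1]) == pair:
                out.append(new_id); j += 2
            else:
                out.append(ids[j]); j += 1
        ids = out
    return merges

print(train_bpe("aaabdaaabac", 3))  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```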
- Stage 2: a more polished take on the subword-based approach (mainly minbpe, which Karpathy, who left OpenAI, released in February 2024: https://github.com/karpathy/minbpe . From a quick look, minbpe is more of a tutorial: a minimal and clean implementation of byte-level BPE rather than a new algorithm);
- Let's build the GPT Tokenizer (Video by Karpathy): https://www.youtube.com/watch?v=zduSFxRajkE
- larger vocabulary, fewer tokens (note: the vocabulary size is a hyperparameter, roughly 100k in GPT-4's cl100k_base), i.e., a higher compression ratio; see the sketch below
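A quick way to see the vocabulary-size/compression tradeoff is to count tokens for the same string under both encodings (a sketch using OpenAI's tiktoken package; the sample text is arbitrary):

```python
import tiktoken

text = "Tokenization turns raw text into the integer ids a model consumes."

for name in ("gpt2", "cl100k_base"):  # ~50k vs ~100k vocabulary
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    # compression ratio here = UTF-8 bytes per token (higher = fewer tokens)
    ratio = len(text.encode("utf-8")) / len(ids)
    print(f"{name}: {len(ids)} tokens, {ratio:.2f} bytes/token")
```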
- Forced splits using a regex (as in GPT-2), to keep merges from crossing category boundaries and thus avoid vocabulary entries such as `dog.`, `dog?`, `dog!`; see the sketch below
- tiktoken, officially provided by OpenAI: you can choose `gpt2` (does not merge spaces) or `cl100k_base` (GPT-4, merges consecutive spaces)
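The split pattern below is the one from OpenAI's gpt-2 encoder.py; it needs the third-party regex module for the \p{...} classes (the demo string is my own):

```python
import regex as re

# GPT-2's pre-tokenization pattern (openai/gpt-2, encoder.py)
gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(re.findall(gpt2_pat, "dog. dog! dog?"))
# ['dog', '.', ' dog', '!', ' dog', '?']
# punctuation is split off before BPE runs, so no merge can ever
# produce vocabulary entries like 'dog.' or 'dog!'
```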
- There are special tokens. For example, GPT-2 reserves `<|endoftext|>` (id 50256).
- If you like, you can also add special tokens such as FIM_PREFIX, FIM_SUFFIX, ENDOFPROMPT... (cl100k_base defines all three); see the sketch below
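tiktoken refuses to encode special tokens unless you opt in explicitly, which is worth knowing when experimenting (a small sketch):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# plain encode() raises on special tokens; opt in via allowed_special
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)              # [50256]
print(enc.decode(ids))  # '<|endoftext|>'
```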
- minbpe: you can follow the repo's exercise.md for the 5 steps required to build a tokenizer; a usage sketch follows
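For reference, the minbpe README shows usage roughly along these lines (a sketch; clone the repo first, and note that argument names may differ by version):

```python
from minbpe import BasicTokenizer  # from the cloned karpathy/minbpe repo

tokenizer = BasicTokenizer()
text = open("some_text.txt").read()    # placeholder corpus
tokenizer.train(text, vocab_size=512)  # 256 raw bytes + 256 learned merges

ids = tokenizer.encode("hello world")
assert tokenizer.decode(ids) == "hello world"
```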
- SentencePiece: it can both train and run inference for BPE tokenizers (it is used in the Llama and Mistral series). Note: it has a lot of training arguments
- e.g., `byte_fallback=True` (characters missing from the vocabulary decompose into raw UTF-8 byte pieces instead of `<unk>`) and `add_dummy_prefix=True` (a dummy whitespace is prepended so a word encodes the same at the start of text as elsewhere); see the training sketch below
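A minimal training call showing those two options (a sketch with the sentencepiece Python package; corpus.txt and the model prefix are placeholders):

```python
import sentencepiece as spm

# train a byte-fallback BPE model, Llama-style
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # one sentence per line
    model_prefix="tok",       # writes tok.model and tok.vocab
    vocab_size=8000,
    model_type="bpe",
    byte_fallback=True,       # unknown characters decompose into byte pieces
    add_dummy_prefix=True,    # prepend a dummy whitespace before encoding
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
print(sp.encode("hello world", out_type=str))  # e.g. ['▁hello', '▁world']
```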
- About vocab_size: a very important hyperparameter. Too small hurts compression; too large bloats the embedding and softmax layers and leaves rare tokens undertrained.
- Learning to Compress Prompts with Gist Tokens (Stanford, NeurIPS 2023): introduces new "gist" tokens that compress a long prompt into a much smaller set of tokens while keeping model performance on par with using the full prompt.
- Other modalities also benefit from this kind of tokenization (so you can reuse the same Transformer architecture on them)! e.g., (1) VQGAN, which gives both hard tokens (integers) and soft tokens (they do not have to be discrete, but pass through a bottleneck as in autoencoders); a sketch follows the next item
- (Sora) you can process either discrete tokens with an autoregressive model or soft tokens with diffusion models.
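To make "hard tokens" concrete, here is a tiny nearest-neighbor vector-quantization step of the kind used in VQ-VAE/VQGAN-style models (a numpy sketch; shapes and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))  # 512 learned code vectors = the "vocabulary"
z = rng.normal(size=(16, 64))          # 16 encoder outputs (soft, continuous)

# hard tokens: index of the nearest codebook vector per encoder output
dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (16, 512)
tokens = dists.argmin(axis=1)   # integers in [0, 512), just like text token ids
quantized = codebook[tokens]    # the embeddings the decoder actually sees

print(tokens[:8])  # an image becomes a sequence of integer ids
```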
- Stage 3: a further-advanced take on text tokenization, represented by MEGABYTE, https://github.com/lucidrains/MEGABYTE-pytorch (a new byte-level method mainly from Meta; in their own words: MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling)
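For a feel of the interface, the lucidrains README demonstrates a two-stage (global/local) byte model roughly as follows (a sketch from my reading of the README; exact arguments may differ by version):

```python
import torch
from MEGABYTE_pytorch import MEGABYTE

model = MEGABYTE(
    num_tokens = 256,         # raw bytes: no subword vocabulary at all
    dim = (768, 512),         # model dims for the (global, local) stages
    max_seq_len = (1024, 4),  # 1024 patches of 4 bytes = 4096-byte context
    depth = (6, 4),           # layers in the (global, local) stages
    dim_head = 64,
    heads = 8,
)

x = torch.randint(0, 256, (1, 1024, 4))  # batch of byte ids, already patched
loss = model(x, return_loss = True)      # autoregressive byte-level loss
loss.backward()
```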