
Fix perplexity and bring text modality to SOTA possible #90

Open

bhavul wants to merge 13 commits into master from text_fix

Conversation

@bhavul
Contributor

@bhavul bhavul commented May 26, 2024

This PR fixes the perplexity bug that has been haunting NEKO for months! :)

@bhavul bhavul requested a review from eihli May 26, 2024 17:23
When we added `padding='longest'` to the tokenization, it caused empty
strings to be padded to the same length as the longest sample. That
meant that `numel()` was never going to be 0.

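To make that concrete, here is a minimal sketch of the behaviour. The tokenizer and settings are assumptions for the example (NEKO's actual `text_tokenizer` may differ), but the effect is the same: once `padding='longest'` is applied across a batch, the row for an empty string gets padded with pad tokens, so its `numel()` is non-zero and the old check never fires.

```
# Hypothetical example -- the tokenizer choice is an assumption, not NEKO's exact setup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token

batch = ["", "a non-empty sample of text"]
out = tokenizer(batch, truncation=True, padding="longest", max_length=1024, return_tensors="pt")

# The empty string's row is padded to the longest sample's length,
# so checking numel() == 0 can never catch it.
print(out["input_ids"][0].numel())  # > 0 even though the source string was empty
```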
This commit moves the empty string check to the part of the code that
loads the dataset into memory.

Because of the way we're sampling indices in `sample_batch`...:

```
        sampled_indices = torch.randperm(len(dataset_split))[:batch_size]
        samples = dataset_split.select(sampled_indices)
        tokenized_outputs = self.text_tokenizer(samples['text'], truncation=True, padding="longest", max_length=self.context_length, return_tensors='pt')
```

... we were ending up with batches of varying sizes, depending on how
many empty text strings we sampled. We always sample 64 indices, for
example, but sometimes 20 of them might be empty strings and other
times 10, so the effective batch size varied from sample to sample. By
filtering when we load the dataset into memory, we guarantee that every
batch has the same size.
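
A minimal sketch of that load-time filter is below. The dataset name and column are assumptions for the example, not NEKO's actual config; the point is that empty `text` rows are dropped once, before any sampling, so `sample_batch` always selects `batch_size` non-empty rows.

```
# Hypothetical sketch -- the dataset name is an assumption, not NEKO's actual config.
from datasets import load_dataset

dataset_split = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Drop empty strings once, at load time, instead of checking numel() per batch.
dataset_split = dataset_split.filter(lambda example: len(example["text"].strip()) > 0)

# sample_batch can now rely on a fixed batch size:
#     sampled_indices = torch.randperm(len(dataset_split))[:batch_size]
#     samples = dataset_split.select(sampled_indices)
```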