Preprocessing of pretraining data

Hi, thanks for open-sourcing such wonderful work.

Regarding the preprocessing of the pretraining data: did you apply the [template prompts from ULIP](https://github.com/salesforce/ULIP/blob/main/data/templates.json) to the raw texts and to the BLIP- and MSFT-generated captions, but not to the retrieved texts? As far as I can tell, each *.npy file in the pretraining data contains "original" and "prompt_avg" embeddings for text_feat, blip_caption_feat, and msft_caption_feat, while retrieval_text_feat has no "prompt_avg" version.
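For reference, this is roughly how I checked which versions each field carries. It is only a sketch: it assumes each *.npy file holds a pickled dict and that each text field is either a dict or a list of dicts keyed by version name, which may not match the actual layout exactly. The helper names `load_sample` and `summarize_versions` are my own.

```python
import numpy as np

# Field names as they appear in the released pretraining files.
FIELDS = ("text_feat", "blip_caption_feat", "msft_caption_feat", "retrieval_text_feat")


def load_sample(path):
    # Assumption: the file was written with np.save on a dict, which yields
    # a 0-d object array; .item() recovers the dict.
    return np.load(path, allow_pickle=True).item()


def summarize_versions(sample):
    """Return {field: sorted version keys} for every text field present."""
    summary = {}
    for field in FIELDS:
        if field not in sample:
            continue
        value = sample[field]
        # Assumption: a field is one dict or a list/tuple of dicts.
        entries = value if isinstance(value, (list, tuple)) else [value]
        versions = set()
        for entry in entries:
            if isinstance(entry, dict):
                versions.update(entry.keys())
        summary[field] = sorted(versions)
    return summary
```

Calling `summarize_versions(load_sample(path))` on a file should then show whether "prompt_avg" is missing only for retrieval_text_feat.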

Also, could you give me a hint on where to download the thumbnail images used to extract the thumbnail features? The released pretraining data includes only the extracted thumbnail embeddings, not the images themselves.

If you could share the full preprocessing script used to extract the text and image embeddings, it would be a great help!
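To make the question concrete, here is my guess at how a "prompt_avg" embedding might be computed. This is a sketch under assumptions, not your actual pipeline: I assume each template contains a `{}` placeholder, that per-prompt features are L2-normalized before averaging, and that the mean is normalized again. `prompt_avg_embedding` and the `encode` callable are hypothetical stand-ins for whichever text encoder you used.

```python
import numpy as np


def prompt_avg_embedding(text, templates, encode):
    """Average the embeddings of `text` inserted into each prompt template.

    `encode` is a hypothetical callable mapping a list of strings to an
    array-like of shape (len(prompts), dim).
    """
    # Assumption: every template has exactly one "{}" placeholder.
    prompts = [template.format(text) for template in templates]
    feats = np.asarray(encode(prompts), dtype=np.float32)
    # Assumption: normalize each prompt embedding, then average, then renormalize.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    mean = feats.mean(axis=0)
    return mean / np.linalg.norm(mean)
```

If the real preprocessing differs (for example, no renormalization, or a different set of templates per caption source), that is exactly the detail I am hoping to confirm.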

Preprocessing of pretraining data #43

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Preprocessing of pretraining data #43

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions