Hi Vaex team,
Thank you for maintaining Vaex and making out-of-core DataFrames practical. Tooling like this is common in large RAG / LLM stacks, where many millions of documents or events have to be processed before they are embedded.
I maintain an MIT-licensed project called WFGY Problem Map, a 16-question diagnostic checklist for debugging RAG / LLM pipelines in production. It focuses on failure modes in ingestion, chunking, indexing, and evaluation, especially at larger scales.
Why this might be relevant for Vaex users:
- Vaex is often chosen when the corpus is much larger than memory and has to be streamed and transformed before feeding a vector store.
- Several of the 16 failure modes concern silent coverage gaps that arise when huge tables are processed and later become retrieval corpora.
- The checklist is framework-agnostic, so it can be used with Vaex without any integration effort.
WFGY Problem Map has been referenced by:
- ToolUniverse (Harvard MIMS Lab)
- Multimodal RAG Survey (QCRI LLM Lab)
- Rankify (University of Innsbruck)
Suggestion:
If you think it might help your users, would you consider adding a short external link in your docs for RAG / LLM use cases?
“RAG / LLM debugging checklist: WFGY Problem Map (16 failure modes)”
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
Project home: https://github.com/onestardao/WFGY
Thanks a lot for considering this, and for all your work on Vaex.
Best,
PSBigBig