FlexAttention? #1685
Comments
Can you add some more context here @johnnynunez?
I want to quantize the lerobot pizero model, which has FlexAttention. @drisspg, for context:
So currently all of our quantization APIs target linear layers and are orthogonal to FlexAttention, so yes, it should work. FlexAttention itself doesn't support low-precision inputs yet; that is planned, but there's no ETA.
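A minimal sketch of what "orthogonal" means here, assuming torchao's `quantize_` / `int8_weight_only` API and PyTorch >= 2.5 for `flex_attention`; the toy `Block` module and its shapes are made up for illustration. The quantization pass only rewrites the `nn.Linear` weights, while the `flex_attention` call is left untouched, so the two compose without special handling:

```python
import torch
import torch.nn as nn
from torch.nn.attention.flex_attention import flex_attention
from torchao.quantization import quantize_, int8_weight_only

class Block(nn.Module):
    """Toy attention block: two nn.Linear layers around a flex_attention call."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.heads = heads

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # flex_attention expects (batch, heads, seq, head_dim)
        split = lambda t: t.view(b, s, self.heads, d // self.heads).transpose(1, 2)
        out = flex_attention(split(q), split(k), split(v))
        return self.proj(out.transpose(1, 2).reshape(b, s, d))

model = Block().cuda().to(torch.bfloat16)
# Only the nn.Linear modules get quantized; flex_attention still sees bf16 q/k/v.
quantize_(model, int8_weight_only())
x = torch.randn(2, 128, 64, device="cuda", dtype=torch.bfloat16)
print(model(x).shape)  # torch.Size([2, 128, 64])
```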
Thanks! I'm going to try it.
Let me know if anything comes up!
@drisspg this is a good point -- what will happen with a low-precision input? Will it get upcast to bf16 for the actual matmul? If so, are you basically seeing VRAM savings but no time savings?
I have an example for doing this, and @danielvegamyhre is starting to investigate and ultimately make this a well-supported path. For an fp8 mm, the matmul itself runs in low precision and can in theory use the fp8 tensor cores on H100, with accumulation in higher precision.
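To make the accumulation point concrete, here is an illustrative sketch (not the torchao kernel): two bf16 matrices are quantized to `float8_e4m3fn` with per-tensor scales, and the product is accumulated in float32. On H100-class GPUs the fused equivalent would run on the fp8 tensor cores (e.g. via `torch._scaled_mm`); the dequantize-then-matmul below only emulates the numerics.

```python
import torch

def quantize_fp8(x: torch.Tensor):
    # Per-tensor scale mapping the max magnitude to the e4m3 max (~448).
    scale = x.abs().amax().float() / 448.0
    x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

a = torch.randn(64, 128, dtype=torch.bfloat16)
b = torch.randn(128, 32, dtype=torch.bfloat16)

a_fp8, a_scale = quantize_fp8(a)
b_fp8, b_scale = quantize_fp8(b)

# The inputs are stored in fp8 (memory savings); the matmul here is done in
# float32, mimicking the high-precision accumulation a real fp8 tensor-core
# mm performs before the result is cast back down.
out = (a_fp8.float() * a_scale) @ (b_fp8.float() * b_scale)
print(out.dtype, out.shape)  # torch.float32 torch.Size([64, 32])
```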
Is it compatible with FlexAttention from PyTorch 2.6.0?