Skip test_mxfp on A770 #3811

Draft
wants to merge 1 commit into
base: main

alexbaden
Contributor

The A770 job's runtime recently increased to 5+ hours. This PR checks whether skipping test_mxfp brings the job back to a reasonable duration.

@alexbaden
Contributor Author

It looks like the problem might be elsewhere:

2025-04-02T00:46:28.2125909Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_roundtrip[64-128] 
2025-04-02T00:46:28.2172047Z language/test_mxfp.py::TestMXFP4Tensor::test_roundtrip[128-256] 
2025-04-02T00:46:28.2177417Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_roundtrip[128-256] 
2025-04-02T00:46:28.2177809Z language/test_mxfp.py::TestMXFP4Tensor::test_packed_tensor[64-128-0] 
2025-04-02T00:46:28.2178219Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_packed_tensor[64-128-0] 
2025-04-02T00:46:28.2183865Z language/test_mxfp.py::TestMXFP4Tensor::test_packed_tensor[64-128-1] 
2025-04-02T00:46:28.2189466Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_packed_tensor[64-128-1] 
2025-04-02T00:46:28.2268489Z language/test_mxfp.py::TestMXFP4Tensor::test_padding 
2025-04-02T00:46:28.2273992Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_padding 
2025-04-02T00:46:28.2345226Z language/test_mxfp.py::TestMXFP4Tensor::test_zero_values 
2025-04-02T00:46:28.2347085Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_zero_values 
2025-04-02T00:46:28.2404407Z language/test_mxfp.py::TestMXFP4Tensor::test_out_of_range_values 
2025-04-02T00:46:28.2410500Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_out_of_range_values 
2025-04-02T00:46:28.2471944Z language/test_mxfp.py::TestMXFP4Tensor::test_subnormal_numbers 
2025-04-02T00:46:28.2478344Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_subnormal_numbers 
2025-04-02T00:46:28.2541177Z language/test_mxfp.py::TestMXFP4Tensor::test_rounding_edge_cases 
2025-04-02T00:46:28.2545330Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_rounding_edge_cases 
2025-04-02T00:46:28.2603290Z language/test_mxfp.py::TestMXFP4Tensor::test_negative_values 
2025-04-02T00:46:28.2611305Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_negative_values 
2025-04-02T00:46:28.2674351Z language/test_mxfp.py::TestMXFP4Tensor::test_negative_out_of_range 
2025-04-02T00:46:28.2678580Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_negative_out_of_range 
2025-04-02T00:46:28.2687468Z language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape0-0] 
2025-04-02T00:46:28.2693143Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape0-0] 
2025-04-02T00:46:28.2714263Z language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape1-0] 
2025-04-02T00:46:28.2718110Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape1-0] 
2025-04-02T00:46:28.2741327Z language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape2-1] 
2025-04-02T00:46:28.2744587Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape2-1] 
2025-04-02T00:46:28.3325006Z language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape3-2] 
2025-04-02T00:46:28.3350421Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_packing[shape3-2] 
2025-04-02T00:46:28.3361875Z language/test_mxfp.py::TestMXFP4Tensor::test_packing_with_padding 
2025-04-02T00:46:28.3369208Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_packing_with_padding 
2025-04-02T00:46:28.3369978Z language/test_mxfp.py::TestMXFP4Tensor::test_invalid_packing_dimension 
2025-04-02T00:46:28.3370423Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_invalid_packing_dimension 
2025-04-02T00:46:28.3370829Z language/test_mxfp.py::TestMXFP4Tensor::test_empty_tensor 
2025-04-02T00:46:28.3374058Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXFP4Tensor::test_empty_tensor 
2025-04-02T00:46:28.3566036Z language/test_mxfp.py::TestMXScaleTensor::test_positive_values 
2025-04-02T00:46:28.3570554Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXScaleTensor::test_positive_values 
2025-04-02T00:46:28.3579667Z language/test_mxfp.py::TestMXScaleTensor::test_special_values 
2025-04-02T00:46:28.3584923Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXScaleTensor::test_special_values 
2025-04-02T00:46:28.3593176Z language/test_mxfp.py::TestMXScaleTensor::test_e8m0_nan_to_float_nan 
2025-04-02T00:46:28.3597680Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXScaleTensor::test_e8m0_nan_to_float_nan 
2025-04-02T00:46:28.3609523Z language/test_mxfp.py::TestMXScaleTensor::test_random_generation 
2025-04-02T00:46:28.3614505Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXScaleTensor::test_random_generation 
2025-04-02T00:46:28.3757897Z language/test_mxfp.py::TestMXScaleTensor::test_roundtrip[64-128] 
2025-04-02T00:46:28.3758474Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXScaleTensor::test_roundtrip[64-128] 
2025-04-02T00:46:28.3784567Z language/test_mxfp.py::TestMXScaleTensor::test_roundtrip[128-256] 
2025-04-02T00:46:28.3786843Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXScaleTensor::test_roundtrip[128-256] 
2025-04-02T00:46:29.2190134Z language/test_pipeliner.py::test_pipeline_matmul[True] 
2025-04-02T00:46:29.2236929Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e4m3-bf16-4-16-1] 
2025-04-02T00:46:39.3255801Z language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e4m3-fp16-4-16-1] 
2025-04-02T00:46:39.3487657Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float16-1-None0] 
2025-04-02T00:46:39.5185774Z language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float16-1-None1] 
2025-04-02T00:46:39.5412836Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float16-1-None1] 
2025-04-02T00:46:41.5734880Z language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float32-1-None0] 
2025-04-02T00:46:41.5820945Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e4m3-fp16-4-16-1] 
2025-04-02T00:46:49.3263329Z language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-e4m3-4-16-1] 
2025-04-02T00:46:49.3273759Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e2m1-e5m2-4-16-1] 
2025-04-02T00:46:55.1115844Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e2m1-bf16-4-16-1] 
2025-04-02T00:46:55.1173653Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-e4m3-4-16-1] 
2025-04-02T00:46:56.0486653Z language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-e5m2-4-16-1] 
2025-04-02T00:46:56.0599715Z [gw3] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e4m3-bf16-4-16-1] 
2025-04-02T00:47:00.9496762Z language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e4m3-fp16-4-16-1] 
2025-04-02T00:47:00.9507398Z [gw8] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-128-128-True-False-False-e2m1-e4m3-4-16-1] 
2025-04-02T00:47:08.0838235Z language/test_core.py::test_scaled_dot[32-128-128-True-False-False-e2m1-e5m2-4-16-1] 
2025-04-02T00:47:08.0923648Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-e5m2-4-16-1] 
2025-04-02T00:47:14.8609106Z language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-bf16-4-16-1] 
2025-04-02T00:47:14.8852541Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float32-1-None0] 
2025-04-02T00:47:15.0810343Z language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float32-1-None1] 
2025-04-02T00:47:15.1041030Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float32-1-None1] 
2025-04-02T00:47:18.3712244Z language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float32-float32-1-None0] 
2025-04-02T00:47:18.3724306Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e2m1-bf16-4-16-1] 
2025-04-02T00:47:20.8528309Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e2m1-fp16-4-16-1] 
2025-04-02T00:47:20.8618349Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-bf16-4-16-1] 
2025-04-02T00:47:23.5122066Z language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-fp16-4-16-1] 
2025-04-02T00:47:23.5261897Z [gw3] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e4m3-fp16-4-16-1] 
2025-04-02T00:47:25.6734714Z language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e5m2-e4m3-4-16-1] 
2025-04-02T00:47:25.6737429Z [gw0] [ 83%] PASSED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e2m1-fp16-4-16-1] 
2025-04-02T00:47:30.2401890Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-e4m3-4-16-1] 
2025-04-02T00:47:30.2450895Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-e4m3-4-16-1] 
2025-04-02T00:47:32.3944296Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-e5m2-4-16-1] 
2025-04-02T00:47:32.4025300Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e5m2-fp16-4-16-1] 
2025-04-02T00:47:34.5272181Z language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-e4m3-4-16-1] 
2025-04-02T00:47:34.5299008Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-e5m2-4-16-1] 
2025-04-02T00:47:38.3061781Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-bf16-4-16-1] 
2025-04-02T00:47:38.3104188Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-bf16-4-16-1] 
2025-04-02T00:47:41.3276055Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-fp16-4-16-1] 
2025-04-02T00:47:41.3319612Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e4m3-fp16-4-16-1] 
2025-04-02T00:47:45.1473175Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-e4m3-4-16-1] 
2025-04-02T00:47:45.1577107Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-e4m3-4-16-1] 
2025-04-02T00:47:45.7064040Z language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-e5m2-4-16-1] 
2025-04-02T00:47:45.7113740Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-e4m3-4-16-1] 
2025-04-02T00:47:46.5749188Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-e5m2-4-16-1] 
2025-04-02T00:47:46.5986461Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float32-float32-1-None0] 
2025-04-02T00:47:46.8034791Z language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float32-float32-1-None1] 
2025-04-02T00:47:46.8262925Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float32-float32-1-None1] 
2025-04-02T00:47:48.6916456Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-int8-int8-1-None0] 
2025-04-02T00:47:48.6925072Z [gw8] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-128-128-True-False-False-e2m1-e5m2-4-16-1] 
2025-04-02T00:47:49.4630824Z language/test_core.py::test_scaled_dot[32-128-128-True-False-False-e2m1-bf16-4-16-1] 
2025-04-02T00:47:49.4677993Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-e5m2-4-16-1] 
2025-04-02T00:47:53.2389250Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-bf16-4-16-1] 
2025-04-02T00:47:53.2434743Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-bf16-4-16-1] 
2025-04-02T00:47:55.9288237Z language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-fp16-4-16-1] 
2025-04-02T00:47:55.9331802Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-False-e5m2-fp16-4-16-1] 
2025-04-02T00:47:56.2621286Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-e4m3-4-16-1] 
2025-04-02T00:47:56.2781517Z [gw3] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e5m2-e4m3-4-16-1] 
2025-04-02T00:47:57.5065770Z language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e5m2-e5m2-4-16-1] 
2025-04-02T00:47:57.5164194Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-e5m2-4-16-1] 
2025-04-02T00:48:01.6901555Z language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-bf16-4-16-1] 
2025-04-02T00:48:01.6909501Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-e4m3-4-16-1] 
2025-04-02T00:48:07.3920425Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-e5m2-4-16-1] 
2025-04-02T00:48:07.3929059Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-e5m2-4-16-1] 
2025-04-02T00:48:13.1597579Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-bf16-4-16-1] 
2025-04-02T00:48:13.1605930Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-bf16-4-16-1] 
2025-04-02T00:48:14.0946957Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-fp16-4-16-1] 
2025-04-02T00:48:14.0955660Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-bf16-4-16-1] 
2025-04-02T00:48:17.1219120Z language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-fp16-4-16-1] 
2025-04-02T00:48:17.1220531Z [gw0] [ 83%] PASSED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e2m1-fp16-4-16-1] 
2025-04-02T00:48:21.5217953Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-e4m3-4-16-1] 
2025-04-02T00:48:21.5427296Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-int8-int8-1-None0] 
2025-04-02T00:48:21.6965997Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-int8-int8-1-None1] 
2025-04-02T00:48:21.7151051Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-int8-int8-1-None1] 
2025-04-02T00:48:23.2639073Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float16-1-None0] 
2025-04-02T00:48:23.2654010Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-e4m3-4-16-1] 
2025-04-02T00:48:27.9144977Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-e5m2-4-16-1] 
2025-04-02T00:48:27.9296486Z [gw3] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e5m2-e5m2-4-16-1] 
2025-04-02T00:48:28.3762413Z language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e5m2-bf16-4-16-1] 
2025-04-02T00:48:28.3770660Z [gw2] [ 83%] PASSED language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e2m1-fp16-4-16-1] 
2025-04-02T00:48:29.3169271Z language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e4m3-e4m3-4-16-1] 
2025-04-02T00:48:29.3177620Z [gw0] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-e5m2-4-16-1] 
2025-04-02T00:48:30.3572536Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-bf16-4-16-1] 
2025-04-02T00:48:30.3583269Z [gw8] [ 83%] FAILED language/test_core.py::test_scaled_dot[32-128-128-True-False-False-e2m1-bf16-4-16-1] 
2025-04-02T00:48:31.1796767Z language/test_core.py::test_scaled_dot[32-128-128-True-False-False-e2m1-fp16-4-16-1] 
2025-04-02T00:48:31.1944573Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float16-1-None0] 
2025-04-02T00:48:31.4075628Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float16-1-None1] 
2025-04-02T00:48:31.4240119Z [gw7] [ 84%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float16-1-None1] 
2025-04-02T00:48:34.7887906Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float32-1-None0] 
2025-04-02T00:48:34.7896305Z [gw0] [ 84%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-bf16-4-16-1] 
2025-04-02T00:48:39.3575820Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-fp16-4-16-1] 
2025-04-02T00:48:39.3583583Z [gw0] [ 84%] PASSED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e4m3-fp16-4-16-1] 
2025-04-02T00:48:40.8553789Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-e4m3-4-16-1] 
2025-04-02T00:48:40.8640749Z [gw2] [ 84%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e4m3-e4m3-4-16-1] 
2025-04-02T00:48:44.9212638Z language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e4m3-e5m2-4-16-1] 
2025-04-02T00:48:44.9369105Z [gw7] [ 84%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float32-1-None0] 
2025-04-02T00:48:45.2599330Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float32-1-None1] 
2025-04-02T00:48:45.2758798Z [gw7] [ 84%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float16-float32-1-None1] 
2025-04-02T00:48:45.3596694Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float32-float32-1-None0] 
2025-04-02T00:48:45.3608611Z [gw0] [ 84%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-e4m3-4-16-1] 
2025-04-02T00:48:50.8151577Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-e5m2-4-16-1] 
2025-04-02T00:48:50.8160742Z [gw0] [ 84%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-e5m2-4-16-1] 
2025-04-02T00:48:53.4658244Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-bf16-4-16-1] 
2025-04-02T00:48:53.4744252Z [gw2] [ 84%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e4m3-e5m2-4-16-1] 
2025-04-02T00:48:56.4506097Z language/test_core.py::test_scaled_dot[64-32-128-True-False-True-e4m3-bf16-4-16-1] 
2025-04-02T00:48:56.4515696Z [gw0] [ 84%] FAILED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-bf16-4-16-1] 
2025-04-02T00:48:58.3396427Z language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-fp16-4-16-1] 
2025-04-02T00:48:58.3553811Z [gw7] [ 84%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float32-float32-1-None0] 
2025-04-02T00:48:58.5377129Z language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float32-float32-1-None1] 
2025-04-02T00:48:58.5532743Z [gw7] [ 84%] FAILED language/test_core.py::test_dot[1-128-128-64-2-False-True-none-tf32-float32-float32-1-None1] 
2025-04-02T00:48:59.2567296Z language/test_core.py::test_dot[1-128-128-64-2-False-False-none-tf32-int8-int8-1-None0] 
2025-04-02T00:48:59.2714112Z [gw3] [ 84%] FAILED language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e5m2-bf16-4-16-1] 
2025-04-02T00:49:00.3885323Z language/test_core.py::test_scaled_dot[64-64-128-True-True-True-e5m2-fp16-4-16-1] 
2025-04-02T00:49:00.3889279Z [gw0] [ 84%] PASSED language/test_core.py::test_scaled_dot[32-64-128-False-False-True-e5m2-fp16-4-16-1] 
2025-04-02T00:49:04.7308836Z Fatal Python error: Segmentation fault
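To see at a glance where the failures cluster in a log like the one above, the result lines can be tallied per test function. This is only a rough sketch: the regular expression assumes the pytest-xdist line layout seen in this excerpt (`[gwN] [ NN%] OUTCOME nodeid`), and `tally_results` is a hypothetical helper, not part of the repository's tooling.

```python
import re
from collections import Counter

# Matches pytest-xdist result lines such as:
#   2025-04-02T00:46:39.3487657Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[...]
# Lines that only announce a test start, or crash messages, do not match.
RESULT_RE = re.compile(
    r"\[gw\d+\]\s+\[\s*\d+%\]\s+(PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS)\s+(\S+)"
)


def tally_results(log_lines):
    """Count outcomes per test function, ignoring the parametrization suffix."""
    counts = Counter()
    for line in log_lines:
        match = RESULT_RE.search(line)
        if match is None:
            continue
        outcome, nodeid = match.groups()
        # Drop "[param-param-...]" so all parametrizations fall into one bucket.
        test_name = nodeid.split("[", 1)[0]
        counts[(test_name, outcome)] += 1
    return counts


# A few lines copied from the log above.
sample = [
    "2025-04-02T00:46:28.3786843Z [gw9] [ 83%] PASSED language/test_mxfp.py::TestMXScaleTensor::test_roundtrip[128-256] ",
    "2025-04-02T00:46:29.2236929Z [gw2] [ 83%] FAILED language/test_core.py::test_scaled_dot[64-32-128-True-False-False-e4m3-bf16-4-16-1] ",
    "2025-04-02T00:46:39.3487657Z [gw7] [ 83%] FAILED language/test_core.py::test_dot[1-128-128-64-2-True-False-none-tf32-float16-float16-1-None0] ",
    "2025-04-02T00:49:04.7308836Z Fatal Python error: Segmentation fault",
]
counts = tally_results(sample)
for (test_name, outcome), n in sorted(counts.items()):
    print(f"{outcome:7} {n:4} {test_name}")
```

Run over the full job log, a tally like this would make it easier to tell whether the time is going into test_mxfp or into failing test_core.py cases.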

@alexbaden
Contributor Author

alexbaden commented Apr 2, 2025

trying again with new approach (not using skiplist): https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14228029469/job/39872117025

@@ -316,6 +316,8 @@ def fp8e8m0_to_float32(scale):
@pytest.mark.parametrize("NUM_STAGES", [1, 3])
@pytest.mark.parametrize("NUM_WARPS", [4, 8])
@pytest.mark.parametrize("nonKDim", ([0, 16, 32] if is_hip_cdna() else [0]))
@pytest.mark.skipif(is_xpu() and not torch.xpu.get_device_capability()['has_subgroup_matrix_multiply_accumulate'],

@pbchekin pbchekin Apr 2, 2025

Just in case has_subgroup_matrix_multiply_accumulate is not in the capabilities.

Suggested change
- @pytest.mark.skipif(is_xpu() and not torch.xpu.get_device_capability()['has_subgroup_matrix_multiply_accumulate'],
+ @pytest.mark.skipif(is_xpu() and not torch.xpu.get_device_capability().get('has_subgroup_matrix_multiply_accumulate', False),

In general, this decorator is evaluated at import time, which is inconvenient; Python best practice is to minimize import-time logic. IMHO it is better to move this conditional skip into the test body.
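A rough sketch of what moving the check into the test body could look like. The names `xpu_supports_dpas` and `test_mxfp_example` are hypothetical, and `torch.xpu.is_available()` stands in here for the test suite's own `is_xpu()` helper, so treat this as an illustration rather than the actual change:

```python
import pytest


def xpu_supports_dpas(capability: dict) -> bool:
    """Report whether the capability dict advertises subgroup matrix multiply-accumulate.

    .get() treats a missing key as "not supported" instead of raising KeyError.
    """
    return bool(capability.get("has_subgroup_matrix_multiply_accumulate", False))


def test_mxfp_example():
    # Hypothetical test body. The device query happens only when the test
    # runs, so importing or collecting this module never touches the device.
    torch = pytest.importorskip("torch")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        if not xpu_supports_dpas(torch.xpu.get_device_capability()):
            pytest.skip("device lacks subgroup matrix multiply-accumulate")
    # ... the actual test logic would follow here ...
```

Keeping the capability check in a small pure function also makes it testable without a device, for example `xpu_supports_dpas({})` returns `False`.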

@pbchekin
Contributor

pbchekin commented Apr 2, 2025

trying again with new approach (not using skiplist): https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14228029469/job/39872117025

Please note this run uses the default skip list: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14228029469/job/39872117025#step:11:38

We do not (yet) autodetect a skip list based on the selected runner, so I would recommend specifying the skip list explicitly. The workflow "Build and test GPU" requires setting the "Runner label" and "Skip list" inputs, so it is better suited for such runs.

@alexbaden
Contributor Author

ok - what is the syntax for the skiplist parameter? is this documented somewhere?

@alexbaden alexbaden linked an issue Apr 2, 2025 that may be closed by this pull request
@pbchekin
Contributor

pbchekin commented Apr 2, 2025

ok - what is the syntax for the skiplist parameter? is this documented somewhere?

It is not documented yet. For "Build and test" the default value is "default"; for "Build and test GPU" it is empty by default, so we have to specify it. Basically it is the last directory name in the path to a skip list: https://github.com/intel/intel-xpu-backend-for-triton/tree/main/scripts/skiplist. For example: default, a770, lts, xe2 and so on.

@pbchekin
Contributor

pbchekin commented Apr 2, 2025

For example: default, a770, lts, xe2 and so on.

Also, the skip list that was specified can be found in a successful run, in the "Print inputs" step. For example: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14167030129/job/39682308605#step:2:35

@pbchekin
Contributor

pbchekin commented Apr 2, 2025

It is not documented yet.

Specifying a skip list via a workflow input was meant as a temporary measure. The plan was to determine it automatically based on the selected runner. We still want this some day; there is just not enough capacity at the moment.

@alexbaden
Contributor Author

resubmitted: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14231903364/job/39884095474
but the device says "max1100" even though the runner label says "a770". is that expected?

@pbchekin
Contributor

pbchekin commented Apr 3, 2025

resubmitted: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14231903364/job/39884095474 but the device says "max1100" even though the runner label says "a770". is that expected?

Yes, looks good: correct runner and skip list. The "device" input is also temporary (with max1100 as the default value); it will be set by the runner once everything is implemented properly.

@whitneywhtsang
Contributor

We could also use the same approach to skip test_scaled_dot.

@pbchekin
Contributor

pbchekin commented Apr 4, 2025

Running tests in main on A770 on a runner with PYTEST_TIMEOUT=300, just as an experiment. If it finishes, I am hoping to get a list of timed-out tests: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14268229109/job/39995156716

@anmyachev
Contributor

anmyachev commented Apr 8, 2025

@alexbaden you can now just specify test/unit/language/test_matmul.py::test_mxfp in the a770 skip list to skip all test_mxfp tests, since 8a9787f was merged. Let's do this ASAP to unblock the ARL and MTL CI as well.
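To illustrate the idea of one unparametrized entry covering every parametrized case, here is a minimal sketch. How the repository's tooling actually matches skip list entries is not shown in this thread, so `is_skipped` and its matching rule (exact node ID, or entry followed by `[`) are assumptions made purely for illustration:

```python
def is_skipped(nodeid: str, skip_entries: list[str]) -> bool:
    """Return True if nodeid is covered by any skip list entry.

    Assumed rule: an entry matches the exact node ID, or, when the entry has
    no parametrization suffix, every parametrized variant of that test.
    """
    for entry in skip_entries:
        if nodeid == entry or nodeid.startswith(entry + "["):
            return True
    return False


# Hypothetical skip list contents with a single blanket entry.
a770_skiplist = ["test/unit/language/test_matmul.py::test_mxfp"]

# The parameter values in these node IDs are made up for the example.
print(is_skipped("test/unit/language/test_matmul.py::test_mxfp[0-4-1]", a770_skiplist))
print(is_skipped("test/unit/language/test_matmul.py::test_mxfp_other[0]", a770_skiplist))
```

Appending `[` before the prefix check keeps a blanket entry from accidentally swallowing a differently named test that merely shares the same prefix.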

@alexbaden alexbaden force-pushed the alex/update_client_skiplist branch from 843992c to 236116c Compare April 8, 2025 12:56
@alexbaden
Contributor Author

A770 run with blanket test_mxfp skip: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14334102167

@anmyachev
Contributor

A770 run with blanket test_mxfp skip: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14334102167

thanks! FYI: pre-commit checks failed

@alexbaden
Contributor Author

yup - but the test is already running so let's just let it finish.

@pbchekin
Contributor

pbchekin commented Apr 8, 2025

With the skip list from 1cc2bd3 the workflow run took ~2.5h: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14286539636.


Successfully merging this pull request may close these issues.

A770 CI is timing out after several hours
5 participants