
enable distributed cases based on local branch #1538

Open

wants to merge 57 commits into main from daisyden/distributed_2.8

Commits (57)
a2c3f35
enable fsdp cases based on local branch
daisyden Apr 2, 2025
e772d23
add 2025.0 WA
daisyden Apr 3, 2025
cbd34cd
Update distributed UT cases in DDP and PP
PenghuiCheng Apr 3, 2025
d856e95
Fixed pylint error
PenghuiCheng Apr 3, 2025
28a259e
Fixed pylint error
PenghuiCheng Apr 3, 2025
62e9ff7
add distributed ut in CI
zxd1997066 Apr 5, 2025
119d2fb
update if condition
zxd1997066 Apr 5, 2025
5ff20ba
keep_torch_xpu_ops
zxd1997066 Apr 5, 2025
cc472d7
update keyword in distributed ut check
zxd1997066 Apr 6, 2025
60dbd6e
update pytorch build
zxd1997066 Apr 7, 2025
af0bca9
enable fsdp cases based on local branch
daisyden Apr 2, 2025
6885a00
add 2025.0 WA
daisyden Apr 3, 2025
cd013d7
Update distributed UT cases in DDP and PP
PenghuiCheng Apr 3, 2025
cd92f23
Fixed pylint error
PenghuiCheng Apr 3, 2025
413c2b0
Fixed pylint error
PenghuiCheng Apr 3, 2025
ab68eee
add distributed ut in CI
zxd1997066 Apr 5, 2025
c5ec140
update if condition
zxd1997066 Apr 5, 2025
edc9e1b
keep_torch_xpu_ops
zxd1997066 Apr 5, 2025
6c9e99a
update keyword in distributed ut check
zxd1997066 Apr 6, 2025
bdfa853
update pytorch build
zxd1997066 Apr 7, 2025
0e77f30
update if condition
zxd1997066 Apr 7, 2025
faf4a7f
Merge branch 'main' of https://github.com/intel/torch-xpu-ops into da…
daisyden Apr 8, 2025
4076a1a
resolve Artifact name conflict
zxd1997066 Apr 7, 2025
5596ac4
enabled test_sharder.py on xpu
daisyden Apr 8, 2025
2ed7973
Enabled UT for test/distributed/tensor
PenghuiCheng Apr 9, 2025
8b63191
Merge from daisyden/distributed_2.8 branch
PenghuiCheng Apr 9, 2025
5bab858
add FSDP2 cases, improved check-ut.py for summary, do ZE_AFFINITY_MAS…
daisyden Apr 10, 2025
f1b824d
Skip test_schedule_multiproc.py for hang error
PenghuiCheng Apr 10, 2025
2a47caf
Merge branch 'daisyden/distributed_2.8' of https://github.com/intel/t…
PenghuiCheng Apr 10, 2025
f696faa
refine error log for test files without pytest
PenghuiCheng Apr 15, 2025
e9ace29
Merge remote-tracking branch 'origin/daisyden/distributed_2.8' into d…
PenghuiCheng Apr 15, 2025
00326ac
Fixed error for create log file without pytest
PenghuiCheng Apr 15, 2025
59c609e
Skipped cases rasied issue
PenghuiCheng Apr 16, 2025
b5eba76
Merge remote-tracking branch 'origin/daisyden/distributed_2.8' into d…
PenghuiCheng Apr 16, 2025
ff926e3
Merge remote-tracking branch 'origin/main' into daisyden/distributed_2.8
PenghuiCheng Apr 16, 2025
de00feb
Update ut summary
RUIJIEZHONG66166 Apr 16, 2025
f0e1128
align the path
RUIJIEZHONG66166 Apr 16, 2025
4c3651e
update ut
zxd1997066 Apr 16, 2025
6f635a7
add distributed ut summary
RUIJIEZHONG66166 Apr 16, 2025
e9b1ba9
fix lint issue
zxd1997066 Apr 16, 2025
526c0a6
Merge branch 'daisyden/distributed_2.8' of https://github.com/intel/t…
RUIJIEZHONG66166 Apr 16, 2025
14773da
fix lint issue
zxd1997066 Apr 16, 2025
5d9d94b
fix lint issue
zxd1997066 Apr 16, 2025
5197d87
update
zxd1997066 Apr 16, 2025
be64dbe
update
zxd1997066 Apr 16, 2025
0e44577
update
zxd1997066 Apr 16, 2025
d0a0609
comment pdb
zxd1997066 Apr 17, 2025
65d1953
align the path
RUIJIEZHONG66166 Apr 17, 2025
415abe7
Skipped error cases
PenghuiCheng Apr 18, 2025
4bedfb6
merge from daisyden/distributed_2.8
PenghuiCheng Apr 18, 2025
c555fbb
fixed lint error
PenghuiCheng Apr 18, 2025
6d6a75e
fixed lint error
PenghuiCheng Apr 18, 2025
1f451b2
Add some UT cases
PenghuiCheng Apr 24, 2025
d5a84ca
merge from main branch
PenghuiCheng Apr 24, 2025
b2c5875
Add UT cases for _shard and _tools folder
PenghuiCheng Apr 29, 2025
177d7c0
Clean skip list
PenghuiCheng May 5, 2025
4ca9f70
Merge remote-tracking branch 'origin/main' into daisyden/distributed_2.8
PenghuiCheng May 5, 2025
Files changed
299 changes: 223 additions & 76 deletions .github/scripts/check-ut.py
@@ -1,22 +1,47 @@
 import argparse
 import sys
 import os
+import re
 from junitparser import JUnitXml, Error, Failure, Skipped
 
-parser = argparse.ArgumentParser()
-parser.add_argument('junitxml', nargs='+')
+parser = argparse.ArgumentParser(description='Test results analyzer')
+parser.add_argument('input_files', nargs='+', help='JUnit XML files or log files')
 args = parser.parse_args()
 
 failures = []
-suites = []
+summaries = []
 
+error_types = [
+    "RuntimeError",
+    "ValueError",
+    "TypeError",
+    "AttributeError",
+    "KeyError",
+    "IndexError",
+    "ImportError",
+    "AssertionError",
+    "Exception",
+    "OSError",
+    "Failed",
+    "TimeoutError",
+    "asyncio.TimeoutError",
+    "FileNotFoundError",
+    "PermissionError",
+    "NotImplementedError",
+]
+
 def get_classname(case):
-    return ' '.join(case.classname.split())
+    return ' '.join(case.classname.split()) if hasattr(case, 'classname') else case.get('classname', '')
 
 def get_name(case):
+    if isinstance(case, dict):
+        return case.get('name', '')
     return ' '.join(case.name.split())
 
 def get_result(case):
+    if isinstance(case, dict):
+        return case.get('status', 'failed')
+
     result = "passed"
     if case.result:
         if isinstance(case.result[0], Error):
@@ -28,88 +53,210 @@ def get_result(case):
     return result
 
 def get_message(case):
+    if isinstance(case, dict):
+        return case.get('error', '')
+
     if not case.result:
         return ""
-    return f"{case.result[0].message.splitlines()[0]}"
 
-def print_md_row(row, print_header):
+    full_text = case.result[0].text if hasattr(case.result[0], 'text') else case.result[0].message
+    if not full_text:
+        return ""
+
+    error_messages = []
+    capture_next_lines = False
+    indent_level = 0
+
+    for line in full_text.splitlines():
+        stripped_line = line.strip()
+        if not stripped_line:
+            continue
+
+        for error_type in error_types:
+            if stripped_line.startswith(error_type + ": "):
+                error_msg = stripped_line[len(error_type)+2:]
+                error_messages.append(f"{error_type}: {error_msg}")
+                capture_next_lines = True
+                indent_level = 0
+                break
+            elif f"{error_type}:" in stripped_line and "Traceback" not in stripped_line:
+                error_msg = stripped_line.split(f'{error_type}:')[-1].strip()
+                error_messages.append(f"{error_type}: {error_msg}")
+                capture_next_lines = True
+                indent_level = 0
+                break
+
+    return " ; ".join(error_messages) if error_messages else f"{case.result[0].message.splitlines()[0]}"
+
+
+def print_md_row(row, print_header=False):
     if print_header:
-        header = " | ".join([f"{key}" for key, _ in row.items()])
+        header = " | ".join([f"{key}" for key in row.keys()])
         print(f"| {header} |")
-        header = " | ".join(["-"*len(key) for key, _ in row.items()])
+        header = " | ".join(["---"] * len(row))
         print(f"| {header} |")
-    row = " | ".join([f"{value}" for _, value in row.items()])
-    print(f"| {row} |")
+    row_values = " | ".join([f"{value}" for value in row.values()])
+    print(f"| {row_values} |")
 
-def print_cases(cases):
+def print_failures():
+    if not failures:
+        return
+
+    print("### Test Failures")
     print_header = True
-    for case in cases:
-        classname = get_classname(case)
-        name = get_name(case)
-        result = get_result(case)
-        message = get_message(case)
-        row = {
-            'Class name': classname,
-            'Test name': name,
-            'Status': result,
-            'Message': message,
-        }
-        print_md_row(row, print_header)
+    for case in failures:
+        print_md_row({
+            'Class name': get_classname(case),
+            'Test name': get_name(case),
+            'Status': get_result(case),
+            'Message': get_message(case),
+            'Source': case['source'] if isinstance(case, dict) else 'XML'
+        }, print_header)
        print_header = False
 
-def print_suite(suite):
+def parse_log_file(log_file):
+    with open(log_file, encoding='utf-8') as f:
+        content = f.read()
+
+    ut_name = os.path.splitext(os.path.basename(log_file))[0]
+    summary = {
+        'Category': determine_category(ut_name),
+        'UT': ut_name,
+        'Test cases': 0,
+        'Passed': 0,
+        'Skipped': 0,
+        'Failures': 0,
+        'Errors': 0,
+        'Source': 'Log'
+    }
+
+    # Extract test counts
+    test_run_match = re.search(r"Ran (\d+) tests in [\d.]+s", content)
+    if test_run_match:
+        summary['Test cases'] = int(test_run_match.group(1))
+
+    # Extract skipped case number
+    skipped_match = re.search(r"skipped[ =](\d+)", content, re.IGNORECASE)
+    if skipped_match:
+        summary['Skipped'] = int(skipped_match.group(1))
+    else:
+        skipped_match = re.search(r"skipped (\d+) cases?", content, re.IGNORECASE)
+        if skipped_match:
+            summary['Skipped'] = int(skipped_match.group(1))
+
+    # Extract failures
+    failure_blocks = re.findall(r"(FAIL:.*?)(?:\n\n|\n=+\n|\Z)", content, re.DOTALL)
+    exist_test_names = set()
+    failures_number = 0
+
+    for block in failure_blocks:
+        case_match = re.match(r"FAIL: (\w+) \(__mp_main__\.(\w+)\)", block)
+        if not case_match:
+            continue
+
+        test_name = case_match.group(1)
+        if test_name in exist_test_names:
+            continue
+        exist_test_names.add(test_name)
+
+        error_msg = []
+        error_pattern = r"(" + "|".join(error_types) + r"):.*?(?=\n\S|\n\n|\n=+\n|\Z)"
+        error_matches = re.finditer(error_pattern, block, re.DOTALL)
+        if not error_matches and "Traceback" in block:
+            error_msg.append("Unknown error (see traceback)")
+        else:
+            for match in error_matches:
+                error_msg.append(match.group(0).strip())
+
+        failures.append({
+            'classname': ut_name,
+            'name': f"{case_match.group(2)}:{test_name}",
+            'error': " ".join(error_msg),
+            'status': 'failed',
+            'source': 'Log'
+        })
+        failures_number += 1
+
+    if failures_number > summary['Failures']:
+        summary['Failures'] = failures_number
+        summary['Passed'] = summary['Test cases'] - summary['Failures'] - summary['Skipped']
+
+    return summary
+
+def determine_category(ut):
+    if ut == 'op_regression':
+        return 'op_regression'
+    elif ut == 'op_regression_dev1':
+        return 'op_regression_dev1'
+    elif ut == 'op_extended':
+        return 'op_extended'
+    elif 'op_ut' in ut:
+        return 'op_ut'
+    else:
+        return 'unknown'
+
+def process_log_file(log_file):
+    try:
+        summary = parse_log_file(log_file)
+        summaries.append(summary)
+    except Exception as e:
+        print(f"Error processing {log_file}: {e}", file=sys.stderr)
+
+def process_xml_file(xml_file):
+    try:
+        xml = JUnitXml.fromfile(xml_file)
+        ut = os.path.basename(xml_file).split('.')[0]
+        category = determine_category(ut)
+
+        for suite in xml:
+            suite_summary = {
+                'Category': category,
+                'UT': ut,
+                'Test cases': suite.tests,
+                'Passed': suite.tests - suite.skipped - suite.failures - suite.errors,
+                'Skipped': suite.skipped,
+                'Failures': suite.failures,
+                'Errors': suite.errors,
+                'Source': 'XML'
+            }
+            summaries.append(suite_summary)
+
+            for case in suite:
+                if get_result(case) not in ["passed", "skipped"]:
+                    failures.append(case)
+    except Exception as e:
+        print(f"Error processing {xml_file}: {e}", file=sys.stderr)
+
+def print_summary():
+    print("### Results Summary")
     print_header = True
-    for suite in suites:
-        ut = args.junitxml[0]
-        del(args.junitxml[0])
-        ut = os.path.basename(ut).split('.')[0]
-        tests = suite.tests
-        skipped = suite.skipped
-        failures = suite.failures
-        errors = suite.errors
-        if ut == 'op_regression':
-            category = 'op_regression'
-        elif ut == 'op_regression_dev1':
-            category = 'op_regression_dev1'
-        elif ut == 'op_extended':
-            category = 'op_extended'
-        elif 'op_ut' in ut:
-            category = 'op_ut'
-        row = {
-            'Category': category,
-            'UT': ut,
-            'Test cases': tests,
-            'Passed': tests-skipped-failures-errors,
-            'Skipped': skipped,
-            'Failures': failures,
-            'Errors': errors,
-        }
-        print_md_row(row, print_header)
+
+    for summary in summaries:
+        print_md_row({
+            'Category': summary['Category'],
+            'UT': summary['UT'],
+            'Test cases': summary['Test cases'],
+            'Passed': summary['Passed'],
+            'Skipped': summary['Skipped'],
+            'Failures': summary['Failures'],
+            'Errors': summary['Errors'],
+            'Source': summary['Source']
+        }, print_header)
+
         print_header = False
 
-xmls = [ JUnitXml.fromfile(f) for f in args.junitxml ]
-for idx, xml in enumerate(xmls):
-    for suite in xml:
-        suites.append(suite)
-        for case in suite:
-            classname = get_classname(case)
-            name = get_name(case)
-            result = get_result(case)
-            if result not in ["passed", "skipped"]:
-                failures.append(case)
-
-printed = False
-def print_break(needed):
-    if needed:
-        print("")
-
-if failures:
-    print_break(printed)
-    print("### Failures")
-    print_cases(failures)
-    printed = True
-
-print("### Results Summary")
-print_suite(suites)
-
-sys.exit(0)
+def main():
+    for input_file in args.input_files:
+        if input_file.endswith('.log'):
+            process_log_file(input_file)
+        elif input_file.endswith('.xml'):
+            process_xml_file(input_file)
+        else:
+            print(f"Skipping unknown file type: {input_file}", file=sys.stderr)
+
+    print_failures()
+    print_summary()


+if __name__ == "__main__":
+    main()
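
Editor's note: the rewritten script now accepts a mix of JUnit XML and raw log files, e.g. python .github/scripts/check-ut.py op_ut.xml pytorch_distributed_test.log (file names here are illustrative, not taken from the PR). A minimal sketch of what the new log-parsing path extracts, exercising only the regular expressions from parse_log_file above against a fabricated log excerpt:

    # Illustrative only: a made-up unittest log, probed with the same
    # patterns parse_log_file uses. Not part of the PR.
    import re

    sample = """FAIL: test_allreduce (__mp_main__.TestDistBackend)
    RuntimeError: XPU out of memory

    Ran 42 tests in 3.14s
    OK (skipped=5)
    """

    ran = re.search(r"Ran (\d+) tests in [\d.]+s", sample)           # total case count
    skipped = re.search(r"skipped[ =](\d+)", sample, re.IGNORECASE)  # skipped count
    fail = re.match(r"FAIL: (\w+) \(__mp_main__\.(\w+)\)", sample)   # one FAIL block header
    print(ran.group(1), skipped.group(1), fail.group(1))             # -> 42 5 test_allreduce

Failures recovered this way are appended as dicts carrying 'Source': 'Log', which is why get_classname, get_name, get_result, and get_message each grew an isinstance(case, dict) branch alongside the original junitparser path.
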
12 changes: 6 additions & 6 deletions .github/scripts/ut_result_check.sh
@@ -72,14 +72,14 @@ if [[ "${ut_suite}" == 'torch_xpu' ]]; then
     echo -e "[PASS] UT ${ut_suite} test Pass"
   fi
 fi
-if [[ "${ut_suite}" == 'xpu_distributed' ]]; then
-  grep -E "^FAILED|have failures" xpu_distributed_test.log | awk '{print $2}' > ./"${ut_suite}"_xpu_distributed_test_failed.log
-  num_failed_xpu_distributed=$(wc -l < "./${ut_suite}_xpu_distributed_test_failed.log")
+if [[ "${ut_suite}" == 'xpu_distributed' || "${ut_suite}" == 'pytorch_distributed' ]]; then
+  grep -E "^FAILED|have failures" "${ut_suite}"_test.log | awk '{print $2}' > ./"${ut_suite}"_test_failed.log
+  num_failed_distributed=$(wc -l < "./${ut_suite}_test_failed.log")
   echo -e "========================================================================="
-  echo -e "Show Failed cases in ${ut_suite} xpu distributed"
+  echo -e "Show Failed cases in ${ut_suite}"
   echo -e "========================================================================="
-  cat "./${ut_suite}_xpu_distributed_test_failed.log"
-  ((num_failed=num_failed_xpu_distributed))
+  cat "./${ut_suite}_test_failed.log"
+  ((num_failed=num_failed_distributed))
   if [[ $num_failed -gt 0 ]]; then
     echo -e "[ERROR] UT ${ut_suite} test Fail"
     exit 1
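
Editor's note: both suite names now funnel through the same failure filter. A hedged sketch, in Python with fabricated log lines, of what grep -E "^FAILED|have failures" keeps:

    # Fabricated log lines; the pattern mirrors the grep above.
    import re

    log_lines = [
        "FAILED distributed/test_c10d_xccl.py::TestXCCL::test_allreduce",
        "2 suites have failures",
        "PASSED distributed/test_store.py::TestStore::test_set_get",
    ]
    pattern = re.compile(r"^FAILED|have failures")
    matched = [line for line in log_lines if pattern.search(line)]
    print(len(matched))  # -> 2; the awk step then keeps field 2 (the case id on FAILED lines)
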
4 changes: 2 additions & 2 deletions .github/workflows/_linux_build.yml
@@ -163,13 +163,13 @@ jobs:
         if: ${{ ! cancelled() }}
         uses: actions/upload-artifact@v4
         with:
-          name: Torch-XPU-Wheel-${{ github.event.pull_request.number || github.sha }}
+          name: Torch-XPU-Wheel-${{ github.event.pull_request.number || github.sha }}-${{ env.TORCH_COMMIT_ID }}
           path: ${{ github.workspace }}/torch*.whl
       - name: Upload Build Log
         if: ${{ ! cancelled() }}
         uses: actions/upload-artifact@v4
         with:
-          name: Torch-XPU-Build-Log-${{ github.event.pull_request.number || github.sha }}
+          name: Torch-XPU-Build-Log-${{ github.event.pull_request.number || github.sha }}-${{ env.TORCH_COMMIT_ID }}
           path: ${{ github.workspace }}/pytorch_*.log
       - name: Cleanup
         if: always()
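
Editor's note: appending env.TORCH_COMMIT_ID makes each artifact name unique per PyTorch commit, matching the "resolve Artifact name conflict" commit above. With hypothetical values, a build of PR #1538 at commit 1a2b3c4 would upload Torch-XPU-Wheel-1538-1a2b3c4, so a second build against a different PyTorch revision no longer collides with the previously shared Torch-XPU-Wheel-1538 name.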