
enable distributed cases based on local branch #1538

Open · wants to merge 94 commits into base: main

94 commits
a2c3f35
enable fsdp cases based on local branch
daisyden Apr 2, 2025
e772d23
add 2025.0 WA
daisyden Apr 3, 2025
cbd34cd
Update distributed UT cases in DDP and PP
PenghuiCheng Apr 3, 2025
d856e95
Fixed pylint error
PenghuiCheng Apr 3, 2025
28a259e
Fixed pylint error
PenghuiCheng Apr 3, 2025
62e9ff7
add distributed ut in CI
zxd1997066 Apr 5, 2025
119d2fb
update if condition
zxd1997066 Apr 5, 2025
5ff20ba
keep_torch_xpu_ops
zxd1997066 Apr 5, 2025
cc472d7
update keyword in distributed ut check
zxd1997066 Apr 6, 2025
60dbd6e
update pytorch build
zxd1997066 Apr 7, 2025
af0bca9
enable fsdp cases based on local branch
daisyden Apr 2, 2025
6885a00
add 2025.0 WA
daisyden Apr 3, 2025
cd013d7
Update distributed UT cases in DDP and PP
PenghuiCheng Apr 3, 2025
cd92f23
Fixed pylint error
PenghuiCheng Apr 3, 2025
413c2b0
Fixed pylint error
PenghuiCheng Apr 3, 2025
ab68eee
add distributed ut in CI
zxd1997066 Apr 5, 2025
c5ec140
update if condition
zxd1997066 Apr 5, 2025
edc9e1b
keep_torch_xpu_ops
zxd1997066 Apr 5, 2025
6c9e99a
update keyword in distributed ut check
zxd1997066 Apr 6, 2025
bdfa853
update pytorch build
zxd1997066 Apr 7, 2025
0e77f30
update if condition
zxd1997066 Apr 7, 2025
faf4a7f
Merge branch 'main' of https://github.com/intel/torch-xpu-ops into da…
daisyden Apr 8, 2025
4076a1a
resolve Artifact name conflict
zxd1997066 Apr 7, 2025
5596ac4
enabled test_sharder.py on xpu
daisyden Apr 8, 2025
2ed7973
Enabled UT for test/distributed/tensor
PenghuiCheng Apr 9, 2025
8b63191
Merge from daisyden/distributed_2.8 branch
PenghuiCheng Apr 9, 2025
5bab858
add FSDP2 cases, improved check-ut.py for summary, do ZE_AFFINITY_MAS…
daisyden Apr 10, 2025
f1b824d
Skip test_schedule_multiproc.py for hang error
PenghuiCheng Apr 10, 2025
2a47caf
Merge branch 'daisyden/distributed_2.8' of https://github.com/intel/t…
PenghuiCheng Apr 10, 2025
f696faa
refine error log for test files without pytest
PenghuiCheng Apr 15, 2025
e9ace29
Merge remote-tracking branch 'origin/daisyden/distributed_2.8' into d…
PenghuiCheng Apr 15, 2025
00326ac
Fixed error for create log file without pytest
PenghuiCheng Apr 15, 2025
59c609e
Skipped cases raised issue
PenghuiCheng Apr 16, 2025
b5eba76
Merge remote-tracking branch 'origin/daisyden/distributed_2.8' into d…
PenghuiCheng Apr 16, 2025
ff926e3
Merge remote-tracking branch 'origin/main' into daisyden/distributed_2.8
PenghuiCheng Apr 16, 2025
de00feb
Update ut summary
RUIJIEZHONG66166 Apr 16, 2025
f0e1128
align the path
RUIJIEZHONG66166 Apr 16, 2025
4c3651e
update ut
zxd1997066 Apr 16, 2025
6f635a7
add distributed ut summary
RUIJIEZHONG66166 Apr 16, 2025
e9b1ba9
fix lint issue
zxd1997066 Apr 16, 2025
526c0a6
Merge branch 'daisyden/distributed_2.8' of https://github.com/intel/t…
RUIJIEZHONG66166 Apr 16, 2025
14773da
fix lint issue
zxd1997066 Apr 16, 2025
5d9d94b
fix lint issue
zxd1997066 Apr 16, 2025
5197d87
update
zxd1997066 Apr 16, 2025
be64dbe
update
zxd1997066 Apr 16, 2025
0e44577
update
zxd1997066 Apr 16, 2025
d0a0609
comment pdb
zxd1997066 Apr 17, 2025
65d1953
align the path
RUIJIEZHONG66166 Apr 17, 2025
415abe7
Skipped error cases
PenghuiCheng Apr 18, 2025
4bedfb6
merge from daisyden/distributed_2.8
PenghuiCheng Apr 18, 2025
c555fbb
fixed lint error
PenghuiCheng Apr 18, 2025
6d6a75e
fixed lint error
PenghuiCheng Apr 18, 2025
1f451b2
Add some UT cases
PenghuiCheng Apr 24, 2025
d5a84ca
merge from main branch
PenghuiCheng Apr 24, 2025
b2c5875
Add UT cases for _shard and _tools folder
PenghuiCheng Apr 29, 2025
177d7c0
Clean skip list
PenghuiCheng May 5, 2025
4ca9f70
Merge remote-tracking branch 'origin/main' into daisyden/distributed_2.8
PenghuiCheng May 5, 2025
939352d
clean skip list for distributed
PenghuiCheng May 15, 2025
1533b9b
Add comments for skip list
PenghuiCheng May 15, 2025
d74615f
move some issues from skip list to known issues report
PenghuiCheng May 16, 2025
7493676
enable fsdp cases based on local branch
daisyden Apr 2, 2025
7d5e6a9
add 2025.0 WA
daisyden Apr 3, 2025
565d86a
Update distributed UT cases in DDP and PP
PenghuiCheng Apr 3, 2025
9d0ddfe
Fixed pylint error
PenghuiCheng Apr 3, 2025
9cf12d5
Fixed pylint error
PenghuiCheng Apr 3, 2025
8705ec5
add distributed ut in CI
zxd1997066 Apr 5, 2025
4b94ee2
update if condition
zxd1997066 Apr 5, 2025
e52ae48
keep_torch_xpu_ops
zxd1997066 Apr 5, 2025
0fc4430
update pytorch build
zxd1997066 Apr 7, 2025
45dfc65
Enabled UT for test/distributed/tensor
PenghuiCheng Apr 9, 2025
ebbce64
enable fsdp cases based on local branch
daisyden Apr 2, 2025
6e54fb8
add distributed ut in CI
zxd1997066 Apr 5, 2025
5659efd
update if condition
zxd1997066 Apr 5, 2025
596f231
update pytorch build
zxd1997066 Apr 7, 2025
2b958e1
update if condition
zxd1997066 Apr 7, 2025
5d9a340
resolve Artifact name conflict
zxd1997066 Apr 7, 2025
3fd92a3
add FSDP2 cases, improved check-ut.py for summary, do ZE_AFFINITY_MAS…
daisyden Apr 10, 2025
137272c
Skipped error cases
PenghuiCheng Apr 18, 2025
5cd5b16
update ut
zxd1997066 Apr 16, 2025
0789c99
add distributed ut summary
RUIJIEZHONG66166 Apr 16, 2025
4f6bd8d
align the path
RUIJIEZHONG66166 Apr 16, 2025
bc65a51
update
zxd1997066 Apr 16, 2025
a86dc57
update
zxd1997066 Apr 16, 2025
0cdcce8
update
zxd1997066 Apr 16, 2025
e894f69
align the path
RUIJIEZHONG66166 Apr 17, 2025
a286091
fix yml
zxd1997066 May 16, 2025
97504b4
remove invalid case
PenghuiCheng May 21, 2025
8a72f4f
merge from daisyden/distributed_2.8
PenghuiCheng May 21, 2025
5ca55a9
Use python instead of pytest to run test_c10d_functional_native.py
PenghuiCheng May 22, 2025
b311e71
Merge branch 'main' into daisyden/distributed_2.8
daisyden May 23, 2025
6387670
enable libfabric WA
daisyden May 23, 2025
219de35
Add accuracy issue to skip list
PenghuiCheng May 27, 2025
ea3f6f1
Merge branch 'main' into daisyden/distributed_2.8
daisyden May 30, 2025
92b3cad
fix skip_list_dist_local.py typo
zxd1997066 May 31, 2025
2 changes: 1 addition & 1 deletion .github/scripts/check-ut.py
@@ -261,4 +261,4 @@ def main():


 if __name__ == "__main__":
-    main()
\ No newline at end of file
+    main()
18 changes: 9 additions & 9 deletions .github/scripts/ut_result_check.sh
@@ -133,24 +133,24 @@ if [[ "${ut_suite}" == 'torch_xpu' ]]; then
     echo -e "[PASS] UT ${ut_suite} test Pass"
   fi
 fi
-if [[ "${ut_suite}" == 'xpu_distributed' ]]; then
-  grep -E "^FAILED" xpu_distributed_test.log | awk '{print $2}' > ./"${ut_suite}"_xpu_distributed_test_failed.log
-  grep -E "have failures" xpu_distributed_test.log | awk '{print $1}' >> ./"${ut_suite}"_xpu_distributed_test_failed.log
-  compare_and_filter_logs "${ut_suite}"_xpu_distributed_test_failed.log Known_issue.log
-  if [[ -f "${ut_suite}_xpu_distributed_test_failed_filtered.log" ]]; then
-    num_failed_xpu_distributed=$(wc -l < "./${ut_suite}_xpu_distributed_test_failed_filtered.log")
+if [[ "${ut_suite}" == 'xpu_distributed' || "${ut_suite}" == 'pytorch_distributed' ]]; then
+  grep -E "^FAILED" "${ut_suite}"_test.log | awk '{print $2}' > ./"${ut_suite}"_test_failed.log
+  grep -E "have failures" "${ut_suite}"_test.log | awk '{print $1}' >> ./"${ut_suite}"_test_failed.log
+  compare_and_filter_logs "${ut_suite}"_test_failed.log Known_issue.log
+  if [[ -f "${ut_suite}_test_failed_filtered.log" ]]; then
+    num_failed_xpu_distributed=$(wc -l < "./${ut_suite}_test_failed_filtered.log")
   else
-    num_failed_xpu_distributed=$(wc -l < "./${ut_suite}_xpu_distributed_test_failed.log")
+    num_failed_xpu_distributed=$(wc -l < "./${ut_suite}_test_failed.log")
   fi
   echo -e "========================================================================="
   echo -e "Show Failed cases in ${ut_suite} xpu distributed"
   echo -e "========================================================================="
-  cat "./${ut_suite}_xpu_distributed_test_failed.log"
+  cat "./${ut_suite}_test_failed.log"
   ((num_failed=num_failed_xpu_distributed))
   if [[ $num_failed -gt 0 ]]; then
     echo -e "[ERROR] UT ${ut_suite} test Fail"
     exit 1
   else
     echo -e "[PASS] UT ${ut_suite} test Pass"
   fi
 fi
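The failure-collection logic shared by both distributed suites can be exercised in isolation. The sketch below is hedged: the log contents are fabricated for illustration, the file names follow the pattern in the diff, and the `compare_and_filter_logs` known-issue filtering step is deliberately omitted.

```shell
# Hedged, self-contained sketch of the failure-counting logic in
# ut_result_check.sh; the test log below is fabricated, and the
# Known_issue.log filtering step is left out for brevity.
ut_suite="xpu_distributed"

# Fabricated log standing in for the real ${ut_suite}_test.log
printf 'FAILED test_fsdp.py::test_one\nPASSED test_ddp.py::test_two\ntest_pp.py have failures\n' \
  > "${ut_suite}_test.log"

# Collect explicit FAILED cases plus files summarized as "have failures"
grep -E "^FAILED" "${ut_suite}_test.log" | awk '{print $2}' > "${ut_suite}_test_failed.log"
grep -E "have failures" "${ut_suite}_test.log" | awk '{print $1}' >> "${ut_suite}_test_failed.log"

num_failed=$(wc -l < "${ut_suite}_test_failed.log")
if [[ ${num_failed} -gt 0 ]]; then
  echo "[ERROR] UT ${ut_suite} test Fail (${num_failed} failed)"
else
  echo "[PASS] UT ${ut_suite} test Pass"
fi
```

Parameterizing the file names on `${ut_suite}` is what lets the same branch serve both `xpu_distributed` and the newly added `pytorch_distributed` suite.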
33 changes: 29 additions & 4 deletions .github/workflows/_linux_build.yml
@@ -74,8 +74,33 @@ jobs:
        pip install -U pip wheel setuptools
    - name: Checkout torch-xpu-ops
      uses: actions/checkout@v4
      with:
        path: torch-xpu-ops
    - name: Prepare Stock Pytorch
      run: |
        pwd
        which conda && conda clean -ay
        conda remove --all -y -n xpu_build || \
        rm -rf $(dirname ${CONDA_EXE})/../envs/xpu_build
        conda create -n xpu_build python=${{ inputs.python }} cmake=3.28 ninja -y
        source activate xpu_build
        cd ../ && sudo rm -rf pytorch
        pip install requests
        if [[ ${{ inputs.pytorch }} == 'distributed_2.8' ]]; then
          git clone https://github.com/daisyden/pytorch.git pytorch
        else
          git clone https://github.com/pytorch/pytorch pytorch
        fi
        cd pytorch && git checkout $(echo ${{ inputs.pytorch }})
        # apply PRs for stock pytorch
        python ../torch-xpu-ops/.github/scripts/apply_torch_pr.py
        git status && git show -s
        git submodule sync && git submodule update --init --recursive
        if [[ ${{ inputs.keep_torch_xpu_ops }} == 'true' ]]; then
          echo "Don't replace torch-xpu-ops!"
        else
          rm -rf third_party/torch-xpu-ops && cp -r ../torch-xpu-ops third_party/
          # Workaround for torch-xpu-ops ci test
          sed -i "s/checkout --quiet \${TORCH_XPU_OPS_COMMIT}/log -n 1/g" caffe2/CMakeLists.txt
        fi
    - name: Build Pytorch XPU
      run: |
        set -xe -o pipefail
@@ -122,13 +147,13 @@
      if: ${{ ! cancelled() }}
      uses: actions/upload-artifact@v4
      with:
-       name: Torch-XPU-Wheel-${{ github.event.pull_request.number || github.sha }}
+       name: Torch-XPU-Wheel-${{ github.event.pull_request.number || github.sha }}-${{ env.TORCH_COMMIT_ID }}
        path: ${{ github.workspace }}/torch*.whl
    - name: Upload Build Log
      if: ${{ ! cancelled() }}
      uses: actions/upload-artifact@v4
      with:
-       name: Torch-XPU-Build-Log-${{ github.event.pull_request.number || github.sha }}
+       name: Torch-XPU-Build-Log-${{ github.event.pull_request.number || github.sha }}-${{ env.TORCH_COMMIT_ID }}
        path: ${{ github.workspace }}/pytorch_*.log
    - name: Cleanup
      if: always()
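Two of the workflow changes above can be illustrated together: the build clones a fork when the `distributed_2.8` ref is requested, and artifact names gain a commit-id suffix (matching the "resolve Artifact name conflict" commit) so two builds in the same PR do not collide. This is a hedged sketch only; the workflow inputs and `TORCH_COMMIT_ID` are simulated as plain variables, and the commit id value is hypothetical.

```shell
# Hedged sketch of the branch-selection and artifact-naming logic in
# _linux_build.yml; workflow context is simulated with plain variables.
pytorch_ref="distributed_2.8"   # stands in for ${{ inputs.pytorch }}
pr_number="1538"                # stands in for ${{ github.event.pull_request.number }}
torch_commit_id="abc1234"       # hypothetical value for ${{ env.TORCH_COMMIT_ID }}

# The distributed work lives on a fork, so the clone URL depends on the ref
if [[ "${pytorch_ref}" == "distributed_2.8" ]]; then
  repo="https://github.com/daisyden/pytorch.git"
else
  repo="https://github.com/pytorch/pytorch"
fi
echo "clone: ${repo} @ ${pytorch_ref}"

# Suffixing the commit id makes artifact names unique per checked-out commit
artifact_name="Torch-XPU-Wheel-${pr_number}-${torch_commit_id}"
echo "artifact: ${artifact_name}"
```

Without the suffix, re-running the upload step for a different PyTorch commit under the same PR number would attempt to reuse an existing artifact name.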