@zhengjun-xing

This PR includes two fixes:

  1. Fix OSError 24 (“Too many open files”) in the multi-copy benchmark by cherry-picking the latest upstream FBGEMM patch.
    Enables switching to the file_system sharing strategy via:
    export PYTORCH_SHARE_STRATEGY='file_system'

  2. Fix missing libc10.so caused by a recent path change, which breaks functionality from PR #290 ("Fix incorrect SVE2 detection in FBGEMM benchmark").
    Updates the expected path to the correct library location.
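
For reference on the first fix: PyTorch's default file_descriptor sharing strategy on Linux caches a file descriptor for each shared storage, so runs with many copies can exhaust the open-file limit. The file_system strategy identifies shared memory by file name instead. Below is a minimal sketch of how an environment variable could select the strategy. It is not the upstream FBGEMM patch; the helper names `resolve_share_strategy` and `apply_share_strategy` are hypothetical, and only `torch.multiprocessing.set_sharing_strategy` is a real PyTorch call.

```python
import os

# Strategy names accepted by torch.multiprocessing.set_sharing_strategy on Linux.
KNOWN_STRATEGIES = ("file_descriptor", "file_system")


def resolve_share_strategy(environ=os.environ, default="file_descriptor"):
    """Return the strategy requested via PYTORCH_SHARE_STRATEGY (hypothetical helper)."""
    value = environ.get("PYTORCH_SHARE_STRATEGY", "").strip()
    if not value:
        # Variable unset or empty: keep the Linux default.
        return default
    if value not in KNOWN_STRATEGIES:
        raise ValueError(f"unsupported sharing strategy: {value!r}")
    return value


def apply_share_strategy(environ=os.environ):
    """Apply the requested strategy; torch is imported lazily (hypothetical helper)."""
    import torch.multiprocessing as mp

    strategy = resolve_share_strategy(environ)
    mp.set_sharing_strategy(strategy)
    return strategy
```

Under these assumptions, calling `apply_share_strategy()` before any worker processes are started would make `export PYTORCH_SHARE_STRATEGY='file_system'` take effect for the whole run.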

Zhengjun Xing added 2 commits November 19, 2025 08:58
Running benchmarks with a large number of copies can trigger:
 OSError: [Errno 24] Too many open files.

Example command:
(fbgemm_gpu_env)$ ulimit -n 1048576
(fbgemm_gpu_env)$ python ./bench/tbe/tbe_inference_benchmark.py nbit-cpu \
    --num-embeddings=40000000 --bag-size=2 --embedding-dim=96 \
    --batch-size=162 --num-tables=8 --weights-precision=int4 \
    --output-dtype=fp32 --copies=96 --iters=30000

Cherry-pick the latest upstream commit from the FBGEMM main branch to
fix this issue.
This patch enables switching to the file_system sharing strategy via:
  export PYTORCH_SHARE_STRATEGY='file_system'

A recent change to the installation location of libc10.so causes the
following warning and breaks the functionality introduced in
facebookresearch#290:

 [WARN] Linux + aarch64 detected and target binary is tbe_inference_benchmark,
        but libc10.so not found at:
        DCPerf/benchmarks/build/miniconda/envs/fbgemm_build_oss_env/lib/python3.13/site-packages/torch/lib/libc10.so
 [WARN] Running without LD_PRELOAD

This patch updates the expected path to the correct libc10.so location.
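
One way to make this lookup less sensitive to future path changes is to search the environment for libc10.so rather than hard-coding its location. The sketch below is illustrative only and is not the code in this patch: `find_libc10` and `preload_env` are hypothetical helpers, and the corrected path used by the patch is not assumed here.

```python
import glob
import os


def find_libc10(env_root):
    """Return the first torch/lib/libc10.so found under env_root, or None."""
    pattern = os.path.join(env_root, "**", "torch", "lib", "libc10.so")
    # recursive=True lets "**" match any number of intermediate directories.
    matches = sorted(glob.glob(pattern, recursive=True))
    return matches[0] if matches else None


def preload_env(env_root, environ=None):
    """Return a copy of the environment with libc10.so prepended to LD_PRELOAD."""
    env = dict(os.environ if environ is None else environ)
    lib = find_libc10(env_root)
    if lib is None:
        # Wording loosely mirrors the warning quoted above; not the script's real output.
        print("[WARN] libc10.so not found; running without LD_PRELOAD")
        return env
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{lib}:{existing}" if existing else lib
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run` when launching the benchmark binary. A recursive search can be slow on large trees, so a real script might narrow the search root.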
@meta-cla meta-cla bot added the CLA Signed label Nov 19, 2025

meta-codesync bot commented Nov 20, 2025

@YifanYuan3 has imported this pull request. If you are a Meta employee, you can view this in D87577418.

meta-codesync bot pushed a commit that referenced this pull request Nov 20, 2025
Summary:
This PR includes two fixes:

1. Fix OSError 24 (“Too many open files”) in the multi-copy benchmark by cherry-picking the latest upstream FBGEMM patch.
Enables switching to the file_system sharing strategy via:
export PYTORCH_SHARE_STRATEGY='file_system'.

2. Fix missing libc10.so caused by a recent path change, which breaks functionality from PR #290.
Updates the expected path to the correct library location.

Pull Request resolved: #310

Reviewed By: charles-typ

Differential Revision: D87577418

Pulled By: YifanYuan3

fbshipit-source-id: 768acca7dcfeba552aa103b125d9ac8cbd1fd68e
@YifanYuan3 YifanYuan3 closed this Nov 21, 2025