Skip to content

Conversation

@dzarukin
Copy link
Contributor

Aligns the implementation with the one from TensorFlow legacy runtime and adjusts the internals as they should be for truly synchronous threadpool.

This action introduces a dependency on abseil-cpp in Eigen threadpool implementation.

Testing binaries received an extra lib dependency across different CMake files. It will be re-used for asynchronous runtime.

@dzarukin dzarukin requested a review from a team as a code owner October 14, 2025 05:16
@github-actions github-actions bot added component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Oct 14, 2025
Copy link
Contributor

@mgouicem mgouicem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add Tenforflow impl of our threadpool_iface in third_party directly, so that:

  • we get rid of this dependency on env
  • we get rid of naive threadpool_iface impl in our testing.

@dzarukin dzarukin force-pushed the dzarukin/update_eigen_threadpool branch from 2255cc7 to ef71960 Compare October 14, 2025 16:55
@dzarukin dzarukin requested a review from a team as a code owner October 14, 2025 16:55
@github-actions github-actions bot added the component:graph-api Codeowner: @oneapi-src/onednn-graph label Oct 14, 2025
@dzarukin dzarukin force-pushed the dzarukin/update_eigen_threadpool branch from ef71960 to eb52f85 Compare October 14, 2025 19:11
@dzarukin
Copy link
Contributor Author

Hi @Sqvid, I want to get your attention on this PR. It changes the implementation of Eigen threadpool in validation and makes in actually synchronous, meaning there'll be no synchronization inside the library and this spot in the original TF threadpool implementation will lead to crashes due to the wrong order of decreasing counter and doing jobs (compare to the same spot in this PR which copies the code almost exactly).

I got insights this is a product of ACL integration. Once this PR lands oneDNN from main can't be used for legacy TF runtime until the threadpool implementation gets updated.

This is just for your information, feel free to pass this information to interested parties. Thanks.

list(APPEND LIBTHREADPOOL absl::synchronization)
message(STATUS "Found abseil-cpp: ${PACKAGE_PREFIX_DIR}")
endif()
endif()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this alongside Eigen findpackage call in cmake/Threadpool.cmake?

Copy link
Contributor

@cfRod cfRod Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got insights this is a product of ACL integration. Once this PR lands oneDNN from main can't be used for legacy TF runtime until the threadpool implementation gets updated.

This is an issue not just for ACL right but also non-ACL paths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be true, yes, the thing is the team we are working with are aware of that and have modifications in private repo for that path (along with async runtime support for XLA), but for your team it wouldn't be the case, so I wanted to drag your attention.

In best case scenario if no plans to touch it, threadpool impl will get updated by v3.11 update (hopefully) and won't appear at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this alongside Eigen findpackage call in cmake/Threadpool.cmake?

Moved, and switched to EXTRA_STATIC_LIBS instead, and added that to register_exe as a single spot. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Dmitry,

have modifications in private repo for that path

I am assuming these are modifications to this spot that will make TensorFlow compatible with the changes in this PR. I am not sure why that would be different for ACL path?

From my understanding, for oneDNN-ACL primitive we dont have async runtime support yet and we will rely on [1, 2] to continue leveraging onednn in XLA, until we have async support in the ACL path.

For TensorFlow, we would expect onednn threadpool to ensure synchronization - I am assuming that we will have separate logic for both TF and XLA within onednn_threadpool.h to support both threadpool. I might be incorrect about this assumption!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is parallel call delivers synchronization as of now until this PR gets merged. Synchronous TF legacy threadpool is marked as asynchronous, and that's how it plays together and remain synchronized.

On our end we change parallel to stop doing any synchronization, and this responsibility is expected to be provided by a threadpool implementation.

From the support perspective, there's no difference for x64 or aarch64/ACL path, it's just there's gonna be a time lag between updating legacy threadpool on TF side by Intel TF team to become truly synchronous and oneDNN main branch state where synchronization is removed.
During that time window ACL is exposed to the bug I mentioned as Intel TF doesn't do anything to support ACL (AFAIK). If you don't do active development on a main branch, sitting tight and waiting for changes to arrive would be fine, no actions required in that case. If you do develop on top of main, you want those lines swapped on your end to have it working properly.

I just wanted to warn you in case you are doing active development over oneDNN main within TF assuming you have your own testing infrastructure and procedures so that expected crushes don't stun your development.

@Sqvid
Copy link
Contributor

Sqvid commented Oct 15, 2025

@dzarukin, thank you for the heads up. I have brought this to the attention of our team. Much appreciated.

@dzarukin dzarukin force-pushed the dzarukin/update_eigen_threadpool branch from eb52f85 to c494f27 Compare October 15, 2025 19:10
@github-actions github-actions bot added component:build and removed component:graph-api Codeowner: @oneapi-src/onednn-graph labels Oct 15, 2025
@dzarukin dzarukin force-pushed the dzarukin/update_eigen_threadpool branch from c494f27 to 1c1ed24 Compare October 27, 2025 03:14
@dzarukin dzarukin merged commit 236d662 into main Oct 27, 2025
29 checks passed
@dzarukin dzarukin deleted the dzarukin/update_eigen_threadpool branch October 27, 2025 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants