-
Notifications
You must be signed in to change notification settings - Fork 1.1k
src, tests: align synchronous Eigen threadpool implementation #4135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to add Tenforflow impl of our threadpool_iface in third_party directly, so that:
- we get rid of this dependency on env
- we get rid of naive threadpool_iface impl in our testing.
2255cc7 to
ef71960
Compare
ef71960 to
eb52f85
Compare
|
Hi @Sqvid, I want to get your attention on this PR. It changes the implementation of Eigen threadpool in validation and makes in actually synchronous, meaning there'll be no synchronization inside the library and this spot in the original TF threadpool implementation will lead to crashes due to the wrong order of decreasing counter and doing jobs (compare to the same spot in this PR which copies the code almost exactly). I got insights this is a product of ACL integration. Once this PR lands oneDNN from main can't be used for legacy TF runtime until the threadpool implementation gets updated. This is just for your information, feel free to pass this information to interested parties. Thanks. |
tests/CMakeLists.txt
Outdated
| list(APPEND LIBTHREADPOOL absl::synchronization) | ||
| message(STATUS "Found abseil-cpp: ${PACKAGE_PREFIX_DIR}") | ||
| endif() | ||
| endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put this alongside Eigen findpackage call in cmake/Threadpool.cmake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got insights this is a product of ACL integration. Once this PR lands oneDNN from main can't be used for legacy TF runtime until the threadpool implementation gets updated.
This is an issue not just for ACL right but also non-ACL paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be true, yes, the thing is the team we are working with are aware of that and have modifications in private repo for that path (along with async runtime support for XLA), but for your team it wouldn't be the case, so I wanted to drag your attention.
In best case scenario if no plans to touch it, threadpool impl will get updated by v3.11 update (hopefully) and won't appear at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put this alongside Eigen findpackage call in cmake/Threadpool.cmake?
Moved, and switched to EXTRA_STATIC_LIBS instead, and added that to register_exe as a single spot. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Dmitry,
have modifications in private repo for that path
I am assuming these are modifications to this spot that will make TensorFlow compatible with the changes in this PR. I am not sure why that would be different for ACL path?
From my understanding, for oneDNN-ACL primitive we dont have async runtime support yet and we will rely on [1, 2] to continue leveraging onednn in XLA, until we have async support in the ACL path.
For TensorFlow, we would expect onednn threadpool to ensure synchronization - I am assuming that we will have separate logic for both TF and XLA within onednn_threadpool.h to support both threadpool. I might be incorrect about this assumption!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is parallel call delivers synchronization as of now until this PR gets merged. Synchronous TF legacy threadpool is marked as asynchronous, and that's how it plays together and remain synchronized.
On our end we change parallel to stop doing any synchronization, and this responsibility is expected to be provided by a threadpool implementation.
From the support perspective, there's no difference for x64 or aarch64/ACL path, it's just there's gonna be a time lag between updating legacy threadpool on TF side by Intel TF team to become truly synchronous and oneDNN main branch state where synchronization is removed.
During that time window ACL is exposed to the bug I mentioned as Intel TF doesn't do anything to support ACL (AFAIK). If you don't do active development on a main branch, sitting tight and waiting for changes to arrive would be fine, no actions required in that case. If you do develop on top of main, you want those lines swapped on your end to have it working properly.
I just wanted to warn you in case you are doing active development over oneDNN main within TF assuming you have your own testing infrastructure and procedures so that expected crushes don't stun your development.
|
@dzarukin, thank you for the heads up. I have brought this to the attention of our team. Much appreciated. |
eb52f85 to
c494f27
Compare
c494f27 to
1c1ed24
Compare
Aligns the implementation with the one from TensorFlow legacy runtime and adjusts the internals as they should be for truly synchronous threadpool.
This action introduces a dependency on abseil-cpp in Eigen threadpool implementation.
Testing binaries received an extra lib dependency across different CMake files. It will be re-used for asynchronous runtime.