On Ubuntu 22.04, kernel 6.7.9, ROCm 6.1.0 (RC), Radeon Pro VII.
In brief: when creating a second command-queue (that is not even used at all) one thread starts eating 100% CPU, i.e. doing busy-wait. The performance of the other command queue is impacted as well.
Below, I'm going to compare the "normal" situation that is observed when using a single command-queue, vs what is observed when creating a second command queue (hot-loop).
Using a plain mono-threaded OpenCL app with a single host command-queue (in-order, with profiling enabled), with the queue accessed only from the main thread, this is the thread layout that I see ("the normal")
(gdb) i th
Id Target Id Frame
* 1 Thread 0x7ffff7c5a080 (LWP 142088) "prpll-amd" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd450, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
2 Thread 0x7fffee200640 (LWP 142091) "prpll-amd" __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff797d640 (LWP 142094) "prpll-amd" __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
(gdb) bt
#0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd450, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x7fffffffd450, clockid=0, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555691068, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffffffd450, private=<optimized out>) at ./nptl/futex-internal.c:139
#3 0x00007ffff769c9b0 in do_futex_wait (sem=sem@entry=0x555555691068, abstime=abstime@entry=0x7fffffffd450, clockid=0) at ./nptl/sem_waitcommon.c:111
#4 0x00007ffff769ca53 in __new_sem_wait_slow64 (sem=0x555555691068, abstime=0x7fffffffd450, clockid=0) at ./nptl/sem_waitcommon.c:183
#5 0x00007ffff7ecadd1 in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7ecab2e in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb2298 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb64c2 in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e6c1f9 in clFinish () from /opt/rocm/lib/libamdocl64.so
#10 0x00005555555b4cb3 in finish (q=0x555555940260) at src/clwrap.cpp:296
#11 0x00005555555669ef in Queue::finish (this=0x5555556a2660) at src/Queue.cpp:76
#12 0x000055555559fec6 in Gpu::isPrimePRP (this=0x555555940750, args=..., task=...) at src/Gpu.cpp:1145
#13 0x00005555555bbde3 in Task::execute (this=0x7fffffffd9b0, context=..., args=...) at src/Task.cpp:163
#14 0x00005555555935d2 in threadWorker (args=..., context=..., instance=1) at src/main.cpp:25
#15 0x0000555555593b70 in main (argc=7, argv=0x7fffffffdec8) at src/main.cpp:78
(gdb) t 2
[Switching to thread 2 (Thread 0x7fffee200640 (LWP 142091))]
#0 __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
36 ../sysdeps/unix/sysv/linux/ioctl.c: No such file or directory.
(gdb) bt
#0 __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1 0x00007fffee91a560 in kmtIoctl (fd=4, request=request@entry=3222817548, arg=arg@entry=0x7fffee1f8220) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/libhsakmt.c:13
#2 0x00007fffee913330 in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x7fffee1f82d0, Milliseconds=4294967294, WaitOnAll=false, NumEvents=<optimized out>, Events=0x7fffee1f8350) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:409
#3 hsaKmtWaitOnMultipleEvents_Ext (Events=0x7fffee1f8350, NumEvents=4, WaitOnAll=<optimized out>, Milliseconds=4294967294, event_age=0x7fffee1f82d0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:380
#4 0x00007fffee89c70c in rocr::core::Signal::WaitAny (signal_count=signal_count@entry=9, hsa_signals=hsa_signals@entry=0x7fffe8001150, conds=conds@entry=0x7fffe8000cd0, values=values@entry=0x7fffe80011e0, timeout=timeout@entry=18446744073709551615, wait_hint=<optimized out>, wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=<optimized out>)
at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/signal.cpp:324
#5 0x00007fffee8613f6 in rocr::AMD::hsa_amd_signal_wait_any (signal_count=9, hsa_signals=0x7fffe8001150, conds=0x7fffe8000cd0, values=0x7fffe80011e0, timeout_hint=timeout_hint@entry=18446744073709551615, wait_hint=wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=0x7fffee1f8488)
at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa_ext_amd.cpp:587
#6 0x00007fffee879aca in rocr::core::Runtime::AsyncEventsLoop () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:1473
#7 0x00007fffee82a85b in rocr::os::ThreadTrampoline (arg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/lnx/os_linux.cpp:81
#8 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 5
[Switching to thread 5 (Thread 0x7ffff797d640 (LWP 142094))]
#0 __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
36 in ../sysdeps/unix/sysv/linux/ioctl.c
(gdb) bt
#0 __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1 0x00007fffee91a560 in kmtIoctl (fd=4, request=request@entry=3222817548, arg=arg@entry=0x7ffff79751c0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/libhsakmt.c:13
#2 0x00007fffee913330 in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x7ffff79752b8, Milliseconds=4294967294, WaitOnAll=true, NumEvents=<optimized out>, Events=0x7ffff7975278) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:409
#3 hsaKmtWaitOnMultipleEvents_Ext (Events=0x7ffff7975278, NumEvents=1, WaitOnAll=<optimized out>, Milliseconds=4294967294, event_age=0x7ffff79752b8) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:380
#4 0x00007fffee913adb in hsaKmtWaitOnEvent_Ext (Event=<optimized out>, Milliseconds=<optimized out>, event_age=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:226
#5 0x00007fffee8655d0 in rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9e027bd40, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:243
#6 0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#7 0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#8 0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#12 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#13 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#14 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#15 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
When adding a second command queue (but without using it at all), the thread layout becomes:
(gdb) i th
Id Target Id Frame
* 1 Thread 0x7ffff7c5a080 (LWP 141830) "prpll-amd" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd440, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
2 Thread 0x7fffee200640 (LWP 141833) "prpll-amd" __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
5 Thread 0x7ffff797d640 (LWP 141836) "prpll-amd" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7ffff7975330, op=393, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:57
6 Thread 0x7ffff790f640 (LWP 141837) "prpll-amd" 0x00007fffee865451 in rocr::__rdtsc () at /usr/lib/gcc/x86_64-linux-gnu/11/include/ia32intrin.h:114
(gdb) bt
#0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd440, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x7fffffffd440, clockid=0, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555691068, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffffffd440, private=<optimized out>) at ./nptl/futex-internal.c:139
#3 0x00007ffff769c9b0 in do_futex_wait (sem=sem@entry=0x555555691068, abstime=abstime@entry=0x7fffffffd440, clockid=0) at ./nptl/sem_waitcommon.c:111
#4 0x00007ffff769ca53 in __new_sem_wait_slow64 (sem=0x555555691068, abstime=0x7fffffffd440, clockid=0) at ./nptl/sem_waitcommon.c:183
#5 0x00007ffff7ecadd1 in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7ecab2e in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb2298 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb64c2 in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e6c1f9 in clFinish () from /opt/rocm/lib/libamdocl64.so
#10 0x00005555555b4cff in finish (q=0x5555556a26e0) at src/clwrap.cpp:296
#11 0x00005555555669ef in Queue::finish (this=0x555555944ed0) at src/Queue.cpp:76
#12 0x000055555559ff12 in Gpu::isPrimePRP (this=0x555555940b10, args=..., task=...) at src/Gpu.cpp:1145
#13 0x00005555555bbe2f in Task::execute (this=0x7fffffffd9a0, context=..., args=...) at src/Task.cpp:163
#14 0x00005555555935d2 in threadWorker (args=..., context=..., instance=1) at src/main.cpp:25
#15 0x0000555555593b92 in main (argc=7, argv=0x7fffffffdec8) at src/main.cpp:78
(gdb) t 2
[Switching to thread 2 (Thread 0x7fffee200640 (LWP 141833))]
#0 __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
36 ../sysdeps/unix/sysv/linux/ioctl.c: No such file or directory.
(gdb) bt
#0 __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1 0x00007fffee91a560 in kmtIoctl (fd=4, request=request@entry=3222817548, arg=arg@entry=0x7fffee1f8220) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/libhsakmt.c:13
#2 0x00007fffee913330 in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x7fffee1f82d0, Milliseconds=4294967294, WaitOnAll=false, NumEvents=<optimized out>, Events=0x7fffe80015d0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:409
#3 hsaKmtWaitOnMultipleEvents_Ext (Events=0x7fffe80015d0, NumEvents=4, WaitOnAll=<optimized out>, Milliseconds=4294967294, event_age=0x7fffee1f82d0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:380
#4 0x00007fffee89c70c in rocr::core::Signal::WaitAny (signal_count=signal_count@entry=11, hsa_signals=hsa_signals@entry=0x7fffe80011f0, conds=conds@entry=0x7fffe8000d70, values=values@entry=0x7fffe8001280, timeout=timeout@entry=18446744073709551615, wait_hint=<optimized out>, wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=<optimized out>)
at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/signal.cpp:324
#5 0x00007fffee8613f6 in rocr::AMD::hsa_amd_signal_wait_any (signal_count=11, hsa_signals=0x7fffe80011f0, conds=0x7fffe8000d70, values=0x7fffe8001280, timeout_hint=timeout_hint@entry=18446744073709551615, wait_hint=wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=0x7fffee1f8488)
at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa_ext_amd.cpp:587
#6 0x00007fffee879aca in rocr::core::Runtime::AsyncEventsLoop () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:1473
#7 0x00007fffee82a85b in rocr::os::ThreadTrampoline (arg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/lnx/os_linux.cpp:81
#8 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 3
Unknown thread 3.
(gdb) t 5
[Switching to thread 5 (Thread 0x7ffff797d640 (LWP 141836))]
#0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7ffff7975330, op=393, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:57
57 ./nptl/futex-internal.c: No such file or directory.
(gdb) bt
#0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7ffff7975330, op=393, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x7ffff7975330, clockid=0, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5555559406a8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7ffff7975330, private=<optimized out>) at ./nptl/futex-internal.c:139
#3 0x00007ffff769c9b0 in do_futex_wait (sem=sem@entry=0x5555559406a8, abstime=abstime@entry=0x7ffff7975330, clockid=0) at ./nptl/sem_waitcommon.c:111
#4 0x00007ffff769ca53 in __new_sem_wait_slow64 (sem=0x5555559406a8, abstime=0x7ffff7975330, clockid=0) at ./nptl/sem_waitcommon.c:183
#5 0x00007ffff7ecadd1 in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7ecab2e in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb5d58 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 6
[Switching to thread 6 (Thread 0x7ffff790f640 (LWP 141837))]
#0 0x00007fffee865451 in rocr::__rdtsc () at /usr/lib/gcc/x86_64-linux-gnu/11/include/ia32intrin.h:114
114 return __builtin_ia32_rdtsc ();
(gdb) bt
#0 0x00007fffee865451 in rocr::__rdtsc () at /usr/lib/gcc/x86_64-linux-gnu/11/include/ia32intrin.h:114
#1 rocr::timer::fast_clock::raw_now () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/timer.h:149
#2 rocr::timer::fast_clock::now () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/timer.h:140
#3 rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d803bbe0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:211
#4 0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#5 0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#6 0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#12 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#13 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
The problem is created by the last thread above (6) which is eating 100% CPU being caugh up in a hot loop. To see a few other points in this loop:
#0 0x00007fffee8655bf in rocr::timer::duration_cast<std::chrono::duration<long, std::ratio<1, 1000> >, double, std::ratio<1, 1000000000000> > (d=...) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/timer.h:78
#1 rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d80ac890, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:240
#2 0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#3 0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#4 0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#5 0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
#0 rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d801adf0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /usr/include/c++/11/chrono:521
#1 0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#2 0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#3 0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#4 0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#5 0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#11 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
All the command queues are in-order and with profiling enabled.
The situation is pretty severe, basically precluding the possibility of using more than one command queue.
On Ubuntu 22.04, kernel 6.7.9, ROCm 6.1.0 (RC), Radeon Pro VII.
In brief: when creating a second command-queue (that is not even used at all) one thread starts eating 100% CPU, i.e. doing busy-wait. The performance of the other command queue is impacted as well.
Below, I'm going to compare the "normal" situation that is observed when using a single command-queue, vs what is observed when creating a second command queue (hot-loop).
Using a plain mono-threaded OpenCL app with a single host command-queue (in-order, with profiling enabled), with the queue accessed only from the main thread, this is the thread layout that I see ("the normal")
When adding a second command queue (but without using it at all), the thread layout becomes:
The problem is created by the last thread above (6) which is eating 100% CPU being caugh up in a hot loop. To see a few other points in this loop:
All the command queues are in-order and with profiling enabled.
The situation is pretty severe, basically precluding the possibility of using more than one command queue.