I am running ZFS 2.2.7 and started getting system freezes, some kind of whole-system deadlock, on kernel 6.12.28(?).
All I/O appears to stop, processes become unkillable, and the pool is suspended due to I/O errors. On the next reboot there are no traces of failure, so it sounds like something was badly deadlocked.
System information
Distribution Name: Gentoo
Distribution Version: Live
Kernel Version: 6.12.28
Architecture: x86_64
OpenZFS Version: 2.2.7
Describe the problem you're observing
The system becomes unresponsive; I cannot start new programs or kill existing ones. I can still use a terminal and see new output from programs that are still alive. On the serial console I can see kernel messages; see below.
Describe how to reproduce the problem
I don't know yet what triggers this. I'm running a few qemu/kvm VMs, one of them with heavy GPU use; pool I/O is minimal.
I recently upgraded the kernel from 6.12.25, via 6.12.26, to 6.12.28 and started getting the system freezes.
This is the second freeze in a row I have experienced. I rolled back to 6.12.25 for now to see whether the problem persists.
Include any warning/errors/backtraces from the system logs
[67094.137926] INFO: task systemd-journal:4828 blocked for more than 122 seconds.
[67094.137946] Tainted: G T 6.12.26-gentoo-x86_64 #2
[67094.137948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67094.137973] ? prepare_to_wait_event+0x180/0x180
[67094.137978] txg_wait_synced_impl+0xf8/0x140
[67094.137983] txg_wait_synced+0x10/0x40
[67094.137986] dmu_tx_wait+0x28e/0x410
[67094.137989] ? spl_kmem_zalloc+0x106/0x120
[67094.137993] dmu_tx_assign+0x15f/0x500
[67094.137995] zfs_dirty_inode+0xd4/0x300
[67094.138000] ? dmu_read_impl+0x156/0x190
[67094.138003] ? dnode_rele_and_unlock+0x58/0xf0
[67094.138006] ? __mod_memcg_lruvec_state+0xe6/0x1b0
[67094.138010] ? __lruvec_stat_mod_folio+0x97/0x130
[67094.138012] ? folio_add_file_rmap_ptes+0x5d/0xf0
[67094.138015] zpl_dirty_inode+0x2b/0x40
[67094.138017] __mark_inode_dirty+0x56/0x370
[67094.138021] generic_update_time+0x51/0x60
[67094.138023] file_update_time+0x66/0x70
[67094.138025] filemap_page_mkwrite+0x78/0x180
[67094.138028] do_page_mkwrite+0x48/0xc0
[67094.138031] do_wp_page+0xd5/0x1050
[67094.138033] ? __pte_offset_map+0x1b/0x180
[67094.138035] __handle_mm_fault+0x863/0x8d0
[67094.138038] handle_mm_fault+0x187/0x2d0
[67094.138040] do_user_addr_fault+0x172/0x690
[67094.138043] exc_page_fault+0x78/0x170
[67094.138045] asm_exc_page_fault+0x26/0x30
[67094.138049] RIP: 0033:0x7fa2a5d38d70
[67094.138051] RSP: 002b:00007fff64ab2d80 EFLAGS: 00010246
[67094.138053] RAX: 00007fa292456990 RBX: 00005571185cbdf0 RCX: 00007fa2a5deca00
[67094.138054] RDX: 0000000000000000 RSI: 00007fa292456990 RDI: 00005571185cbdf0
[67094.138055] RBP: 00007fff64ab2e10 R08: 00000000000000a4 R09: 00005571185cbe28
[67094.138056] R10: 0000000000000001 R11: 0000f8ac5e6038ae R12: 000000000586bed0
[67094.138057] R13: 00007fff64ab39e0 R14: 0000000000463dc0 R15: 00007fff64ab2f70
[67094.138060] </TASK>
[67094.138094] INFO: task rasdaemon:6060 blocked for more than 122 seconds.
[67094.138095] Tainted: G T 6.12.26-gentoo-x86_64 #2
[67094.138096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67094.138097] task:rasdaemon state:D stack:0 pid:6060 tgid:6060 ppid:1 flags:0x00004002
[67094.138099] Call Trace:
[67094.138100] <TASK>
[67094.138102] __schedule+0x4af/0xb50
[67094.138104] schedule+0x27/0xd0
[67094.138106] schedule_preempt_disabled+0x15/0x20
[67094.138108] __mutex_lock.constprop.0+0x2e1/0x5d0
[67094.138110] txg_wait_synced_impl+0x27/0x140
[67094.138112] txg_wait_synced+0x10/0x40
[67094.138114] dmu_tx_wait+0x28e/0x410
[67094.138116] zfs_create+0x233/0x980
[67094.138118] zpl_create+0xb6/0x1a0
[67094.138120] path_openat+0x1070/0x1310
[67094.138123] do_filp_open+0x9a/0x120
[67094.138126] do_sys_openat2+0x74/0xa0
[67094.138128] __x64_sys_openat+0x49/0x80
[67094.138130] do_syscall_64+0x52/0x120
[67094.138133] entry_SYSCALL_64_after_hwframe+0x50/0x58
[67094.138136] RIP: 0033:0x7f0cfad27407
[67094.138137] RSP: 002b:00007ffec58995c0 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[67094.138139] RAX: ffffffffffffffda RBX: 00000000000a0042 RCX: 00007f0cfad27407
[67094.138142] RDX: 00000000000a0042 RSI: 0000564120411bb0 RDI: 00000000ffffff9c
[67094.138145] RBP: 0000564120411bb0 R08: 0000000000000001 R09: 0000564120411bb0
[67094.138147] R10: 00000000000001a4 R11: 0000000000000202 R12: 00000000000001a4
[67094.138149] R13: 00000000000001a4 R14: 00000000000a0042 R15: 0000000000000042
[67094.138153] </TASK>
[67094.138173] INFO: task systemd-journal:6668 blocked for more than 122 seconds.
[67094.138176] Tainted: G T 6.12.26-gentoo-x86_64 #2
[67094.138177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[67094.138178] task:systemd-jo
[67162.263773] systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!
[67168.513988] systemd[1]: [email protected]: State 'stop-watchdog' timed out. Killing.
[67172.513678] systemd[1]: systemd-journald.service: State 'stop-watchdog' timed out. Killing.
[67252.514031] systemd[1]: systemd-timesyncd.service: State 'stop-watchdog' timed out. Killing.
[67258.764078] systemd[1]: [email protected]: Processes still around after SIGKILL. Ignoring.
[67262.764091] systemd[1]: systemd-journald.service: Processes still around after SIGKILL. Ignoring.
[67342.764461] systemd[1]: systemd-timesyncd.service: Processes still around after SIGKILL. Ignoring.
[67349.014499] systemd[1]: [email protected]: State 'final-sigterm' timed out. Killing.
[67353.014510] systemd[1]: systemd-journald.service: State 'final-sigterm' timed out. Killing.
[67371.944951] brkvm: port 3(tap1) entered disabled state
[67371.945064] tap1 (unregistering): left allmulticast mode
[67371.945069] tap1 (unregistering): left promiscuous mode
[67371.945072] brkvm: port 3(tap1) entered disabled state
[67371.961298] WARNING: Pool 'ssd' has encountered an uncorrectable I/O failure and has been suspended.
[67371.961298]
[67433.014899] systemd[1]: systemd-timesyncd.service: State 'final-sigterm' timed out. Killing.
[67439.264905] systemd[1]: [email protected]: Processes still around after final SIGKILL. Entering failed mode.
[67439.264925] systemd[1]: [email protected]: Failed with result 'watchdog'.
[67439.265483] systemd[1]: [email protected]: Found left-over process 6668 (systemd-journal) in control group while starting unit. Ignoring.
[67439.265493] systemd[1]: [email protected]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[67443.014953] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[67443.014971] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[67443.016567] systemd[1]: systemd-journald.service: Found left-over process 4828 (systemd-journal) in control group while starting unit. Ignoring.
[67443.016573] systemd[1]: systemd-journald.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Not much to go on. The hung tasks are waiting for the transaction to complete, which of course it won't if the pool suspends.
Since the pool suspends, I'm curious what zpool events reports. And zpool status, to check the error counts (it sounds like there aren't many).
The gap between those warnings and the suspend suggests anywhere between 5 and 7 minutes between the I/O being issued and the pool suspending. That points to a timeout somewhere: either the kernel returning the I/O, or a ZFS timeout, maybe the deadman? Anything in dbgmsg either?
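As a sketch, gathering the diagnostics suggested above could look like this on the affected host. The pool name `ssd` is taken from the suspend warning in the log, the dbgmsg path is the standard SPL kstat location, and the `status` variable is just for illustration:

```shell
#!/bin/sh
# Collect diagnostics once the pool has suspended.
# Pool name 'ssd' comes from the kernel warning; adjust for your system.
POOL=ssd
if command -v zpool >/dev/null 2>&1; then
    zpool status -v "$POOL"          # per-vdev read/write/checksum error counts
    zpool events -v                  # recent event history, incl. I/O failures
    cat /proc/spl/kstat/zfs/dbgmsg   # internal ZFS debug log
    status=collected
else
    echo "zpool not found; run this on the affected host"
    status=skipped
fi
```

Capturing this output before rebooting matters here, since the report notes that no traces of the failure survive a reboot.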
The logs look like you're doing a shutdown. Are you? Or is this happening in "normal" operation as well?
What's your setting for zfs_vdev_disk_classic?
Kernel versions raise a slight eyebrow. If this were reproduceable on 6.12.25, but not on 6.12.24, I'd wonder if it's related to #17277. If it were, a variation of #17298 might help.
zfs_vdev_disk_classic is 1
It is also not a shutdown. It is during normal operations! The pool consists of two SSDs and scrub returned clean.
Thanks for pointers, I'll keep an eye on 'zpool events' and 'dbgmsg'.
I believe the problematic kernel is 6.12.28. If anyone experience something like this, let us know :)
I am running ZFS 2.2.7 and started getting system freezes with some kind of all system deadlock on kernels 6.12.28(?)
All IO seem to stop, processes unkillable and the pool suspended due to IO errors. On next reboot - no traces of failures, so sounds like something was badly deadlocked.
System information
Describe the problem you're observing
The system becomes unresponsive, cannot start new programs or kill any. I can still use terminal and see new output from programs which still alive. On serial console I can see kernel messages, see below.
Describe how to reproduce the problem
I don't know yet what triggers this. I'm running a few qemu/kvm VMs with heavy use of GPU in one of them with, pool IO is minimal.
I recently upgraded the kernel from 6.12.25, 6.12.26 to 6.12.28 and started getting the system freezes.
This is the second one in a row I experienced. I rolled back to 6.12.25 for now to see if the problem persists.
Include any warning/errors/backtraces from the system logs
The text was updated successfully, but these errors were encountered: