ipa-replica-install randomly hangs #6644

flo-renaud · 2025-02-28T12:55:19Z

Issue Description
The installation of an IPA replica sometimes hangs in a LDAP operation.

Package Version and Platform:
Fedora 41
389-ds-base-3.1.2-3.fc41.x86_64
krb5-server-1.21.3-4.fc41.x86_64
freeipa-server-4.13.0.dev202502271132+git-0.fc41.x86_64

Steps to Reproduce
Steps to reproduce the behavior:

install master ipa-server-install -n ufreeipa.test -r UFREEIPA.TEST -p Secret.123 -a Secret.123 --domain-level=1 --dirsrv-config-file /root/ipatests/ipatests_dse.ldif -U --setup-dns --forwarder 10.11.5.19 --auto-reverse
install replica with ipa-replica-install --admin-password Secret.123 --password Secret.123 -U --setup-dns --forwarder 10.11.5.19 --ip-address 10.0.199.157 --realm UFREEIPA.TEST --domain ufreeipa.test --dirsrv-config-file /root/ipatests/ipatests_dse.ldif

Expected results
The replica installation randomly hangs in a LDAP call.

The text was updated successfully, but these errors were encountered:

flo-renaud · 2025-02-28T12:58:32Z

stacktrace.1740746667.txt

stracktrace obtained with gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd pidof ns-slapd > stacktrace.date +%s.txt 2>&1

flo-renaud · 2025-02-28T12:58:45Z

pytest-run.log

flo-renaud · 2025-02-28T12:58:53Z

ipareplica-install.log

flo-renaud · 2025-02-28T13:00:06Z

The last lines from the access log:
[28/Feb/2025:09:43:47.734546549 +0000] conn=3 op=5 SRCH base="cn=KDC,cn=replica1.ufreeipa.test,cn=masters,cn=ipa,cn=etc,dc=ufreeipa,dc=test" scope=0 filter="(objectClass=*)" attrs="ipaConfigString"
[28/Feb/2025:09:43:47.753759626 +0000] conn=3 op=5 RESULT err=32 tag=101 nentries=0 wtime=0.000185908 optime=0.019216143 etime=0.019396961
[28/Feb/2025:09:43:47.772836327 +0000] conn=3 op=6 ADD dn="cn=KDC,cn=replica1.ufreeipa.test,cn=masters,cn=ipa,cn=etc,dc=ufreeipa,dc=test"

flo-renaud · 2025-02-28T13:01:19Z

errors.log

flo-renaud · 2025-02-28T13:01:55Z

access.log

flo-renaud · 2025-02-28T13:02:22Z

audit.log

flo-renaud · 2025-02-28T13:07:43Z

Fom ipa-replica-install.log, we can see the first issue is the failure to restart krb5kdc:

2025-02-28T09:43:01Z DEBUG args=['/bin/systemctl', 'start', 'krb5kdc.service']
2025-02-28T09:43:47Z DEBUG Process finished, return code=1
2025-02-28T09:43:47Z DEBUG stdout=
2025-02-28T09:43:47Z DEBUG stderr=Job for krb5kdc.service failed because a timeout was exceeded.
See "systemctl status krb5kdc.service" and "journalctl -xeu krb5kdc.service" for details.

2025-02-28T09:43:47Z CRITICAL krb5kdc service failed to start
2025-02-28T09:43:47Z DEBUG step duration: krb5kdc __start_instance 45.27 sec

The journal:

Feb 28 09:43:01 replica1.ufreeipa.test systemd[1]: Starting krb5kdc.service - Kerberos 5 KDC...
Feb 28 09:43:47 replica1.ufreeipa.test systemd[1]: krb5kdc.service: start operation timed out. Terminating.
Feb 28 09:43:47 replica1.ufreeipa.test systemd[1]: krb5kdc.service: Failed with result 'timeout'.
Feb 28 09:43:47 replica1.ufreeipa.test systemd[1]: Failed to start krb5kdc.service - Kerberos 5 KDC.

The timestamp matches the last ADD operation which didn't get any answer.

tbordaz · 2025-02-28T13:50:12Z

At first look in the stacktrace threads 8 and 25 are starting txn and are blocked. The others threads looks victims of this hang. Are they (8 and 25) in deadlock or blocked by another thread I do not know but I did not find other thread having started a TXN. What is surprising to me is thread 8 starting a TXN during a search.
DO you know how frequent it is (hang), when it started to happen and if it happened on 389-ds-base-3.0 ?

flo-renaud · 2025-02-28T14:12:00Z

We also see some hangs with 389-ds-base-3.0.4-3.fc40.x86_64 on fedora 40. It happens ~ in 15 tests out of 169.

progier389 · 2025-02-28T16:02:02Z

@tbordaz FYI: any lmdb database access (including reading records) requires a txn so it is quite normal that search tries to open one.

Thread 10 has an open txn that probably blocks the other treads trying to open a txn.
I suspect that it deadlocks with thread 8

The weird thing is that dbmdb_open_dbi_from_filename in thread 8 tries to open a txn:
Normally all the database files should be open when starting (or restarting) the backend
and dbmdb_open_dbi_from_filename should just get the dbi index from an hash table ...

I wonder if an index config entry was added and but backend was never (re)indexed
( That could explain why search is trying to access an index but the dbi is not yet open nor existing ...)

Probably a good idea to check unfinished searches in access log to determine which index is involved ....

flo-renaud added the needs triage The issue will be triaged during scrum label Feb 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ipa-replica-install randomly hangs #6644

ipa-replica-install randomly hangs #6644

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025 •

edited

Loading

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

tbordaz commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

progier389 commented Feb 28, 2025

ipa-replica-install randomly hangs #6644

ipa-replica-install randomly hangs #6644

Comments

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025 • edited Loading

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

tbordaz commented Feb 28, 2025

flo-renaud commented Feb 28, 2025

progier389 commented Feb 28, 2025

flo-renaud commented Feb 28, 2025 •

edited

Loading