Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipa-replica-install randomly hangs #6644

Open
flo-renaud opened this issue Feb 28, 2025 · 11 comments
Open

ipa-replica-install randomly hangs #6644

flo-renaud opened this issue Feb 28, 2025 · 11 comments
Labels
needs triage The issue will be triaged during scrum

Comments

@flo-renaud
Copy link

Issue Description
The installation of an IPA replica sometimes hangs in a LDAP operation.

Package Version and Platform:
Fedora 41
389-ds-base-3.1.2-3.fc41.x86_64
krb5-server-1.21.3-4.fc41.x86_64
freeipa-server-4.13.0.dev202502271132+git-0.fc41.x86_64

Steps to Reproduce
Steps to reproduce the behavior:

  1. install master ipa-server-install -n ufreeipa.test -r UFREEIPA.TEST -p Secret.123 -a Secret.123 --domain-level=1 --dirsrv-config-file /root/ipatests/ipatests_dse.ldif -U --setup-dns --forwarder 10.11.5.19 --auto-reverse
  2. install replica with ipa-replica-install --admin-password Secret.123 --password Secret.123 -U --setup-dns --forwarder 10.11.5.19 --ip-address 10.0.199.157 --realm UFREEIPA.TEST --domain ufreeipa.test --dirsrv-config-file /root/ipatests/ipatests_dse.ldif

Expected results
The replica installation randomly hangs in a LDAP call.

@flo-renaud flo-renaud added the needs triage The issue will be triaged during scrum label Feb 28, 2025
@flo-renaud
Copy link
Author

flo-renaud commented Feb 28, 2025

stacktrace.1740746667.txt

stracktrace obtained with gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd pidof ns-slapd > stacktrace.date +%s.txt 2>&1

@flo-renaud
Copy link
Author

pytest-run.log

@flo-renaud
Copy link
Author

ipareplica-install.log

@flo-renaud
Copy link
Author

The last lines from the access log:
[28/Feb/2025:09:43:47.734546549 +0000] conn=3 op=5 SRCH base="cn=KDC,cn=replica1.ufreeipa.test,cn=masters,cn=ipa,cn=etc,dc=ufreeipa,dc=test" scope=0 filter="(objectClass=*)" attrs="ipaConfigString"
[28/Feb/2025:09:43:47.753759626 +0000] conn=3 op=5 RESULT err=32 tag=101 nentries=0 wtime=0.000185908 optime=0.019216143 etime=0.019396961
[28/Feb/2025:09:43:47.772836327 +0000] conn=3 op=6 ADD dn="cn=KDC,cn=replica1.ufreeipa.test,cn=masters,cn=ipa,cn=etc,dc=ufreeipa,dc=test"

@flo-renaud
Copy link
Author

errors.log

@flo-renaud
Copy link
Author

access.log

@flo-renaud
Copy link
Author

audit.log

@flo-renaud
Copy link
Author

Fom ipa-replica-install.log, we can see the first issue is the failure to restart krb5kdc:

2025-02-28T09:43:01Z DEBUG args=['/bin/systemctl', 'start', 'krb5kdc.service']
2025-02-28T09:43:47Z DEBUG Process finished, return code=1
2025-02-28T09:43:47Z DEBUG stdout=
2025-02-28T09:43:47Z DEBUG stderr=Job for krb5kdc.service failed because a timeout was exceeded.
See "systemctl status krb5kdc.service" and "journalctl -xeu krb5kdc.service" for details.

2025-02-28T09:43:47Z CRITICAL krb5kdc service failed to start
2025-02-28T09:43:47Z DEBUG step duration: krb5kdc __start_instance 45.27 sec

The journal:

Feb 28 09:43:01 replica1.ufreeipa.test systemd[1]: Starting krb5kdc.service - Kerberos 5 KDC...
Feb 28 09:43:47 replica1.ufreeipa.test systemd[1]: krb5kdc.service: start operation timed out. Terminating.
Feb 28 09:43:47 replica1.ufreeipa.test systemd[1]: krb5kdc.service: Failed with result 'timeout'.
Feb 28 09:43:47 replica1.ufreeipa.test systemd[1]: Failed to start krb5kdc.service - Kerberos 5 KDC.

The timestamp matches the last ADD operation which didn't get any answer.

@tbordaz
Copy link
Contributor

tbordaz commented Feb 28, 2025

At first look in the stacktrace threads 8 and 25 are starting txn and are blocked. The others threads looks victims of this hang. Are they (8 and 25) in deadlock or blocked by another thread I do not know but I did not find other thread having started a TXN. What is surprising to me is thread 8 starting a TXN during a search.
DO you know how frequent it is (hang), when it started to happen and if it happened on 389-ds-base-3.0 ?

@flo-renaud
Copy link
Author

We also see some hangs with 389-ds-base-3.0.4-3.fc40.x86_64 on fedora 40. It happens ~ in 15 tests out of 169.

@progier389
Copy link
Contributor

@tbordaz FYI: any lmdb database access (including reading records) requires a txn so it is quite normal that search tries to open one.

Thread 10 has an open txn that probably blocks the other treads trying to open a txn.
I suspect that it deadlocks with thread 8

The weird thing is that dbmdb_open_dbi_from_filename in thread 8 tries to open a txn:
Normally all the database files should be open when starting (or restarting) the backend
and dbmdb_open_dbi_from_filename should just get the dbi index from an hash table ...

I wonder if an index config entry was added and but backend was never (re)indexed
( That could explain why search is trying to access an index but the dbi is not yet open nor existing ...)

Probably a good idea to check unfinished searches in access log to determine which index is involved ....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs triage The issue will be triaged during scrum
Projects
None yet
Development

No branches or pull requests

3 participants