Affected version
0.0.0-dev
Current and expected behavior
We lost data in a demo because NiFi was complaining about not being able to reach ZooKeeper, and the health checks did not notice it.
Simply restarting the pod solved the problem, which the livenessProbe would have done automatically if it had detected the failure.
Currently the livenessProbe looks like this:
```yaml
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: https
  timeoutSeconds: 1
```
While the numbers themselves are arguable (e.g. why have an initialDelaySeconds when we have a startup probe?) and a readinessProbe is missing, the most important point is that a simple TCP check on the port is not enough.
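As a sketch of the structural fixes mentioned above (threshold and period values are illustrative placeholders, not tuned recommendations): drop initialDelaySeconds in favour of the startup probe, and add the missing readinessProbe:

```yaml
# Sketch only: all numeric values are placeholders.
startupProbe:          # covers slow startup, making initialDelaySeconds redundant
  tcpSocket:
    port: https
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:         # the TCP check itself still needs replacing (see Possible solution)
  tcpSocket:
    port: https
  failureThreshold: 3
  periodSeconds: 10
  timeoutSeconds: 1
readinessProbe:        # currently missing entirely
  tcpSocket:
    port: https
  failureThreshold: 3
  periodSeconds: 10
```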
Possible solution
We should instead use the NiFi REST API (https://nifi.apache.org/docs/nifi-docs/rest-api/) to check the actual node health. I fear the most complicated part will be authentication (e.g. adding a static user with an operator-created random secret and putting it in the authentication chain).
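A minimal sketch of what such a probe could look like, assuming curl is available in the container; the token file path, CA path, and port are hypothetical placeholders, and the exact REST endpoint and auth flow would still need to be verified:

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # /nifi-api/controller/cluster reports per-node cluster state, which
      # should surface ZooKeeper connectivity problems; all paths and the
      # port below are assumptions for illustration only.
      - |
        curl --fail --silent --cacert /path/to/ca.crt \
          -H "Authorization: Bearer $(cat /path/to/monitoring-token)" \
          https://localhost:8443/nifi-api/controller/cluster
  failureThreshold: 3
  periodSeconds: 10
  timeoutSeconds: 5
```

An exec probe with curl keeps the check inside the pod, so the operator-created secret only needs to be mounted into the NiFi container rather than exposed to the kubelet.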
Additional context
No response
Environment
No response
Would you like to work on fixing this bug?
yes