
Fix health-check to check actual health #521

@sbernauer

Description

Affected version

0.0.0-dev

Current and expected behavior

We lost data in a demo: NiFi was complaining about not being able to reach ZooKeeper, and the health checks did not notice it.
Simply restarting the pod solved the problem, which is exactly what would have happened automatically if the livenessProbe had detected it.

Currently the livenessProbe looks like this:

    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: https
      timeoutSeconds: 1

While the numbers themselves are arguable (e.g. why have an initialDelaySeconds when we have a startup probe?) and a readinessProbe is missing, the most important point is that a simple TCP check on the port is not enough.
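For illustration, a restructured probe set might look like the following. This is only a sketch of the probe layout (startup probe instead of initialDelaySeconds, plus a readinessProbe); the thresholds are assumptions, the `https` port name is carried over from the current probe, and the tcpSocket check would still need to be replaced by a real health check as described below.

    # Sketch only: thresholds are assumptions, not tested values.
    startupProbe:
      tcpSocket:
        port: https
      failureThreshold: 30    # tolerate up to 5 minutes of startup
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: https
      failureThreshold: 3
      periodSeconds: 10
      timeoutSeconds: 1
    readinessProbe:
      tcpSocket:
        port: https
      failureThreshold: 3
      periodSeconds: 10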

Possible solution

We should instead use the NiFi REST API (https://nifi.apache.org/docs/nifi-docs/rest-api/) to check the actual node health. I fear the most complicated part will be auth (e.g. adding a static user with an operator-created random secret and putting it in the authentication chain).
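A possible shape for such a probe is sketched below. It fetches a bearer token for the operator-created static user via `POST /nifi-api/access/token` and then queries `GET /nifi-api/system-diagnostics`. The mount path `/stackable/healthcheck/`, the port `8443`, the timings, and the choice of endpoint are all assumptions for illustration, not a worked-out implementation.

    # Sketch only: secret mount path, port, and endpoint choice are assumptions.
    livenessProbe:
      exec:
        command:
          - /bin/bash
          - -c
          - |
            # Obtain a bearer token for the static health-check user ...
            TOKEN=$(curl -sk -X POST \
              -d "username=$(cat /stackable/healthcheck/username)&password=$(cat /stackable/healthcheck/password)" \
              https://localhost:8443/nifi-api/access/token)
            # ... then fail the probe unless the REST API answers with 2xx.
            curl -skf -H "Authorization: Bearer ${TOKEN}" \
              https://localhost:8443/nifi-api/system-diagnostics > /dev/null
      failureThreshold: 3
      periodSeconds: 10
      timeoutSeconds: 5

Whether system-diagnostics, the cluster endpoint, or a flow-status endpoint best reflects "can this node actually do work" is part of what needs to be figured out here.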

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

yes
