Auto restart all instance types if there is no activity #3474

Open
ramonsmits opened this issue Mar 23, 2023 · 6 comments
Comments

@ramonsmits
Member

ramonsmits commented Mar 23, 2023

Describe the feature.

Is your feature related to a problem? Please describe.

Users sporadically report that ingestion has halted. The logs show no errors and offer no insight, but ingestion resumes once the user restarts the instance.

Users detect such issues because their queues reach their quotas, which has other side effects.

Describe the requested feature

The feature could keep track of how much time has passed since the last ingested message. If there has been no activity for, say, 5 minutes, the instance should trigger a termination sequence so that the host restarts it.
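
To make the idea concrete, here is a minimal sketch of such an inactivity check. It is not how the product implements anything. The class name `InactivityWatchdog`, its methods, and the injectable `clock` parameter are all hypothetical, and the 300-second default simply mirrors the 5-minute example above.

```python
import time
from typing import Callable


class InactivityWatchdog:
    """Remembers when the last message was ingested and reports staleness.

    Hypothetical helper for illustration only.
    """

    def __init__(self, idle_limit_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._idle_limit = idle_limit_seconds
        self._clock = clock
        # Treat process start as the last activity so a fresh instance
        # gets a full idle window before it can be flagged.
        self._last_ingested = clock()

    def record_ingestion(self) -> None:
        """Call this from the ingestion path each time a message is processed."""
        self._last_ingested = self._clock()

    def idle_seconds(self) -> float:
        """Seconds elapsed since the last recorded ingestion."""
        return self._clock() - self._last_ingested

    def should_terminate(self) -> bool:
        """True once the idle time has reached the configured limit."""
        return self.idle_seconds() >= self._idle_limit
```

A background timer could poll `should_terminate()` and, when it returns `True`, exit the process with a non-zero code so that the host (service manager, container orchestrator) restarts it. A monotonic clock is used by default so that wall-clock adjustments do not cause false positives.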

It could be that there are actually no messages in the queue and the restart was not required but then at least the instance is running as a fresh process.

As an alternative, the restart could be triggered after a fixed uptime, although that can already be handled in the environment via a scheduled task (for example, restarting the service every day at 02:00 AM).

Optionally, expose the "last message received" timestamp in the JSON result of a /health API.
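
As an illustration of what such a /health result might contain, the sketch below builds a JSON body from a timestamp. The function name `health_payload` and the field names `lastMessageReceivedUtc` and `secondsSinceLastMessage` are assumptions made up for this example; no such endpoint or schema is confirmed in this issue.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def health_payload(last_received: Optional[datetime],
                   now: Optional[datetime] = None) -> str:
    """Build a JSON health body (hypothetical field names).

    `last_received` must be timezone-aware; None means that no message
    has been ingested since the process started.
    """
    now = now or datetime.now(timezone.utc)
    if last_received is None:
        body = {"lastMessageReceivedUtc": None, "secondsSinceLastMessage": None}
    else:
        body = {
            # ISO 8601 in UTC so external monitors can parse it unambiguously.
            "lastMessageReceivedUtc": last_received.astimezone(timezone.utc).isoformat(),
            "secondsSinceLastMessage": (now - last_received).total_seconds(),
        }
    return json.dumps(body)
```

Exposing the elapsed seconds next to the raw timestamp lets an external monitor alert on a simple numeric threshold without doing date arithmetic itself.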

Queue monitoring

This logic could be enhanced by querying the age of the oldest message in the queue (or, alternatively, the length of the queue). If the queue is empty, no activity is expected. If it is not empty, the lack of activity indicates that the message pump is no longer working, the instance is in an unrecoverable state, and it should terminate.
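
The decision described above can be sketched as a small pure function. The names `Verdict` and `evaluate` are hypothetical, the queue length is assumed to come from whatever transport-specific query is available, and the 300-second default is only an example threshold.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"              # messages were ingested recently
    IDLE_EXPECTED = "idle_expected"  # nothing ingested, but nothing is waiting
    TERMINATE = "terminate"          # messages are waiting and none are ingested


def evaluate(idle_seconds: float, queue_length: int,
             idle_limit_seconds: float = 300.0) -> Verdict:
    """Combine ingestion idle time with queue depth (illustrative only)."""
    if idle_seconds < idle_limit_seconds:
        return Verdict.HEALTHY
    if queue_length == 0:
        # An empty queue explains the silence, so no restart is needed.
        return Verdict.IDLE_EXPECTED
    # Work is queued but nothing is being consumed: the pump looks stuck.
    return Verdict.TERMINATE
```

Compared with a pure idle timer, this avoids restarting an instance that is merely quiet, at the cost of needing a way to query queue depth for each supported transport.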

Workaround

  1. Run a script at a fixed interval to stop/start each instance.
  2. Run a script or application that inspects the relevant queues as described above and then stops/starts the corresponding instance.
@YurivanRuler

Do we have any updates on this feature? The issue of the monitoring instance, which causes message consumption to frequently stop, keeps occurring.

@ramonsmits
Member Author

@YurivanRuler We don't have roadmaps but thanks for engaging and letting us know this is important to you.

@Nickxsch

How about 10 months later? :-)

@YurivanRuler

Hi @ramonsmits, do you have any updates on this issue? During peak loads we notice that the audit services stop consuming messages. Presently, we've implemented the workaround by restarting the audit services hourly, but we are seeking a more stable solution. Additionally, during peak moments, the scheduled service restarts could potentially slow down ingestion. Any insights or recommendations would be greatly appreciated.

@lailabougria
Contributor

Hi @YurivanRuler, I'm afraid we still can't provide any timelines on this. Thanks for bringing this back to our attention. Once we start working on the issue, we'll keep you up to date.

@ramonsmits
Member Author

Auto-restart must be able to deal with any orphaned child processes due to:
