Description
Describe the feature.
Is your feature related to a problem? Please describe.
Sporadically, users report that ingestion has halted. The logs show no issues and provide no insight, but ingestion resumes after the user restarts the instance.
Users detect such issues because their queues reach their quotas, which has other side effects.
Describe the requested feature
The feature could keep track of how much time has passed since the last ingested message. If there has been no activity for, say, 5 minutes, the instance should trigger a termination sequence so that the host restarts the instance.
It could be that there are actually no messages in the queue and the restart was not required, but then at least the instance is running as a fresh process.
As an alternative, the restart could also be triggered after a fixed uptime, although that could be handled in the environment via a scheduled task (for example, restart the service every day at 02:00 AM).
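The idle-timeout check described above could look roughly like the following Python sketch. The class name `IdleWatchdog`, its method names, and the injectable `clock` parameter are all hypothetical and only illustrate the idea; the 300-second default mirrors the 5-minute example.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Tracks the time since the last ingested message (hypothetical helper)."""

    def __init__(self, idle_timeout_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._idle_timeout = idle_timeout_seconds
        self._clock = clock
        # Treat startup as the last activity so that a fresh instance
        # is not terminated immediately.
        self._last_activity = clock()

    def record_message(self) -> None:
        """Call this from the message pump each time a message is ingested."""
        self._last_activity = self._clock()

    def idle_seconds(self) -> float:
        """Seconds elapsed since the last ingested message (or since startup)."""
        return self._clock() - self._last_activity

    def should_terminate(self) -> bool:
        """True once the idle time has reached the configured timeout."""
        return self.idle_seconds() >= self._idle_timeout
```

A periodic timer in the instance could poll `should_terminate()` and, when it returns true, log the reason and exit the process with a non-zero exit code so that the host restarts it. A monotonic clock is used here so that system clock adjustments do not trigger a false restart.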
Optionally expose the "last message received timestamp" in a JSON result on a /health API.
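As an illustration of what such a JSON result might contain, here is a minimal Python sketch. The function name `build_health_payload` and the field name `lastMessageReceivedUtc` are assumptions made for this example, not an existing API.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_health_payload(last_message_received: Optional[datetime]) -> str:
    """Serialize a hypothetical /health response body as JSON.

    Expects a timezone-aware datetime, or None if no message
    has been received since the instance started.
    """
    timestamp = (
        last_message_received.astimezone(timezone.utc).isoformat()
        if last_message_received is not None
        else None
    )
    return json.dumps({"lastMessageReceivedUtc": timestamp})
```

Reporting `null` before the first message lets an external monitor tell "nothing received yet" apart from "received long ago".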
Queue monitoring
This logic could be enhanced by querying the age of the oldest message in the queue (or, alternatively, the length of the queue). If the queue is empty, no activity is expected. Otherwise, the lack of activity indicates that the message pump is no longer working; the instance is in an unrecoverable state and should terminate.
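The decision described above could be sketched in Python as follows. The function name `pump_is_stalled` and its parameters are hypothetical; how the age of the oldest message is obtained depends on the transport and is not shown.

```python
from typing import Optional


def pump_is_stalled(idle_seconds: float,
                    oldest_message_age_seconds: Optional[float],
                    idle_timeout_seconds: float = 300.0) -> bool:
    """Return True when the queue has waiting work but nothing was ingested.

    oldest_message_age_seconds is None when the queue is empty.
    """
    if oldest_message_age_seconds is None:
        # Empty queue: inactivity is expected, so no restart is needed.
        return False
    # A message has been waiting at least as long as the timeout, and the
    # instance has not ingested anything for at least that long either.
    return (idle_seconds >= idle_timeout_seconds
            and oldest_message_age_seconds >= idle_timeout_seconds)
```

Requiring both conditions avoids a restart when the instance has been idle for a while and a message has only just arrived in the queue.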
Workaround
- Run a script at a fixed interval to stop/start each instance
- Run a script or application that inspects the relevant queues as described above and then stops/starts the corresponding instance