ETA tasks lost on worker TERM restart #533

Open
safwanebrahim-uc opened this issue Oct 27, 2015 · 13 comments

@safwanebrahim-uc

I have clusters of workers listening to a Redis server. When I restart a worker (using Salt Stack), the running tasks complete execution since it's a TERM signal, but all the scheduled tasks (with an ETA set) are lost.
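
For reference, a minimal sketch of this kind of setup (module name, task name, and broker URL are placeholders, not taken from this report):

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def do_work(payload):
    # Placeholder body; the real tasks do application-specific work.
    return payload

# Schedule a task with an ETA (set here via countdown). The worker prefetches
# the message and keeps it unacknowledged until the ETA, so a warm shutdown
# (TERM) is expected to restore it to the broker for redelivery.
do_work.apply_async(args=[{"example": True}], countdown=2 * 3600)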

@ask
Contributor

ask commented Oct 27, 2015

Do you have any steps to reproduce this?

What does the worker log say at shutdown?

@safwanebrahim-uc
Author

I have around 20 instances running as a cluster, consuming from a Redis server located on a separate instance. I use SaltStack to deploy things on these 20 instances, and worker restarts are done along with the deploys.

This is the Celery log from one instance:
[2015-10-27 11:25:11,889: INFO/MainProcess] mingle: searching for neighbors
[2015-10-27 11:25:12,896: INFO/MainProcess] mingle: all alone

It seems the instances do not sync with each other, since the worker restarts happen at the same point in time.

The problem I found is that when we restart a worker during the execution of a task, all the scheduled tasks are lost from the queue.

I use supervisor to daemonize the worker process.

For example: Test1 and Test2 are two consumer instances of a queue. I added two tasks, T1 (countdown=60) and T2 (countdown=86400).

I just restarted the worker using SaltStack, and now both tasks are on Test2:


worker: Warm shutdown (MainProcess)
[2015-10-27 11:25:09,598: WARNING/MainProcess] Restoring 1 unacknowledged message(s).
[2015-10-27 11:25:11,861: WARNING/MainProcess] /usr/local/lib/python2.7/dist-packages/celery/apps/worker.py:161: CDeprecationWarning:
Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers
the ability to execute any command. It's important to secure
your broker from unauthorized access when using pickle, so we think
that enabling pickle should require a deliberate action and not be
the default choice.

If you depend on pickle then you should set a setting to disable this
warning and to be sure that everything will continue working
when you upgrade to Celery 3.2::

CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.

warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))
[2015-10-27 11:25:11,881: INFO/MainProcess] Connected to redis://10.83.73.169:9875/0
[2015-10-27 11:25:11,889: INFO/MainProcess] mingle: searching for neighbors
[2015-10-27 11:25:12,896: INFO/MainProcess] mingle: all alone
[2015-10-27 11:25:12,913: WARNING/MainProcess] [email protected] ready.
[2015-10-27 11:25:13,127: INFO/MainProcess] Received task: taskXXX.Task_XXX[134feb23-b82f-4cd2-8974-2a6b150a9386] eta:[2016-02-04 12:25:58.399643+00:00]
[2015-10-27 11:25:13,129: INFO/MainProcess] Received task: task_XXX.Task_XXX[80c608c3-b7fa-4318-8df9-73e435df514d] eta:[2015-10-27 11:26:19.881562+00:00]


I restarted the supervisor while task T1 (scheduled for 2015-10-27 11:26:19.881562+00:00) was running.

The running task finished execution (I can see this in my database, since it makes some updates to my tables), but task T2 (scheduled to run at 2016-02-04 12:25:58.399643+00:00) is not found in the queue; it was not restored by either Test1 or Test2.
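
A hedged way to check whether the ETA task is still held by either worker after such a restart (assuming app is the Celery application object; this snippet is illustrative and not taken from the report):

insp = app.control.inspect()   # queries the live workers, e.g. Test1 and Test2
print(insp.scheduled())        # ETA/countdown tasks currently held in worker memory
print(insp.reserved())         # prefetched tasks without an ETA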

@ask
Contributor

ask commented Oct 28, 2015

Please start the worker with -l debug to get more logging.
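
(For example, something like celery -A yourapp worker -l debug, where yourapp is a placeholder for the application module.)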

@ask ask closed this as completed Jun 24, 2016
@scythargon

Same here in 2019

@auvipy auvipy reopened this Dec 23, 2019
@nathanielastudillo

Having this issue with RabbitMQ as well

@auvipy
Member

auvipy commented Jan 28, 2020

Which Celery and Kombu versions?

@mkuchen

mkuchen commented Mar 6, 2020

Also having this issue with a task whose ETA is scheduled to occur after a process restart.

kombu==4.6.4
celery==4.3.0

@auvipy auvipy added this to the 4.7 milestone May 5, 2020
@auvipy auvipy modified the milestones: 5.1.0, 5.3 Sep 12, 2021
@gdvalderrama

@auvipy so just to check, is this solved in some version? Or is it an ongoing issue?

@auvipy
Member

auvipy commented Dec 14, 2022

I don't know! Someone needs to verify it in production using the latest development branch, but most probably not.

@torotil

torotil commented May 15, 2023

I'm observing the same issue in production with a very simple setup: single worker, single task, Redis backend, unlimited auto-retry, acks_late. Versions are:

kombu==5.2.4
celery==5.2.7
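
A minimal sketch of a task configured that way (module name, task name, and the exact retry options are assumptions, not taken from this report):

from celery import Celery

app = Celery("eventbus", broker="redis://localhost:6379/0")

@app.task(
    acks_late=True,                      # ack only after the task has finished
    autoretry_for=(Exception,),          # "unlimited auto-retry"
    retry_kwargs={"max_retries": None},
    default_retry_delay=60,              # each retry is re-queued with an ETA
)
def send_event(payload, destination):
    # Application-specific delivery would go here. Any exception triggers a
    # retry, which is re-sent as an ETA task; that is the kind of message
    # reported above as disappearing across a warm shutdown.
    ...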

kombu said

Restoring xxx unacknowledged message(s)

in the logs. But the Redis DB is empty now.

@torotil

torotil commented May 26, 2023

I can reproduce this very easily in my local dev setup (Vagrant box), which might help to track it down.

In [9]: worker.APP.send_task("logcrm_event_bus.send_event", [{}, "impact-stack>test"], countdown=2*3600)
Out[9]: <AsyncResult: 63e995fe-012f-497d-a523-c02211e48362>

In [10]: i.scheduled()
Out[10]: 
{'celery@impactbox': [{'eta': '2023-05-26T10:35:45.630970+00:00',
   'priority': 6,
   'request': {'id': '63e995fe-012f-497d-a523-c02211e48362',
    'name': 'logcrm_event_bus.send_event',
    'args': [{}, 'impact-stack>test'],
    'kwargs': {},
    'type': 'logcrm_event_bus.send_event',
    'hostname': 'celery@impactbox',
    'time_start': None,
    'acknowledged': False,
    'delivery_info': {'exchange': '',
     'routing_key': 'celery',
     'priority': 0,
     'redelivered': None},
    'worker_pid': None}}]}
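
(In the snippets above, worker.APP is the project's Celery application, and i is presumably an inspect handle, e.g. i = worker.APP.control.inspect(); that is an assumption, since the comment does not show how i was created.)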

In another shell I restart the worker service, and the task is gone:

In [11]: i.scheduled()
Out[11]: {'celery@impactbox': []}

Here is the output of the Celery worker service at debug log level: systemctl-debug.txt

@torotil

torotil commented May 26, 2023

Here is the service file used to launch the worker: logcrm-event-bus.service. Maybe that's where the problem lies.

@torotil

torotil commented May 30, 2023

journal-20230530-rabbitmq.txt

I've tried reproducing this with the same setup but using RabbitMQ, and the problem did not occur. Note that the service shuts down successfully in this case too (with the same service file).
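
(The switch was presumably just a matter of pointing the broker at RabbitMQ, e.g. app.conf.broker_url = "amqp://guest:guest@localhost:5672//" instead of a redis:// URL; the exact URL is a placeholder.)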
