Update to Kombu 5.5.0 stopped processing SQS messages #2258
I'm seeing the same thing. It's not completely broken, but processing is extremely slow.
Can you please share a reproduction script or steps to reproduce? I'd like to solve this ASAP. Thank you.
Our workers process tasks from four standard SQS queues. Task execution time is fast (~50ms) and we use …
Ok, found it. Throughput: 0.99 tasks per second with …
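The tasks-per-second figures quoted in this thread can be produced with a crude timing loop along these lines (illustrative only; `run_task` is a stand-in for whatever enqueues and waits on a real Celery/SQS task):

```python
import time

def measure_throughput(run_task, n=50):
    """Rough tasks-per-second figure, as quoted informally in this thread."""
    start = time.perf_counter()
    for _ in range(n):
        run_task()  # in a real test this would submit and wait on a task
    elapsed = time.perf_counter() - start
    return n / elapsed
```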
The performance regressions are hard to address on short notice, to be honest.
I think we have to stick to the special pycurl implementation until we have a homegrown alternative that is faster than or equal to pycurl. Some previous attempts also ran into serious performance regressions.
If we can't figure it out in the next 1 or 2 weeks, we should revert this: #2261
Unfortunately we don't have the luxury of time, my friend. I will take care of everything and release a fixed version on Monday. I just want to give it a few days for @spawn-guy to check this out, but per my script above we can confirm the issue is due only to the dep changes.
I released it before the weekend knowing it might have issues. I am prepared to respond quickly. I take responsibility, please don't stress yourself over the weekend 🙏
Any recommendation?
There is no need to feel guilty. The celery code base is hard.
I have to test alternatives; right now I think we should revert back to pycurl, as it was working fine as far as I know.
The main difference between pycurl and urllib3 is in using the available …
@mdrachuk what is your use-case? Does this pr-revert fix your problem?
@Nusnus have you tried tuning the …?
I am running my kombu fork (not the rc versions) and things seem to be processed fine, but not at the speeds mentioned.
Another idea I have is about the SSL certificates and the optional …
Here are my "from-home" Windows tests of the code here: #2258 (comment)
max_clients=10
FORKED_BY_MULTIPROCESSING - didn't do anything
kombu = "*"
max_clients: int = 100
max_clients: int = 1
max_clients: int = 10
max_clients: int = 1
I dunno. Is it Windows? Is it …?
All speeds seem to be the same, with or without urllib3/pycurl, on both kombu versions: latest and 5.4.2. Help!
AWS access keys are picked up from a default location; no extra env vars are set.
Maybe we should highlight the suggestions you shared so people can avoid these performance issues.
@auvipy I'd like to hear more use-case specifics from @mdrachuk and @mgorven: environments, OSes, Python versions, with or without a proxy (as I didn't test that one out). I am running not-high-frequency tasks (1+ seconds) on AWS Elastic Beanstalk py3.11 instances with the urllib3 version of kombu. As for another reason for the slowdown … maybe it's like … I can agree with a slowdown, maybe, but it could be some configuration issue, or an AWS outage, or WAF interference on seeing a new …
As a side note: I am also thinking about …
We will revisit our recent and old experiments for introducing native async support in v6.0, but we have to reach a consensus on this exact issue for now, given the recent changes.
We can also try httpx and see later.
@spawn-guy We're using Python 3.11.11 on Debian Bookworm/12 aarch64. No proxy for SQS.
// On Topic
We will revert back to pycurl and release v5.6.
// Off Topic
I am trying to be as creative as I've ever been in my life to solve the challenges of migrating to asyncio. It became a mission for me. It requires solving a completely different core challenge first, which makes the difficulty extremely high and multi-dimensional, but this only makes it more attractive tbh 😉
EDIT: …
That's my reasoning too. Enabling both by choice, with pycurl as the default, could be an acceptable middle ground. WDYT?
Yes
Sure. I will still try to reproduce the slowdown today/tomorrow; no luck on Windows so far.
Would it be recommended to use 5.4.2 on Python 3.13 to avoid this issue, given that 5.5 is the first version of Kombu to officially support Python 3.13? If not, would it be possible to get a 5.4.3 release with support for Python 3.13 until 5.5+ has stabilized?
It is not so clear to me anymore.
First: @Nusnus how many times have you tried your test?
Second: "our" urllib3 client seems to be only used to FETCH and SEND messages. On …
I've deployed things to Amazon Linux 2023 and am now running more iterations of the same test from my home Windows machine (cloud-to-home delay), testing the speed of urllib3.
The numbers fluctuate too much from …, and 500 tasks seem to work faster than 50.
As I have pycurl deployment problems again, I am thinking about the fixing strategy. I will roll back the deletion of pycurl and its dependencies. The best way would be to introduce choice via Celery configuration, like the pool choice, but I don't have enough time to do this. I also need some advice on package dependencies: we now use the sqs extra, which requires pycurl, and I have pycurl problems on instance deployment.
The CI will use pycurl by default. What do you say?
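One way to make pycurl optional rather than required, sketched under the assumption that both client implementations exist (the names here are illustrative, not Kombu's actual API):

```python
# Prefer pycurl when it is importable, fall back to urllib3 otherwise,
# so deployments where pycurl won't compile still work out of the box.
try:
    import pycurl  # noqa: F401
    DEFAULT_CLIENT = "pycurl"
except ImportError:
    DEFAULT_CLIENT = "urllib3"

def choose_client(override=None):
    """Return the transport name, honouring an explicit user override."""
    return override or DEFAULT_CLIENT
```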
Initial code here: #2269, where I've picked the 3rd route to requirements management: …
So I've deployed the "fallback urllib3" version to my … @mgorven will you be so kind as to test the branch with …?
I have been thinking and commenting with @jmsmkn and.. the thing is.. I can't reliably state that the pycurl→urllib3 switch is the problem here, 'cause my change was not the only one since 5.4.2. On the other hand, I have also enabled SSL connections for SQS, which were disabled instead of "fixed" previously. It looks like I still need to test pycurl on AWS anyway :( and also check whether SSL works correctly.
The branch doesn't work at all: https://gist.github.com/mgorven/f1689323acb1a4e981644dfc9afe87ab This is using rev 77ca118 and pycurl 7.45.6.
@mgorven thanks for the feedback, I'll look into it today. Even though this is the code from …
@mgorven the code in the PR is fixed and tested with pycurl! And it works. @auvipy @Nusnus here are the speed test results. The client is on Windows 11 and py3.11; the client code is seen here: #2258 (comment)
[collapsed in the original comment: results with and without …, plus test task output]
Maybe then we could indeed set it for 5.6.
The …
We can include this to be released with kombu 5.6 and celery 5.5.1. Usually we don't need a major celery version for a major kombu version, as this change is just adjustments.
So I figured out the testing part and.. now I'm investigating the slowdown.
First: I did find a mis-configuration in my code. And yes, it is related to …
Second: I am now testing limiting the number of concurrent connections to a host and reusing existing ones, as the pycurl implementation does (did), instead of creating new unlimited connections but only keeping the last (…)
2b: AWS SQS says there is no limit on concurrent connections for Standard queues, but 300 for the FIFO type (and it can be increased). I wonder if I should increase …
Third: it seems that I can optimize my code and use 1 PoolManager for both proxy and non-proxy connections (and also replace some code of mine that selects a pool). In urllib3 there are a few layers of Pools (sic!): …
Fourth: I should look into …
Extra: the urllib3 version works at …
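The connection-limiting idea above can be sketched with urllib3 directly. This is an assumption about the approach, not Kombu's actual implementation, and the numbers are arbitrary:

```python
import urllib3

# One shared PoolManager that caps connections per host and reuses them,
# mirroring what the pycurl transport did instead of opening unlimited
# new connections and keeping only the last one.
pool = urllib3.PoolManager(
    num_pools=4,   # how many per-host pools to keep cached
    maxsize=10,    # connections kept alive per host
    block=True,    # wait for a free connection instead of opening more
)

def manager_for(proxy_url=None):
    # Proxy and non-proxy traffic can share one selection point, since
    # ProxyManager exposes the same request interface as PoolManager.
    if proxy_url:
        return urllib3.ProxyManager(proxy_url, maxsize=10, block=True)
    return pool
```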
Will come back to you tomorrow as I'm on vacation.
Just adding that we had a similar experience, although seemingly worse, with SQS, Kombu 5.5.0, and Celery 5.5.0. In our case, the full lifecycle of a message+task that normally took ~50ms was now taking a minute or more, long enough that it actually timed out in SQS before deleting the message. Downgrading Celery to 5.4.0 and Kombu to 5.4.2 fixed it for us.
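The downgrade described above amounts to pinning both packages, e.g. in a requirements file (versions taken from this comment):

```text
celery==5.4.0
kombu==5.4.2
```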
This release is completely broken for anyone using SQS in production, as the performance decreases so drastically. This change in performance is not something people expect as part of a minor version release. I would strongly urge the team to revert the change causing this, to avoid more people running into real production incidents from upgrading to this bad version of Kombu.
@ebk46 @soceanainn please try this branch with pycurl, to see if it works and helps with the slowdown:
kombu = { git = "https://github.com/spawn-guy/kombu.git", ref = "feature_optinal_pycurl" }
pycurl = { version = "*", markers = "sys_platform != 'win32'", install_command = "pip install pycurl --global-option='--with-openssl' --compile" }
Status update: as I am struggling with the urllib3 speed-up, I can verify the significant speed decrease and some implementation mistakes. I also figured out a relatively easy way to install pycurl on amzn2023 linux: …
The problem with 3 is that the time might never come for me 😝 as I am still re-implementing it as in 4. Let's vote: …
I prefer we revert the dep change to go “back to normal” and release v5.6 quickly.
Assuming we revert the change completely, we have all the time in the world to get it done right, IMHO. The motivation is that v5.5.x and v5.6.x should be exactly the same except for the pycurl change (reverted back in v5.6). That way we can progress with removing Python 3.8 for v5.7 and “free” main from this issue, while letting users silently skip v5.5 without losing anything at all, keeping the pycurl from v5.4 and below. @spawn-guy Does that make sense to you?
You're right, this is a serious issue, and I understand the frustration 🙏 Also note that @spawn-guy is very active and responsive, so we're in good hands and we'll have a proper solution soon. EDIT: …
Yes, let's just revert it.
Hello.
Tbh I'm confused about the debugging steps for this, because we didn't have any alerts or error logs; only actual usage showed that tasks from SQS weren't being processed anymore.
Downgrading to 5.4.2 solved the issue for now.
Leaving this here for anybody having similar issues to see and maybe comment on details they can find.