-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 6632 - Replication init fails with ASAN build #6633
base: main
Are you sure you want to change the base?
Conversation
Bug Description: After the increase of DEFAULT_PBKDF2_ROUNDS to 100_000, bind operation on ASAN build takes more time. On some systems it's more time than the default timeouts for replication agreements, causing replication init to fail. Fix Description: Introduce a new helper function `get_timeout_scale()` and update hardcoded timeouts to use it for scaling. For ASAN builds it returns 2.0 instead of default 1.0. Factor returned by `get_timeout_scale()` can be adjusted using environment variable `DS_TIMEOUT_SCALE`. Fixes: 389ds#6632
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of small issues, - otherwise, looks good!
@@ -2362,13 +2365,16 @@ def ensure_agreement(self, from_instance, to_instance, init=False): | |||
assert dn is not None | |||
assert creds is not None | |||
|
|||
timeout = 5 | |||
timeout = int(timeout * get_timeout_scale()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think round()
will give more correct results.
Like, if DS_TIMEOUT_SCALE=1.6 and timeout
is 6
, then:
- float is 9.6
int(timeout * get_timeout_scale()) == 9
round(timeout * get_timeout_scale()) == 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not meant to be precise, since it applies to many different timeouts. And I used float to also be able to reduce the default timeouts, for example 0.5 to cut the default timeouts in half. For ASAN builds on slower VMs you want to use 2 or 3 just to make the tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm okay with both options. I just thought that the intention was to round the number. If we are okay with just discarding any remainder, I see no issues with using int()
.
scale_factor = 1.0 | ||
if Paths().asan_enabled: | ||
scale_factor = 2.0 | ||
return float(os.getenv('DS_TIMEOUT_SCALE', default=scale_factor)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might worth to validate the env var here. It'll raise a bit cryptic (for user) message otherwise (for example, if DS_TIMEOUT_SCALE=testvar
):
ValueError: could not convert string to float: 'testvar'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What behaviour do you want in case the validation fails? Set the timeout scale to 1.0 or log some error? This env var is not meant to be used by the end user, it's for us to deal with occasional slowness of the test infra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an error that mentions the env var will work the best. It will help us to see exactly where is issue (and it'll be future-proof, in case anyone else will use the internal feature for some reason)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO we should check that it is a float and its value is >= 1/timeout (and just tell that a float >= 0.2 is expected
Even if it is only for tester, a typo is so quick to do and a clear message will avoid us spending too much time to understand whats is happening!
Bug Description:
After the increase of DEFAULT_PBKDF2_ROUNDS to 100_000, bind operation on ASAN build takes more time. On some systems it's more time than the default timeouts for replication agreements, causing replication init to fail.
Fix Description:
Introduce a new helper function
get_timeout_scale()
and update hardcoded timeouts to use it for scaling. For ASAN builds it returns 2.0 instead of default 1.0.Factor returned by
get_timeout_scale()
can be adjusted using environment variableDS_TIMEOUT_SCALE
.Fixes: #6632