Ensure replication running after replication restart #1422

andyedison · 2024-06-11T19:39:38Z

A Pull Request should be associated with an Issue.

We wish to have discussions in Issues. A single issue may be targeted by multiple PRs.
If you're offering a new feature or fixing anything, we'd like to know beforehand in Issues,
and potentially we'll be able to point development in a particular direction.

Related issue:

Further notes in https://github.com/github/gh-ost/blob/master/.github/CONTRIBUTING.md
Thank you! We are open to PRs, but please understand if for technical reasons we are unable to accept each and any PR

Description

This PR adds checks to the `restartReplication` function that ensure replication has started before continuing. Before this check was added, there was a hard-coded 500ms wait, after which the program assumed that the replication threads were started and running (added in #337).

We encountered situations in different environments where this wait time wasn't sufficient. As an experiment, we doubled the wait time and deployed it to our live environments to see if that resolved the issue. It did, so now we are coming back to find a better permanent fix.

Example output from our logs:

2024-06-11 08:17:41 FATAL Replication on <replica-hostname>:3306 is broken: Slave_IO_Running: Connecting, Slave_SQL_Running: Yes. Please make sure replication runs before using gh-ost.
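For illustration, the condition behind that log line can be sketched as a small predicate. This is only a sketch: `replicationRunning` is a hypothetical helper, not a function from gh-ost, and it assumes the two values have already been read from the replica's `Slave_IO_Running` and `Slave_SQL_Running` status columns.

```go
package main

import "fmt"

// replicationRunning is a hypothetical helper (not part of gh-ost).
// It treats replication as healthy only when both threads report "Yes";
// an IO thread that is still "Connecting" counts as not running.
func replicationRunning(ioRunning, sqlRunning string) bool {
	return ioRunning == "Yes" && sqlRunning == "Yes"
}

func main() {
	// The state from the log above: IO thread still connecting.
	fmt.Println(replicationRunning("Connecting", "Yes"))
	// The state gh-ost expects before it continues.
	fmt.Println(replicationRunning("Yes", "Yes"))
}
```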

old description

This PR increases the `startSlavePostWaitMilliseconds` as we are seeing an error when running `gh-ost` in some cloud environments that the `Slave_IO_Running` is `Connecting` rather than `Yes` as expected.

We found an old PR, #337, that describes the issue we're having. As a first step we are doubling the value. If this test is successful, we'll look into making it configurable.

In case this PR introduced Go code changes:

contributed code is using same conventions as original code
script/cibuild returns with no formatting errors, build errors or unit test errors.

timvaillancourt · 2024-08-15T22:24:59Z

@andyedison I wonder if this fix is good enough for all cases. The sleep that exists now is a bit hacky

> in some cloud environments that the Slave_IO_Running is Connecting rather than Yes as expected.

What should we be waiting for? I think you mean the IO thread running. Whatever the answer, it would be safer if gh-ost waited and checked that the desired state has been reached, rather than relying on a `time.Sleep()`.

andyedison · 2024-08-19T15:30:05Z

No, I agree: I doubt this is good enough for all cases. This was a bit of an experiment to see whether increasing the wait would prevent the errors we were seeing in a particular environment over a period of time. I believe it has; we just haven't had time to swing back to this and dig in to find a more permanent solution.

andyedison · 2024-10-17T19:22:38Z

I've updated the PR, along with its title and description, to better represent what this change is. Instead of assuming that replication has resumed successfully after the timeout, the code now checks whether it is running and, if not, waits an interval before checking again, erroring if we exceed a maximum wait time.
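A minimal sketch of that check-then-wait loop, for readers skimming the thread. It is not the code in this PR: `waitForReplication`, the `isRunning` callback, and `errReplicationNotRunning` are hypothetical names chosen for illustration, and only the two constants are taken from the diff in `go/logic/inspect.go`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// These two values mirror the constants in this PR's diff.
const (
	startReplicationPostWait = 250 * time.Millisecond
	startReplicationMaxWait  = 2 * time.Second
)

// errReplicationNotRunning is a hypothetical sentinel error for this sketch.
var errReplicationNotRunning = errors.New("replication did not start within the maximum wait time")

// waitForReplication polls isRunning (a hypothetical callback standing in
// for a real replica status query) until it reports true, sleeping for
// interval between checks. It gives up once another sleep would pass maxWait.
func waitForReplication(isRunning func() (bool, error), interval, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for {
		running, err := isRunning()
		if err != nil {
			return err
		}
		if running {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errReplicationNotRunning
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Stand-in check that reports "running" on the third poll.
	check := func() (bool, error) {
		attempts++
		return attempts >= 3, nil
	}
	err := waitForReplication(check, startReplicationPostWait, startReplicationMaxWait)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Checking before the first sleep means a replica that is already running costs no wait at all, while a slow one gets several chances before the call fails.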

andyedison · 2024-10-17T19:24:09Z

go/logic/inspect.go

@@ -22,7 +22,8 @@ import (
 	"github.com/openark/golib/sqlutils"
 )

-const startSlavePostWaitMilliseconds = 500 * time.Millisecond
+const startReplicationPostWait = 250 * time.Millisecond
+const startReplicationMaxWait = 2 * time.Second


2 seconds was somewhat arbitrary; I could be convinced to adjust this if others have strong opinions.

meiji163 and others added 2 commits February 2, 2024 09:10

dummy commit

adf8a6e

Double the time waiting post replication restart

ef238ee

andyedison requested review from rashiq, meiji163 and timvaillancourt as code owners June 11, 2024 19:39

meiji163 previously approved these changes Jun 11, 2024

andyedison mentioned this pull request Jun 11, 2024

Double the time waiting post replication restart #1421

Open

2 tasks

andyedison had a problem deploying to prod-ae-01 June 11, 2024 19:50 Failure

andyedison deployed to prod-ae-01 June 11, 2024 19:52 Active

andyedison added 2 commits October 14, 2024 16:15

Add logic to ensure replication has started instead of assuming it ha…

1e25e71

…s, with timeout

Revert unintended change

df7fba3

andyedison dismissed meiji163’s stale review via df7fba3 October 14, 2024 16:16

Merge branch 'master' into adjust-wait-post-repl-restart

ea5708e

andyedison changed the title ~~Adjust wait time after replication restart~~ Ensure replication running after replication restart Oct 14, 2024

meiji163 approved these changes Oct 15, 2024
