Doubled buckets. Part 0 (rebalancer batching + ref inconsistencies during master switch)#649

Open
Serpentian wants to merge 12 commits into
tarantool:masterfrom
Serpentian:doubled-buckets-part-0

Conversation

Collaborator

@Serpentian Serpentian commented May 13, 2026

Closes #573
Closes #351
Closes #528

Serpentian and others added 6 commits April 13, 2026 15:50
Luatest 1.4.2 broke compatibility with Tarantool 1.10. See
tarantool/luatest#453 for details. Let's use 1.4.1 for now.

NO_DOC=ci
NO_TEST=ci
Before this patch we woke up the GC service instead of the recovery
service in `bucket_recovery_pause`. This could lead to longer test
execution times. Fix it by changing `garbage_collector_wakeup` to
`recovery_wakeup`.

NO_TEST=refactoring
NO_DOC=refactoring
Before this patch the recovery service called the functions that
determine whether a bucket can be recovered one by one in
`recovery_step_by_type`. Moreover, the function was called from
`recovery_service_f` with a lot of code duplication. In the following
patches the recovery service logic will become more complex. In order
to preserve readability it was decided to:

  1) Use a for-loop over the checks instead of checking the buckets'
     recoverability one by one.

  2) Join the logic of saving recovered bucket ids and updating the
     `_bucket` space into one separate function, `bucket_recover`.

The functions `bucket_recover` and `bucket_status_has_destination` will
be used in the future commits.

Needed for tarantool#214
Needed for tarantool#573

NO_TEST=refactoring
NO_DOC=refactoring

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
This commit adds a new READONLY bucket status. It's allowed to keep old
RW refs, created before the bucket became READONLY, on top of it;
however, creating new ones is prohibited. This is the state in which we
wait for RW requests to end before sending the bucket. Later we'll
introduce waiting on replicas too. For details, please check out the
following RFC: tarantool#620.

The READONLY state replaces the manual rw_lock-ing of buckets during
bucket sending, so this code can be dropped. The code for
`bucket_refrw_touch` can also be dropped: before this commit a ref was
always created, because there was no way to set `rw_lock` on a bucket
other than via a ref. Now we have a persisted `rw_lock` (the READONLY
bucket state) and don't have to create the ref forcefully anymore.
However, we still must check that the bucket is ok while sending it, so
during the preparation phase of bucket send the call was replaced with
`bucket_check_state()`.

The test from bucket_ref is removed, since it checks that the rebalancer
continues to work when the storage has a bucket in the READONLY status,
which is no longer true. The rebalancer must wait for zero SENDING,
RECEIVING and READONLY buckets in the cluster, and only after that may
it try rebalancing one more time.
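The core of the READONLY semantics above (old RW refs may drain, new ones are refused) can be sketched as a tiny self-contained state machine. All names here (`bucket_ref_rw`, the `buckets` table, the status strings) are illustrative only, not vshard's actual internals:

```lua
-- Illustrative sketch: a READONLY bucket keeps its old RW refs,
-- but refuses to create new ones.
local buckets = {}  -- id -> {status = 'active'|'readonly', rw_refs = N}

local function bucket_ref_rw(id)
    local b = buckets[id]
    if b.status ~= 'active' then
        return nil, 'bucket is not writable: ' .. b.status
    end
    b.rw_refs = b.rw_refs + 1
    return true
end

local function bucket_unref_rw(id)
    local b = buckets[id]
    b.rw_refs = b.rw_refs - 1
end

-- Mark a bucket READONLY before sending: existing refs stay valid
-- and are awaited; new RW refs are prohibited.
local function bucket_make_readonly(id)
    buckets[id].status = 'readonly'
end

buckets[1] = {status = 'active', rw_refs = 0}
assert(bucket_ref_rw(1))          -- old ref, created while ACTIVE
bucket_make_readonly(1)
local ok, err = bucket_ref_rw(1)  -- new refs are refused
assert(ok == nil and err ~= nil)
bucket_unref_rw(1)                -- the old ref may still finish
```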

Part of tarantool#573

NO_DOC=internal
The check for master is duplicated in the `bucket_send()` function: it
was done explicitly at the beginning of the function and then one more
time during `bucket_check_state()` (previously during
`bucket_refrw_touch()`).

Let's drop the first check. It is already covered by
`storage_1_1_test.test_master_exclusive_api`.

NO_DOC=internal
NO_TEST=tested
This commit adds new functions which are meant to be used instead of
bucket_transfer_start/end() when sending a bucket.

This is needed for tarantool#573, where the rebalancer will start
checking the refs on replicas; proper bucket batching is needed for
that, since it would be way too costly to do before every bucket send.

For that to work we need to mark the bucket as transferring and block
new RW refs (READONLY) far earlier than `bucket_send` happens. The
first is required to simplify the process of picking buckets and helps
not to pick the same one twice. The second is essential, since we don't
want new refs to appear once we have picked the bucket for sending.

Needed for tarantool#573

NO_DOC=refactoring
NO_TEST=refactoring
The route dispenser is a simple container for routes; let's move it out
of the storage file into a separate one in the scope of tarantool#263.
While we're here, let's also rewrite its test to luatest in the scope of
tarantool#371.

Part of tarantool#263
Part of tarantool#371

NO_DOC=refactoring
NO_TEST=refactoring
This commit builds the ground for batching buckets before sending in
the rebalancer workers, which is needed for tarantool#333,
tarantool#573, tarantool#576. From now on a worker picks N buckets,
prepares them in a batch, and only after that do all the workers start
sending the prepared buckets. The maximum number of buckets unavailable
for writing during rebalancing remains `rebalancer_max_sending`.
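The two-phase worker loop described above can be sketched in a few lines of self-contained Lua. The names (`worker_pick`, `worker_send_batch`, the `route` table) and the error handling are assumptions for illustration, not the actual worker code:

```lua
-- Sketch of the two-phase worker: prepare a whole batch first,
-- start sending only after every bucket in the batch is prepared.
local function worker_pick(route, n)
    local batch = {}
    for i = 1, math.min(n, #route.pending) do
        batch[i] = table.remove(route.pending)
    end
    return batch
end

local function worker_send_batch(route, n, prepare, send)
    local batch = worker_pick(route, n)
    -- Phase 1: prepare (e.g. mark READONLY, block new RW refs).
    for _, id in ipairs(batch) do
        local ok, err = prepare(id)
        if not ok then
            return nil, err  -- an error aborts the whole batch
        end
    end
    -- Phase 2: the actual data transfer.
    for _, id in ipairs(batch) do
        send(id, route.dst)
    end
    return #batch
end

local sent = {}
local n = worker_send_batch({pending = {1, 2, 3}, dst = 'rs2'}, 2,
    function() return true end,
    function(id) table.insert(sent, id) end)
assert(n == 2 and #sent == 2)
```

The point of the split is that preparation is cheap per bucket but the replica-ref check it enables is expensive, so it is amortized over the whole batch.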

The test checks that batching works properly when an error happens
during the preparation stage. An error during the sending stage is
already covered by the rebalancer/errinj test. There's also a test
which checks that in case of a slow wakeup the worker continues its
work properly.

Needed for tarantool#333
Needed for tarantool#573
Needed for tarantool#576

NO_DOC=refactoring
Before this commit the rebalancer picked the first encountered bucket
and waited for the number of RW refs on it to become 0. This doesn't
work reliably when the cluster has very long RW requests which block
bucket rebalancing for hours: the instance picks the same buckets and
times out while waiting for no refs.

In order to fix that, let's prefer buckets without refs. If there are
not enough such buckets, take the remainder from the first available
buckets, as it worked before.
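The selection policy can be sketched as a small pure-Lua function. The names (`pick_buckets`, the `has_refs` predicate) are hypothetical, chosen only to show the prefer-then-fall-back logic:

```lua
-- Sketch: prefer buckets without RW refs; if the quota isn't filled,
-- take the remainder from the first available buckets, as before.
local function pick_buckets(buckets, quota, has_refs)
    local picked, fallback = {}, {}
    for _, id in ipairs(buckets) do
        if #picked == quota then break end
        if has_refs(id) then
            table.insert(fallback, id)  -- keep as a last resort
        else
            table.insert(picked, id)    -- ref-free, take it first
        end
    end
    local i = 1
    while #picked < quota and i <= #fallback do
        table.insert(picked, fallback[i])
        i = i + 1
    end
    return picked
end

-- Odd bucket ids have refs here; even ones are picked first.
local picked = pick_buckets({1, 2, 3, 4}, 2,
                            function(id) return id % 2 == 1 end)
assert(picked[1] == 2 and picked[2] == 4)
```

A long-running RW request on one bucket no longer stalls the whole route, because that bucket sinks to the fallback list instead of being retried on every pass.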

Closes tarantool#351

NO_DOC=bugfix
Before this patch it was possible to break replication with long RW
requests during master switches: a master processed some RW requests,
then the master was changed (a pretty casual thing when rebalancing is
in progress due to high load on storages), then the new master started
sending a bucket while a replica still had an RW ref on it. This breaks
replication and requires manual intervention from the user (replication
becomes stopped and must be manually restarted).

In order to fix that, we require all replicas to be running when
rebalancing happens. Before becoming SENDING a bucket becomes READONLY,
and the master waits for RW requests on the replicas to end. In the
READONLY state it's prohibited to create new RW requests for that
bucket. For details see the tarantool#620 RFC.

The commit is also needed for fixing the doubled buckets during master
switch, tarantool#576. By syncing with all replicas we make sure that
all of them have the bucket at least in the READONLY state, so after a
master switch happens the router won't be able to send RW requests to
it. Together with a service which synchronizes the `_bucket` space
before any recovery/gc/rebalancing happens, this will solve
tarantool#576.

Note that from now on we cannot just use a small timeout on
`bucket_send` in tests: it would make the sync with replicas time out
instead of the bucket sending, so in some tests the timeout is
increased.

Closes tarantool#573
Part of tarantool#576

NO_DOC=bugfix
The test is pretty flaky and fails with the error "Duplicate key exists
in unique index 'pk' in space '_bucket'" when trying to force-create
buckets.

Let's properly wait for all buckets to be garbage collected, not just
one, as it was before.

NO_DOC=test
@Serpentian Serpentian force-pushed the doubled-buckets-part-0 branch 2 times, most recently from bc20424 to d4e795f Compare May 13, 2026 19:15
Sometimes the test failed with a `Not found` error when trying to send
a bucket from `replica_2_a` right after its start. The problem is that
the node should first get that bucket from `replica_2_b` via
replication.

This will be fixed automatically with the introduction of the service
which waits for `_bucket` synchronization, but we're not there yet, so
let's fix it manually.

Closes tarantool#528

NO_DOC=test
@Serpentian Serpentian force-pushed the doubled-buckets-part-0 branch from d4e795f to 39b7141 Compare May 13, 2026 19:25
Collaborator Author

Serpentian commented May 13, 2026

  • Fixed the rocks bundle not building properly due to route_dispenser.lua missing from the CMake file. Found in the integration testing.
  • Fixed flaky auto_master_2_2_2, which was found in the catch_flaky workflow.

---
...
_, err = vshard.storage.bucket_send(101, util.replicasets[1], {timeout = 0.1})
_, err = vshard.storage.bucket_send(101, util.replicasets[1], {timeout = 1})
Collaborator

My only comment is that it seems not right that we use one timeout for sync + data send. Sync is entirely bottlenecked by replication and doesn't depend on the bucket size. OTOH, the data send can take much longer, and it does depend on the bucket size.

It seems logical and convenient if we added a new option, sync_timeout. It would be used to wait for the sync, while timeout would be used for sending.

Alternatively, to make it more explicit, we could add 3 options: sync_timeout + send_timeout + timeout.

  • If sync_timeout and timeout are specified, we calculate send_timeout = timeout - sync_timeout.
  • If send_timeout and timeout are specified, we calculate sync_timeout = timeout - send_timeout.
  • If timeout is not specified, we take the default, same as now. Then we apply the rules above.
  • If all 3 are specified, then we ignore timeout. Or raise an error.

This would let us make the sync timeout in tests very high and the send timeout very small, so the tests can remain fast.
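The option-resolution rules proposed above amount to a few lines of arithmetic. This is just a sketch of that arithmetic; `resolve_timeouts` and `DEFAULT_TIMEOUT` are hypothetical names, and which behavior wins when all three options are given is still an open question in the comment:

```lua
-- Sketch of the proposed timeout resolution. Purely illustrative:
-- the option names and the default value are assumptions.
local DEFAULT_TIMEOUT = 10

local function resolve_timeouts(opts)
    local sync = opts.sync_timeout
    local send = opts.send_timeout
    if sync ~= nil and send ~= nil then
        return sync, send  -- all explicit: `timeout` is ignored
    end
    local total = opts.timeout or DEFAULT_TIMEOUT
    if sync ~= nil then
        return sync, total - sync  -- send_timeout = timeout - sync_timeout
    end
    if send ~= nil then
        return total - send, send  -- sync_timeout = timeout - send_timeout
    end
    return total, total  -- no split requested: one timeout, as now
end

-- High sync timeout, small send timeout, as the tests would want.
local sync, send = resolve_timeouts({timeout = 100, sync_timeout = 90})
assert(sync == 90 and send == 10)
```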

