Doubled buckets. Part 1 (rebalancer batching + recovery) #633

Open

Serpentian wants to merge 17 commits into tarantool:master from
Serpentian:gh-351-reblancer-takes-buckets-with-no-refs

Conversation

@Serpentian
Collaborator

Serpentian commented Jan 22, 2026

Part of #619 RFC and full #620.

Closes #351
Closes #573
Part of #214

@Serpentian Serpentian marked this pull request as draft January 26, 2026 12:43
@Serpentian Serpentian force-pushed the gh-351-reblancer-takes-buckets-with-no-refs branch from efe8655 to 15c2a62 Compare January 26, 2026 14:46
@Serpentian
Collaborator Author

Serpentian commented Jan 26, 2026

  • Fixed incorrect error handling and the order of checks in the second commit; now all tests pass without changes there
  • Fixed the same bucket being chosen twice in the fourth commit; also fixed error handling there

And fixed a bunch of other bugs. It seems like a working solution now, ready for review.

@Serpentian Serpentian marked this pull request as ready for review January 26, 2026 14:59
@Serpentian Serpentian force-pushed the gh-351-reblancer-takes-buckets-with-no-refs branch 4 times, most recently from 5ca27b9 to 744a3fe Compare January 27, 2026 17:03
@Serpentian Serpentian assigned kamenkremen and mrForza and unassigned Serpentian Jan 28, 2026
@Serpentian Serpentian requested a review from Gerold103 February 2, 2026 14:51
Collaborator

@Gerold103 Gerold103 left a comment

Thanks! Yeah, it is nice to get this debt fixed 😁💪.

Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/storage/init.lua Outdated
@Serpentian Serpentian force-pushed the gh-351-reblancer-takes-buckets-with-no-refs branch 4 times, most recently from 49cbc0c to db7e444 Compare May 12, 2026 17:30
@Serpentian Serpentian changed the title Doubled buckets. Part 1 (master switch) Doubled buckets. Part 1 (rebalancer batching + recovery) May 12, 2026
Collaborator

@Gerold103 Gerold103 left a comment

Thanks!

The patch is getting quite big (good); I hope my reviews aren't getting too chaotic and scattered.

Comment thread test/storage-luatest/log_verbosity_2_2_test.lua Outdated
Comment thread test/storage-luatest/bucket_triggers_test.lua
Comment thread test/storage-luatest/bucket_triggers_test.lua Outdated
Comment thread test/storage-luatest/bucket_triggers_test.lua Outdated
Comment thread test/storage-luatest/rebalancer_test.lua Outdated
Comment thread vshard/storage/init.lua Outdated
Comment thread vshard/replicaset.lua
Comment thread vshard/replicaset.lua Outdated
Comment thread test/replicaset-luatest/replicaset_2_2_test.lua Outdated
Comment thread vshard/replicaset.lua Outdated
@Serpentian
Collaborator Author

The patch is getting quite big (good)

We can shrink it, if you don't have any new comments for the first part. Here's the PR:
#649

P.S. Addressed the comments; will push the last commit with recovery tomorrow.

Serpentian and others added 17 commits May 14, 2026 09:59
Luatest 1.4.2 has broken compatibility with Tarantool 1.10. See
tarantool/luatest#453 for details. Let's use 1.4.1 for now.

NO_DOC=ci
NO_TEST=ci
Before this patch we woke up the GC service instead of the recovery
service in `bucket_recovery_pause`. This could lead to longer test
execution times. Fix it by calling `recovery_wakeup` instead of
`garbage_collector_wakeup`.

NO_TEST=refactoring
NO_DOC=refactoring
Before this patch the recovery service called the functions determining
whether a bucket can be recovered one by one in `recovery_step_by_type`.
Moreover, the function was called from `recovery_service_f` with a lot
of code duplication. In the following patches the logic of the recovery
service will become more complex. In order to preserve code readability
it was decided to:

  1) Use a for-loop over the recoverability checks instead of calling
     them one by one for each bucket.

  2) Join the logic of saving recovered bucket ids and changing the
     `_bucket` space into one separate function - `bucket_recover`.

The functions `bucket_recover` and `bucket_status_has_destination` will
be used in future commits.

Needed for tarantool#214
Needed for tarantool#573

NO_TEST=refactoring
NO_DOC=refactoring

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
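
A minimal sketch of the resulting shape; the predicate names, statuses
and surrounding state here are hypothetical, not the exact vshard
internals:

-- One entry per recoverable transition: a predicate deciding whether
-- the bucket may be moved to the given status.
local recovery_checks = {
    {status = 'active', is_ok = function(b, remote) return remote == nil end},
    {status = 'garbage', is_ok = function(b, remote) return remote ~= nil end},
}

-- A single place which both updates _bucket and records the recovered
-- id, instead of duplicating this logic per check.
local function bucket_recover(bucket_id, new_status, recovered_ids)
    box.space._bucket:update({bucket_id}, {{'=', 2, new_status}})
    table.insert(recovered_ids, bucket_id)
end

local function recovery_step(bucket, remote_stat, recovered_ids)
    for _, check in ipairs(recovery_checks) do
        if check.is_ok(bucket, remote_stat) then
            bucket_recover(bucket.id, check.status, recovered_ids)
            return true
        end
    end
    return false
end
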
This commit adds a new READONLY bucket status. Old RW refs, created
before the bucket became READONLY, are allowed on top of it; however,
creating new ones is prohibited. It's the state in which we wait for RW
requests to end before sending the bucket. Later we'll introduce waiting
on replicas too. For details, please check out the following RFC:
tarantool#620.

The READONLY state replaces the manual rw_lock-ing of buckets during
bucket sending, so this code can be dropped. The code for
`bucket_refrw_touch` can also be dropped: before this commit a ref was
always created, because there was no way to set `rw_lock` on a bucket
other than via a ref. Now we have a "persisted rw_lock" - the READONLY
bucket status - and we don't have to create the ref forcefully anymore.
However, we still must check that the bucket is ok while sending it, so
during the preparation phase of bucket send this call was replaced with
`bucket_check_state()`.

The test from bucket_ref is removed, since it tests that the rebalancer
continues to work when the storage has a READONLY bucket, which is no
longer true. The rebalancer must wait for 0 SENDING, RECEIVING and
READONLY buckets in the cluster, and only after that can it try
rebalancing one more time.

Part of tarantool#573

NO_DOC=internal
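
The ref-taking rule can be illustrated with a small runnable sketch
(plain Lua; the field names and in-memory counters are illustrative, not
the actual vshard structures):

-- Old RW refs stay valid and simply drain; only taking a *new* RW ref
-- on a READONLY bucket is rejected. RO refs are unaffected.
local function bucket_ref(bucket, mode)
    if mode == 'read' then
        bucket.ro_refs = bucket.ro_refs + 1
        return true
    end
    assert(mode == 'write')
    if bucket.status == 'readonly' then
        return nil, 'bucket is readonly, new RW refs are prohibited'
    end
    bucket.rw_refs = bucket.rw_refs + 1
    return true
end

local b = {status = 'readonly', ro_refs = 0, rw_refs = 2}
assert(bucket_ref(b, 'read'))      -- ok, RO is allowed
assert(not bucket_ref(b, 'write')) -- rejected, bucket is READONLY
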
The check for master is duplicated in the `bucket_send()` function. It
was done explicitly at the beginning of the function and then one more
time during `bucket_check_state()` (previously during
`bucket_refrw_touch()`).

Let's drop the first check. It is already tested in the
`storage_1_1_test.test_master_exclusive_api`.

NO_DOC=internal
NO_TEST=tested
This commit adds new functions meant to be used instead of
bucket_transfer_start/end() when sending a bucket.

This is needed for tarantool#573, where the rebalancer will start to
check the refs on replicas, for which proper bucket batching is needed,
since it would be way too costly to do that before every bucket send.

For that to work we need to mark the bucket as `transfering` and block
new RW refs (`READONLY`) far earlier than `bucket_send` happens. The
first is required to simplify the process of picking buckets and helps
not to pick the same one twice. The second is essential, since we don't
want new refs to appear once we have picked the bucket for sending.

Needed for tarantool#573

NO_DOC=refactoring
NO_TEST=refactoring
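
A hypothetical sketch of the new prepare/finish pair (the names and
exact bookkeeping are assumptions, not the actual implementation):
preparation both registers the bucket as being transferred and persists
the READONLY status, long before bucket_send() runs.

-- Stand-in for the storage module state holding the real
-- rebalancer_transfering_buckets table.
local M = {rebalancer_transfering_buckets = {}}

local function bucket_send_prepare(bucket_id)
    if M.rebalancer_transfering_buckets[bucket_id] then
        return nil, 'bucket is already being transferred'
    end
    -- Remember the pick, so another worker cannot take the same bucket.
    M.rebalancer_transfering_buckets[bucket_id] = true
    -- Persist READONLY: no new RW refs from now on.
    box.space._bucket:update({bucket_id}, {{'=', 2, 'readonly'}})
    return true
end

local function bucket_send_finish(bucket_id)
    M.rebalancer_transfering_buckets[bucket_id] = nil
end
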
The route dispenser is a simple container for routes; let's move it out
of the storage file into a separate one in the scope of tarantool#263.
While we're here, let's also rewrite its test to luatest in the scope of
tarantool#371.

Part of tarantool#263
Part of tarantool#371

NO_DOC=refactoring
NO_TEST=refactoring
This commit builds the ground for batching buckets before sending in the
rebalancer workers, which is needed for tarantool#333, tarantool#573,
tarantool#576. From now on a worker picks N buckets, prepares them as a
batch, and only after that do all the workers start sending the prepared
buckets. The maximum number of buckets unavailable for writing during
rebalancing remains `rebalancer_max_sending`.

The test checks that batching works properly when an error happens
during the preparation stage. An error during the sending stage is
already covered by the rebalancer/errinj test. There's also a test which
checks that in case of a slow wakeup the worker continues its work
properly.

Needed for tarantool#333
Needed for tarantool#573
Needed for tarantool#576

NO_DOC=refactoring
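
The worker flow could be sketched like this (helper names such as
pick_next_bucket, bucket_send_prepare and bucket_send_cancel are
hypothetical; the real worker also handles retries):

local function worker_step(routes, batch_size)
    -- Phase 1: prepare a whole batch. Each prepared bucket is READONLY
    -- and registered as transferring, so no other worker picks it.
    local batch = {}
    for _ = 1, batch_size do
        local bucket_id, destination = pick_next_bucket(routes)
        if not bucket_id then break end
        local ok, err = bucket_send_prepare(bucket_id)
        if not ok then
            -- Roll back what was already prepared and report the error.
            for _, b in ipairs(batch) do bucket_send_cancel(b.id) end
            return nil, err
        end
        table.insert(batch, {id = bucket_id, destination = destination})
    end
    -- Phase 2: only now do the actual sends start.
    for _, b in ipairs(batch) do
        bucket_send(b.id, b.destination)
    end
    return true
end
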
Before this commit the rebalancer picked the first encountered bucket
and waited for the number of RW refs on it to become 0. This doesn't
work reliably when the cluster has very long RW requests, which block
bucket rebalancing for hours: the instance picks the same buckets over
and over and times out while waiting for no refs.

In order to fix that, let's prefer buckets without refs. If there are
not enough such buckets, take the remaining ones from the first
available buckets, as it worked before.

Closes tarantool#351

NO_DOC=bugfix
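
A runnable sketch of the picking policy on a simplified bucket list
(plain Lua, illustrative only):

-- Two passes: first collect buckets with zero RW refs; only if that is
-- not enough, fall back to the first available refed buckets, as before.
local function pick_buckets(buckets, count)
    local picked = {}
    for _, b in ipairs(buckets) do
        if #picked == count then return picked end
        if b.rw_refs == 0 then table.insert(picked, b.id) end
    end
    for _, b in ipairs(buckets) do
        if #picked == count then break end
        if b.rw_refs > 0 then table.insert(picked, b.id) end
    end
    return picked
end

-- With long RW requests hanging on buckets 1 and 2, the rebalancer now
-- picks 3 and 4 first instead of timing out on bucket 1 over and over.
local res = pick_buckets({
    {id = 1, rw_refs = 5}, {id = 2, rw_refs = 1},
    {id = 3, rw_refs = 0}, {id = 4, rw_refs = 0},
}, 2)
assert(res[1] == 3 and res[2] == 4)
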
Before this patch it was possible to break replication with long RW
requests during master switches: a master processed some RW requests,
then the master was changed (which is a pretty casual thing when
rebalancing is in progress due to high load on the storages), then the
new master started sending a bucket while a replica still had an RW ref
on it. This breaks replication and requires manual intervention from the
user (replication becomes stopped and must be manually restarted).

In order to fix that, we require all replicas to be running when
rebalancing happens. Before a bucket becomes SENDING, it becomes
READONLY, and the master waits for RW requests on the replicas to end.
In the READONLY state it's prohibited to create new RW requests for that
bucket. For details see the tarantool#620 RFC.

The commit is also needed for fixing the doubled buckets during master
switch, tarantool#576. By syncing with all replicas we make sure that
all of them have the bucket at least in the READONLY state. So, after a
master switch happens, the router won't be able to send RW requests to
it. Together with the service which synchronizes the `_bucket` space
before any recovery/gc/rebalancing happens, it will solve tarantool#576.

Note that from now on we cannot just use a small timeout on
`bucket_send` in tests: it would trigger a timeout of the sync with the
replicas instead of the bucket sending, so in some tests the timeout is
increased.

Closes tarantool#573
Part of tarantool#576

NO_DOC=bugfix
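
The ordering this relies on, as a sketch (helper names hypothetical):

local function bucket_send_safe(bucket_id, destination, timeout)
    -- 1. Persist READONLY: once it replicates, nobody can take a new
    --    RW ref on this bucket anywhere in the replicaset.
    box.space._bucket:update({bucket_id}, {{'=', 2, 'readonly'}})
    -- 2. Wait until every replica reports zero RW refs on the bucket.
    local ok, err = wait_rw_refs_gone_on_replicas(bucket_id, timeout)
    if not ok then
        -- Not SENDING yet, so rolling back to ACTIVE is safe.
        box.space._bucket:update({bucket_id}, {{'=', 2, 'active'}})
        return nil, err
    end
    -- 3. Only now start the actual transfer.
    box.space._bucket:update({bucket_id}, {{'=', 2, 'sending'}})
    return bucket_send_data(bucket_id, destination, timeout)
end
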
The test is pretty flaky and fails with the error "Duplicate key exists
in unique index 'pk' in space '_bucket'" when trying to force-create
buckets.

Let's properly wait for all buckets to be garbage collected, not just
one, as it was before.

NO_DOC=test
Sometimes the test failed with a `Not found` error when trying to send a
bucket from `replica_2_a` right after its start. The problem is that the
node first has to receive that bucket from `replica_2_b` via
replication.

It'll be fixed automatically with the introduction of the service which
waits for `_bucket` synchronization, but we're not there yet, so let's
fix it manually.

Closes tarantool#528

NO_DOC=test
This commit fixes the flakiness of the `log_verbosity_2_2_test`:

 1. Makes the messages for log grepping more specific, in order to
    exclude finding the same prefix with different suffixes several
    times.
 2. Increases the timeout for `wait_log_exactly_once`, since sometimes
    there was not enough time for the message to be logged.
 3. Flushes the ratelimiters before grepping the log, since the same
    error code could have already been logged previously.

Needed for tarantool#214

NO_DOC=test

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
Before this patch the recovery service decided whether a bucket should
be recovered based only on its status on the sender and receiver nodes
and the `rebalancer_transfering_buckets` table. This approach could lead
to doubled buckets when the recovery service didn't find a remote bucket
on the destination node and recovered the local bucket into the "active"
state.

In order to partially fix this issue we extend the `_bucket` system
space with a new field - `opts` of map type. We make it nullable for
backward compatibility with old vshard versions and add one table key -
`generation`. When the bucket is transferred from one node to another,
`bucket_send` increments the bucket's generation by 1, and `bucket_recv`
persists that generation.

Part of tarantool#214

NO_DOC=internal

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
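
Illustrating the new tuple layout and the increment rule (a sketch; the
exact contents of `opts` beyond `generation` are not fixed by this
commit):

-- _bucket tuple: {id, status, destination, opts}. opts is a nullable
-- map, so old vshard versions may keep writing 3-field tuples.
local bucket = {1, 'active', nil, {generation = 3}}

-- bucket_send bumps the generation together with the SENDING switch,
-- treating a missing opts as generation 0:
local function next_generation(b)
    local opts = b[4] or {generation = 0}
    return {generation = opts.generation + 1}
end
assert(next_generation(bucket).generation == 4)
-- bucket_recv then persists the received generation on its side.
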
In this patch we extract the block of code responsible for sending
asynchronous requests and waiting for the corresponding responses from
`replicaset_map_call` into a separate function - `replicas_map_call`.

This is done because in the next patch, tarantool#214, we will need a
function which performs map-reduce over the masters of the cluster.
`replicas_map_call` can be reused in it.

Needed for tarantool#214

NO_TEST=refactoring
NO_DOC=refactoring
In the tarantool#214 patch the storage needs to perform a cluster-wide
map-reduce across the masters of the replicasets. In order to make it
possible we introduce a new function in the `replicaset` module -
`map_masters_call`. This function waits until all the given masters are
connected and then calls a function on them.

Note that the commit slightly refactors `replicas_map_call`: from now on
not only `replica_id` may be passed as keys in `replicas` tables, but
also `replicaset_id`.

Part of tarantool#214

NO_DOC=internal

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
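
A hedged usage sketch; the argument shapes are assumed by analogy with
the existing replicas_map_call()/map_call() family and the
recovery_bucket_stat call seen below, not taken from the actual
signature:

local replicaset = require('vshard.replicaset')

-- M, consts and bucket_id come from the surrounding storage module.
-- Wait until all masters are connected, then call a storage function on
-- each of them; results are expected to be keyed by replicaset_id.
local res, err = replicaset.map_masters_call(
    M.replicasets, 'vshard.storage._call',
    {'recovery_bucket_stat', bucket_id},
    {timeout = consts.RECOVERY_GET_STAT_TIMEOUT})
if not res then
    log.error('map-reduce over masters failed: %s', err)
end
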
The commit should fix the doubled bucket problem which happens due to
stray TCP packets, described in tarantool#214.

When a bucket is sent, its generation is incremented (we make the bucket
SENDING and increment the generation in one transaction) and sent
alongside the bucket's data to bucket_recv; the receiver side persists
that generation in _bucket.

Recovery uses that generation to distinguish which bucket is more
recent. Firstly, the node goes to the sender: if there's a bucket there
in any state with a greater generation, the local one is GARBAGE; we
don't care about the status here. If the bucket's generation is equal to
the local one, we use the old logic. If the bucket is missing from the
remote node, then we fullscan all the masters of the cluster. When all
of the nodes have replied: if a higher generation exists anywhere, the
local bucket is GARBAGE, otherwise it becomes ACTIVE.

Part of tarantool#214

NO_DOC=bugfix

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
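
The decision procedure condenses to roughly this runnable pure-Lua
sketch of the rules above (names illustrative):

-- local_gen: generation of the local hanging bucket.
-- peer_stat: bucket stat from the counterpart node, or nil if the
-- bucket is missing there. cluster_stats: stats from all masters,
-- gathered only when the counterpart knows nothing about the bucket.
local function recovery_decision(local_gen, peer_stat, cluster_stats)
    if peer_stat then
        if peer_stat.generation > local_gen then
            return 'garbage'   -- remote copy is newer, status irrelevant
        end
        return 'old_logic'     -- equal generation: fall back to statuses
    end
    for _, stat in ipairs(cluster_stats or {}) do
        if stat.generation > local_gen then
            return 'garbage'
        end
    end
    return 'active'            -- nobody has a newer copy
end

assert(recovery_decision(2, {generation = 3}) == 'garbage')
assert(recovery_decision(2, nil, {{generation = 2}}) == 'active')
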
end)
end

--
Collaborator Author

Refactoring/rewriting of the tests is in progress, not ready yet.

Comment thread vshard/storage/init.lua
bucket_reject_update(bid, "transition to '%s' changes generation",
new_status)
bucket_reject_update(bid, "transition to '%s' changes generation " ..
"%d -> %d", new_status, new_gen, old_gen)
Collaborator Author

Must be in the opts.gen part

Comment thread vshard/storage/init.lua
destination, 'vshard.storage._call',
{'recovery_bucket_stat', bucket_id},
{timeout = consts.RECOVERY_GET_STAT_TIMEOUT})
log.warn('%s - %s', json_encode(bucket), json_encode(remote_bucket))
Collaborator Author

Remove!

Comment thread vshard/storage/init.lua
end
-- The replicaset_id is saved for logging purposes.
assert(not info.replicaset_id)
info.replicaset_id = rs_id
Collaborator Author

Not used: remove

Comment thread test/storage/storage.test.lua Outdated
--
res, err = vshard.storage.bucket_recv( \
4, util.replicasets[2], {{box.space.test.id, {{9, 4}, {10, 4}, {1, 4}}}})
res, err = vshard.storage.bucket_recv(4, util.replicasets[2], {{box.space.test.id, {{9, 4}, {10, 4}, {1, 4}}}}, {generation = 1})
Collaborator

This goes beyond 80 symbols now. Let's keep the line wrap.

Comment thread vshard/storage/init.lua Outdated
Comment on lines +585 to +599
-- SENDING bucket is the only one, which increments the generation for
-- now, though it can be extended in the future, but the generation
-- cannot be incremented on transition to SENT, RECEIVING or GARBAGE,
-- since this will break the recovery process on the non-upgraded
-- storages, if we decide to change the recovery in the future.
local old_gen = old_bucket.opts and old_bucket.opts.generation or 0
local new_gen = new_bucket.opts and new_bucket.opts.generation or 0
if new_status == BSENDING and new_gen ~= old_gen + 1 then
    bucket_reject_update(bid, "sending bucket update doesn't " ..
                         "increment generation")
end
if (new_status == BSENT or new_status == BGARBAGE)
   and new_gen ~= old_gen then
    bucket_reject_update(bid, "transition to SENT or GARBAGE " ..
                         "changes generation")
Collaborator

Did I understand right that in the ACTIVE -> READONLY transition this code will allow bumping the generation, even though it should not be allowed?

Perhaps stricter code could do it by banning generation changes for everything except ACTIVE -> SENDING, instead of trying to write a rule for every status. It is safer to ban everything and then make allowed exceptions, rather than to allow everything and then add bans as exceptions.

Something like

local old_gen = old_bucket.opts and old_bucket.opts.generation or 0
local new_gen = new_bucket.opts and new_bucket.opts.generation or 0

if old_status == BACTIVE and new_status == BSENDING and
   old_gen + 1 ~= new_gen then
    bucket_reject_update(...)
elseif old_gen ~= new_gen then
    bucket_reject_update(...)
end

Wouldn't it cover everything?

Comment thread vshard/replicaset.lua
buildall = buildall,
calculate_etalon_balance = cluster_calculate_etalon_balance,
wait_masters_connect = wait_masters_connect,
map_masters_call = map_masters_call,
Collaborator

I would suggest to rename this function to masters_map_call(). The reason is that it would be more consistent with our existing map-calls:

  • replicas_map_call().
  • replicaset:map_call()
  • router.map_call_rw()

As you can see, the object is first, and map_call is second. Sorry for a nit.


test_group.after_all(function(g)
g.cluster:drop()
g.cluster:stop()
Collaborator

Hm, why?

local res, err = ivshard.storage.bucket_send(bucket_id, dest_id)
t.assert(res)
t.assert_not(err)
t.helpers.retrying({}, function()
Collaborator

Perhaps use iwait_timeout to avoid flakiness. Default timeout in retrying() is too small. Same for bucket_send() above.

Same in all places below which also have timeout options.

Comment thread vshard/storage/init.lua
local function recovery_cluster_bucket_stat(bucket_id, opts)
    local replicasets = {}
    for rs_id, rs in pairs(M.replicasets) do
        -- Exclude self and opts.except (destination). We already know, that
Collaborator

The function is not public, and except is never more than one node. Perhaps then replace opts with just destination.

Also, destination might not be fully descriptive. For example, if we are recovering a RECEIVING bucket, then this will be the source. Perhaps a more generic name, like other_rs_id or remote_rs_id, would remove the ambiguity.

Comment thread vshard/storage/init.lua
Comment on lines +1024 to +1026
if err then
    err.replicaset_id = rs_id
    return nil, err
Collaborator

Maybe add an assert here, that info == nil.

Comment thread vshard/storage/init.lua
    err.replicaset_id = rs_id
    return nil, err
end
if not bucket_info then
Collaborator

bucket_info variable is declared before the loop and is never assigned until some lines below. Which means this loop will just always go here, see not bucket_info, go to continue, and do nothing. No? Perhaps here and below it must be just info. Do tests pass?

Comment thread vshard/storage/init.lua
Comment on lines +1031 to +1034
if bucket_info and bucket_info.generation >= info.generation then
    -- We're interested in the max generation among all replicasets.
    goto continue
end
Collaborator

In a cluster with old versions of vshard we might have a situation where a bucket is ACTIVE on one rs and GARBAGE on another, both with the same generation (= 0). With some bad luck, you might first notice the GARBAGE one, ignore the ACTIVE one as having the same generation, and then recover the local one to ACTIVE as well.

Maybe when generations match, it is a good idea to prefer buckets having statuses other than SENT or GARBAGE.
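
The suggested tie-break could look like this (a pure-Lua sketch; status
strings as they appear in _bucket):

-- Pick the "better" of two remote stats: a higher generation wins; on
-- equal generations prefer a status other than SENT/GARBAGE, so a stale
-- GARBAGE record cannot shadow a live ACTIVE one.
local function is_dead(status)
    return status == 'sent' or status == 'garbage'
end

local function better_stat(a, b)
    if a == nil then return b end
    if b == nil then return a end
    if b.generation ~= a.generation then
        return b.generation > a.generation and b or a
    end
    if is_dead(a.status) and not is_dead(b.status) then return b end
    return a
end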


Development

Successfully merging this pull request may close these issues.

RW refs break replication when master changes
Rebalancer should firstly try to pick buckets which already have no refs

4 participants