Race condition in disrupt_abort_repair #11126

@cezarmoise

Description

During disrupt_abort_repair, the call to storage_service/force_terminate_repair can arrive just after one repair has finished and before the next one starts. In that window there is nothing to abort, so the call is effectively a no-op: the nodetool repair never fails, runs to completion, and may hit the 120s timeout on the thread. The log excerpt below shows this window; a sketch of a possible mitigation follows it.

Jun 09 23:49:17.110407 longevity-50gb-12h-2024-2-db-node-d169477f-2 scylla[6905]:  [shard  0:strm] repair - repair[dbc91c86-2a77-4c7a-91f9-b93721bb823e]: starting user-requested repair for keyspace system_distributed_everywhere, repair id 56, options {"trace": "false", "primaryRange": "false", "jobThreads": "1", "incremental": "false", "parallelism": "parallel"}
...
Jun 09 23:49:18.147909 longevity-50gb-12h-2024-2-db-node-d169477f-2 scylla[6905]:  [shard  0:strm] repair - repair[dbc91c86-2a77-4c7a-91f9-b93721bb823e]: completed successfully
Jun 09 23:49:18.282972 longevity-50gb-12h-2024-2-db-node-d169477f-2 scylla[6905]:  [shard  0:strm] repair - Started to abort repair jobs={}, nr_jobs=0
Jun 09 23:49:18.315701 longevity-50gb-12h-2024-2-db-node-d169477f-2 scylla[6905]:  [shard  0:strm] repair - repair[585f8d94-1250-4719-89ea-1f53a5bed91a]: starting user-requested repair for keyspace drop_table_during_repair_ks_6, repair id 57, options {"trace": "false", "primaryRange": "false", "jobThreads": "1", "incremental": "false", "parallelism": "parallel"}
Jun 09 23:49:18.315717 longevity-50gb-12h-2024-2-db-node-d169477f-2 scylla[6905]:  [shard  0:strm] repair - repair[585f8d94-1250-4719-89ea-1f53a5bed91a]: completed successfully: no tables to repair
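A minimal sketch of one way the nemesis could close this race: instead of firing force_terminate_repair once and assuming it landed on a running repair, keep polling for active repairs and re-issuing the abort until no repair remains or a deadline expires. This is not the SCT implementation; the polling endpoint /storage_service/active_repair/, the API address, and all names and timeouts below are assumptions for illustration. Only the force_terminate_repair path is taken from this issue.

import time
import requests

API = "http://10.12.9.121:10000"  # hypothetical node REST API address


def abort_repair_until_idle(timeout: float = 120.0, poll_interval: float = 1.0) -> bool:
    """Repeatedly terminate repairs until no repair is active or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Assumed endpoint: returns the list of currently active repair ids.
        active = requests.get(f"{API}/storage_service/active_repair/").json()
        if not active:
            return True  # nothing running, so there is no repair left to win the race
        # Endpoint from the issue: ask the node to terminate ongoing repairs.
        requests.post(f"{API}/storage_service/force_terminate_repair")
        time.sleep(poll_interval)
    return False

With a loop like this, an abort issued in the gap between two repairs simply retries once the next repair (repair id 57 in the log above) starts, rather than silently doing nothing.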

Packages

Scylla version: 2024.2.11-20250609.98e7e1fec707 with build-id 0137055552a86ec74fe7808066cd25cae9b712a1
Kernel Version: 5.15.0-1085-aws

Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-50gb-12h-2024-2-db-node-d169477f-5 (34.201.146.63 | 10.12.10.226) (shards: 12)
  • longevity-50gb-12h-2024-2-db-node-d169477f-4 (13.218.33.140 | 10.12.10.191) (shards: 11)
  • longevity-50gb-12h-2024-2-db-node-d169477f-3 (18.209.14.117 | 10.12.9.2) (shards: -1)
  • longevity-50gb-12h-2024-2-db-node-d169477f-2 (98.84.130.64 | 10.12.9.121) (shards: 11)
  • longevity-50gb-12h-2024-2-db-node-d169477f-1 (13.218.246.21 | 10.12.10.20) (shards: 11)

OS / Image: ami-0abd2efc39812f7d0 (aws: undefined_region)

Test: longevity-150gb-asymmetric-cluster-12h-test
Test id: d169477f-0422-4d78-b8ce-864c35c693db
Test name: enterprise-2024.2/tier1/longevity-150gb-asymmetric-cluster-12h-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d169477f-0422-4d78-b8ce-864c35c693db
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d169477f-0422-4d78-b8ce-864c35c693db

Logs:

Jenkins job URL
Argus
