CASSANDRA-20581 Improved observability in AutoRepair to report both expected vs. actual repair bytes and expected vs. actual keyspaces #4126

Open
wants to merge 1 commit into
base: trunk
@jaydeepkumar1984 jaydeepkumar1984 commented Apr 29, 2025

NOTE: This is a draft PR intended to gather early feedback. Test cases, code structure, and other refinements are not yet finalized.

Overall, we want to improve the AutoRepair scheduler's observability. The whole work will be divided into multiple PRs, with this PR adding the following two capabilities:

  1. On each node, report the total number of bytes expected to be repaired, and keep the count of bytes already repaired up to date so that progress is visible.
  2. In the same way, report the expected vs. actual keyspace repair plans.
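
To make the expected vs. actual idea concrete, here is a minimal sketch of such a tracker. It is not the code in this PR: the class `RepairProgressSketch` and all of its method names are hypothetical, and it assumes the expected totals are fixed once when the schedule is planned, while the actual counters are incremented as repair sessions finish.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; names do not come from the PR.
final class RepairProgressSketch
{
    // Expected values, computed once when the repair schedule is planned.
    private final long totalBytesToRepair;
    private final int totalKeyspaceRepairPlans;

    // Actual values, updated as repair sessions complete.
    private final AtomicLong bytesAlreadyRepaired = new AtomicLong();
    private final AtomicInteger keyspaceRepairPlansAlreadyRepaired = new AtomicInteger();

    RepairProgressSketch(long totalBytesToRepair, int totalKeyspaceRepairPlans)
    {
        if (totalBytesToRepair < 0 || totalKeyspaceRepairPlans < 0)
            throw new IllegalArgumentException("expected totals must be non-negative");
        this.totalBytesToRepair = totalBytesToRepair;
        this.totalKeyspaceRepairPlans = totalKeyspaceRepairPlans;
    }

    void recordBytesRepaired(long bytes)
    {
        bytesAlreadyRepaired.addAndGet(bytes);
    }

    void recordKeyspacePlanRepaired()
    {
        keyspaceRepairPlansAlreadyRepaired.incrementAndGet();
    }

    // Percentage of expected bytes repaired so far. Capped at 100 because the
    // expected size is an estimate and the actual count may overshoot it.
    double bytesProgressPercent()
    {
        if (totalBytesToRepair == 0)
            return 100.0; // nothing was expected, so there is nothing left to do
        return Math.min(100.0, 100.0 * bytesAlreadyRepaired.get() / totalBytesToRepair);
    }

    // Number of keyspace repair plans still pending, never negative.
    int keyspacePlansRemaining()
    {
        return Math.max(0, totalKeyspaceRepairPlans - keyspaceRepairPlansAlreadyRepaired.get());
    }

    public static void main(String[] args)
    {
        RepairProgressSketch progress = new RepairProgressSketch(1000L, 4);
        progress.recordBytesRepaired(250L);
        progress.recordKeyspacePlanRepaired();
        System.out.println(progress.bytesProgressPercent() + "% of expected bytes repaired");
        System.out.println(progress.keyspacePlansRemaining() + " keyspace repair plans remaining");
    }
}
```

Exposing the expected and actual values as separate gauges, rather than only a derived percentage, would let a dashboard show both the absolute remaining work and the progress ratio.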

With this PR, we can visualize something like
(two images attached)

The Cassandra Jira

@@ -84,13 +86,22 @@ public abstract class AutoRepairState
@VisibleForTesting
protected int skippedTablesCount = 0;
@VisibleForTesting
protected long totalBytesToRepair = 0;
Contributor Author

In the future, we can extend this by adding expected/actual partitions, token ranges, etc.

- List<PrioritizedRepairPlan> repairPlans = PrioritizedRepairPlan.build(keyspacesAndTablesToRepair, repairType, shuffleFunc);
+ List<PrioritizedRepairPlan> repairPlans = PrioritizedRepairPlan.build(keyspacesAndTablesToRepair, repairType, shuffleFunc, primaryRangeOnly);

int keyspaceRepairPlansSofar = 0;
Contributor Author

@jaydeepkumar1984 jaydeepkumar1984 Apr 29, 2025

Because of our priority-based allotment, it is cumbersome to tell whether a keyspace has been completely repaired. In the common case, where no priority is set, one keyspace corresponds to exactly one keyspaceRepairPlans entry. I agree that it is hard for end users to distinguish a keyspace from its keyspaceRepairPlans when priority is set, but as a first step we can target accuracy for operators that have no priority set.
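
The mismatch between keyspaces and keyspace repair plans can be illustrated with a small sketch. It assumes, as a simplification rather than a description of `PrioritizedRepairPlan.build`, that tables are grouped first by priority and then by keyspace, so each distinct (priority, keyspace) pair becomes one keyspace repair plan. The class `KeyspacePlanCountSketch`, its nested `TableRef` type, and the example priorities are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not the grouping code used by AutoRepair.
final class KeyspacePlanCountSketch
{
    // A table together with the repair priority assumed for it.
    static final class TableRef
    {
        final String keyspace;
        final String table;
        final int priority;

        TableRef(String keyspace, String table, int priority)
        {
            this.keyspace = keyspace;
            this.table = table;
            this.priority = priority;
        }
    }

    // Counts one keyspace repair plan per distinct (priority, keyspace) pair.
    static int countKeyspaceRepairPlans(List<TableRef> tables)
    {
        Map<Integer, Set<String>> keyspacesByPriority = new TreeMap<>();
        for (TableRef t : tables)
            keyspacesByPriority.computeIfAbsent(t.priority, p -> new TreeSet<>()).add(t.keyspace);

        int plans = 0;
        for (Set<String> keyspaces : keyspacesByPriority.values())
            plans += keyspaces.size();
        return plans;
    }

    public static void main(String[] args)
    {
        // No priority set: every table shares priority 0, so 2 keyspaces give 2 plans.
        List<TableRef> noPriority = Arrays.asList(new TableRef("ks1", "t1", 0),
                                                  new TableRef("ks1", "t2", 0),
                                                  new TableRef("ks2", "t3", 0));
        // One table of ks1 gets a different priority, so ks1 is split across two plans.
        List<TableRef> withPriority = Arrays.asList(new TableRef("ks1", "t1", 2),
                                                    new TableRef("ks1", "t2", 0),
                                                    new TableRef("ks2", "t3", 0));
        System.out.println(countKeyspaceRepairPlans(noPriority));
        System.out.println(countKeyspaceRepairPlans(withPriority));
    }
}
```

Under this assumption the plan count equals the keyspace count only when every table has the same priority, which is why the counter is most meaningful for operators that set no priority.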

@VisibleForTesting
protected int totalKeyspaceRepairPlansToRepair = 0;
@VisibleForTesting
protected int keyspacesRepairPlansAlreaydRepaired = 0;
Contributor

sp: keyspacesRepairPlansAlreadyRepaired

@@ -101,6 +112,19 @@ protected RepairCoordinator getRepairRunnable(String keyspace, RepairOption opti
options, keyspace);
}

public void calcExpectedScheduleStats(List<PrioritizedRepairPlan> repairPlans)
Contributor

nitpick: rename to something like calculateRepairPlanStatistics or updateScheduleStatistics to be more descriptive (feel free to ignore)

@jaydeepkumar1984 jaydeepkumar1984 force-pushed the trunk_cassandra_20581 branch 19 times, most recently from bffa9e2 to fc6bd13 Compare May 2, 2025 02:06
@smiklosovic smiklosovic changed the title [Draft] AutoRepair Observability: Displaying expected/actual repair size & key spaces plan tracker CASSANDRA-20581 [Draft] AutoRepair Observability: Displaying expected/actual repair size & key spaces plan tracker May 5, 2025
@jaydeepkumar1984 jaydeepkumar1984 force-pushed the trunk_cassandra_20581 branch 6 times, most recently from 166174a to 71ea40d Compare May 6, 2025 00:42
@jaydeepkumar1984 jaydeepkumar1984 force-pushed the trunk_cassandra_20581 branch 8 times, most recently from 2322b33 to 13b1667 Compare May 6, 2025 17:44
@jaydeepkumar1984 jaydeepkumar1984 changed the title CASSANDRA-20581 [Draft] AutoRepair Observability: Displaying expected/actual repair size & key spaces plan tracker CASSANDRA-20581 [Draft] Improved observability in AutoRepair to report both expected vs. actual repair bytes and expected vs. actual keyspaces May 6, 2025
@jaydeepkumar1984 jaydeepkumar1984 force-pushed the trunk_cassandra_20581 branch 7 times, most recently from 13c09a3 to 18a1e0a Compare May 6, 2025 23:17
@@ -1186,4 +1196,197 @@ public static Collection<Range<Token>> split(Range<Token> tokenRange, int number
}
return ranges;
}

Contributor Author

Moved the following APIs from RepairTokenRangeSplitter.java to here

* @param tableNames tables to repair for the given keyspace.
* @return Single repair plan.
*/
static List<PrioritizedRepairPlan> buildSingleKeyspacePlan(AutoRepairConfig.RepairType repairType, String keyspaceName, String ... tableNames)
Contributor Author

Removed this API because it was only used in tests. Created a similar API under unit/*

@jaydeepkumar1984 jaydeepkumar1984 force-pushed the trunk_cassandra_20581 branch 2 times, most recently from 2f35462 to 390a08b Compare May 7, 2025 05:39
…al repair bytes and expected vs. actual keyspaces
@jaydeepkumar1984 jaydeepkumar1984 force-pushed the trunk_cassandra_20581 branch from 390a08b to dc743fd Compare May 7, 2025 16:42
@jaydeepkumar1984 jaydeepkumar1984 changed the title CASSANDRA-20581 [Draft] Improved observability in AutoRepair to report both expected vs. actual repair bytes and expected vs. actual keyspaces CASSANDRA-20581 Improved observability in AutoRepair to report both expected vs. actual repair bytes and expected vs. actual keyspaces May 7, 2025

2 participants