-
Notifications
You must be signed in to change notification settings - Fork 474
Consolidate the ServerLock deletion code #6049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 2.1
Are you sure you want to change the base?
Conversation
Modifies ZooZap to use the same deletion logic as ServiceLock.
When the manager is gathering per table info it will attempt to use the existing tserver connection to request table information. If an exception is thrown during this request then the manager will record the failure by computing an entry in the badServers unable to be reached by the manager for long periods of time, the manager will attempt to communicate with the tserver 3 times before attempting a halt action. This commit adds a new property of max amount of halt requests that allows the manager to delete a zlock once the max halt requests have been attempted.
5943be4 to
87ae926
Compare
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java
Outdated
Show resolved
Hide resolved
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java
Outdated
Show resolved
Hide resolved
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java
Outdated
Show resolved
Hide resolved
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java
Outdated
Show resolved
Hide resolved
…er.java Co-authored-by: Dave Marion <[email protected]>
| "The number of threads used to run fault-tolerant executions (FATE)." | ||
| + " These are primarily table operations like merge.", | ||
| "1.4.3"), | ||
| MANAGER_MAX_TSERVER_HALTS("manager.max.tservers.halts", "0", PropertyType.COUNT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Counts can be hard to reason about because you have no idea how long it will take to do X counts. Like if this was set to 3, three attempts could happen in 50ms or take 30min. For the 50ms case, you would probably want to give it a bit more time before deleting the lock. For the 30min case, may want to delete the lock before doing 3 attempts. A combo of time and attempts seems best, but not sure how to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked this over with @ctubbsii and settled on on using a time-based max timeout for halt requests.
The number of attempts is already based on the max rpc request timeout and number of failed communication attempts to the tserver.
If the tserver is able to successfully return information after a halt is called on it, then this code path isn't followed.
What is unclear is if connections can still be attempting to a tserver that is in the process of halting.
Moves the rpc halt request into an else statement so an RPC halt request is not attempted on a tserver without a zlock. Moved the server removal from the halted map into the try section so it doesn't get removed if the zlock removal failed.
Changes the naming for the property and some vars to better describe intent.
Switches the halt logic to be time-based since number of attempts is based on general.rpc.timeout.
Modifies ZooZap to use the same deletion logic as ServiceLock.