Skip to content

Commit 8ad5f86

Browse files
authored
[JENKINS-49707] Agent missing after controller restart to fail resumption of node step, not kill whole build (#180)
* Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution` * Amending test from #34 * Fixing `Callback` * Exploring where to detect a failure eligible for retry * Making `normalNodeDisappearance` pass under somewhat altered circumstances * First successful `node` block retry * `FilePathDynamicContext` must take precedence over `ExecutorStepDynamicContext.FilePathTranslator` * `ExecutorStepTest.unloadableExecutorPickle` as currently conceived no longer makes sense * Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableException` by copying some diagnostic code from `FilePathDynamicContext` * Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnection` * Marking `ExecutorStepRetryEligibility` beta * Javadoc imports * SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.node` to be null * Reasonable behavior of `ExecutorPickleTest.canceledQueueItem` * User-visible logging about 5m timeout now that `ExecutorPickle.printWaitingMessage` is unused * `ExecutorPickleTest` → `ExecutorStepDynamicContextTest` * Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `ExecutorStepExecution` * Some user-visible logging when `ExecutorStepRetryEligibility` is _not_ used * Do not block on `FutureImpl`. After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer` which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message: ``` FINE o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart #1] java.util.concurrent.TimeoutException at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553) at … FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back ``` * A `sh` step could fail to find a workspace at the moment it starts. java.lang.NullPointerException at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:329) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at … * Also handling `ClosedChannelException` as commonly thrown by `SimpleBuildStep`s and the like: jenkinsci/kubernetes-plugin#1083 (comment) * Also retry after `MissingContextVariableException` on `FilePath` * Remove `super.onResume` left over in #46; see jenkinsci/workflow-step-api-plugin#66 * Documenting need for JENKINS-30383 * jenkinsci/workflow-step-api-plugin#77 allows `retryNodeBlockSynchAcrossRestarts` to be fixed * jenkinsci/workflow-step-api-plugin#77 released * Expanding `MissingContextVariableException` type list after noticing that `isUnix` requests `Launcher` not `FilePath` * Never log a message with only `PlaceholderTask.cookie`; also need `.runId` for context * Suppressing uninteresting log message * Maybe need to call `StepContext.saveState`? * Deleting comment; for now it seems clearest to keep `ExecutorStepDynamicContext` as a distinct type * Resolve longstanding tech debt from #104 & #117 about `BodyExecution`. When we do not use pickles, there is no need for `transient`. Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak. This is only possible now that `Callback` holds the `ExecutorStepExecution`. And we can now properly handle `AsynchronousExecution.interrupt`. * Sketch of updating `WorkspaceStepExecution`: #180 (comment) * Allowing `FilePathDynamicContext` to block waiting for an agent to reconnect. Not great since the CPS VM thread will only grant 5s. Better would be to avoid unnecessary use of `StepContext.get(FilePath.class)` (which implicitly requires a live `Channel`); perhaps `FilePathRepresentation` should become an API. Amends b65ff1c. #180 (comment) * Adding a comment about 8b08398 * 8b08398 seems to allow a968adf to be finished. * Refining 8b08398 to avoid tail call * `org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated * SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an SVUID; taking one from trunk * Adding `NodeTranslator` * Beginning rewrite to `AgentErrorCondition` * Extracted `RetryExecutorStepTest` from `ExecutorStepTest` * Rely on blocking `ExecutorStepExecution.onResume` * Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `unrestorableAgent`. Sometimes a failing `waitForMessage` seems to trigger https://maven.apache.org/surefire/maven-surefire-plugin/faq.html#corruptedstream though I cannot track down why. * `PlaceholderTask` may not hold a reference to `ExecutorStepExecution` * Equivalent of #184 for `ExecutorStepDynamicContext` * Assert jenkinsci/workflow-api-plugin@b1778a9 * jenkinsci/workflow-cps-plugin#534 & jenkinsci/workflow-job-plugin#260 released * Windows tests jenkinsci/workflow-basic-steps-plugin#203 (comment) * Avoiding `powershell` jenkinsci/workflow-basic-steps-plugin#203 (comment) * Better usage of Hamcrest; got a flake https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=6779178701 with `QueueTaskCancelled` instead of `RemovedNodeCause` * In fact the flake was in a different test (`canceledQueueItem`) * Forgot to make `canceledQueueItem` not use `sh`, though it should not really matter since this step is expected to fail anyway * Test flake: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=7240329028 * Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well * Can now run all of `AgentErrorConditionTest` * Responding to `CancellationException` in `ExecutorStepDynamicContext` with `QueueTaskCancelled`, analogous to handling in `ExecutorPickle` * Permit `canceledQueueItem` to pass on `RemovedNodeCause` * Normalizing indentation * jenkinsci/workflow-step-api-plugin#124 deprecates an overload which 2e95e3a only handles at one call site; TBD if others should be switched to `actualInterruption` * Refined `ExecutorStepDynamicContext` behavior in case of `quietingDown` #180 (comment) or `terminating` #180 (comment) * Add an explicit `serialVersionUID` to `ExecutorStepExecution.PlaceholderTask.Callback` #180 (comment) * Optimizing `withExecution` to only consider steps in the current build #180 (comment) * Pick up jenkinsci/workflow-api-plugin#256 #180 (comment) * jenkinsci/workflow-api-plugin#256 released
1 parent e663bd5 commit 8ad5f86

File tree

15 files changed

+592
-277
lines changed

15 files changed

+592
-277
lines changed

pom.xml

+1
Original file line numberDiff line numberDiff line change
@@ -94,6 +94,7 @@
9494
<dependency>
9595
<groupId>org.jenkins-ci.plugins.workflow</groupId>
9696
<artifactId>workflow-api</artifactId>
97+
<version>1200.v8005c684b_a_c6</version>
9798
</dependency>
9899
<dependency>
99100
<groupId>org.jenkins-ci.plugins.workflow</groupId>

src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java

+7-4
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,6 @@
8181
import org.jenkinsci.plugins.workflow.steps.StepExecution;
8282
import org.jenkinsci.plugins.workflow.support.concurrent.Timeout;
8383
import org.jenkinsci.plugins.workflow.support.concurrent.WithThreadName;
84-
import org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle;
8584
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
8685
import org.kohsuke.accmod.Restricted;
8786
import org.kohsuke.accmod.restrictions.DoNotUse;
@@ -306,6 +305,9 @@ static final class Execution extends AbstractStepExecutionImpl implements Runnab
306305
returnStatus = step.returnStatus;
307306
StepContext context = getContext();
308307
ws = context.get(FilePath.class);
308+
if (ws == null) {
309+
throw new AbortException("No workspace currently accessible");
310+
}
309311
node = FilePathUtils.getNodeName(ws);
310312
DurableTask durableTask = step.task();
311313
if (returnStdout) {
@@ -356,12 +358,13 @@ static final class Execution extends AbstractStepExecutionImpl implements Runnab
356358
LOGGER.fine(() -> "discovered that " + node + " has been removed");
357359
removedNodeDiscovered = System.nanoTime();
358360
return null;
359-
} else if (System.nanoTime() - removedNodeDiscovered < TimeUnit.MILLISECONDS.toNanos(ExecutorPickle.TIMEOUT_WAITING_FOR_NODE_MILLIS)) {
361+
} else if (System.nanoTime() - removedNodeDiscovered < TimeUnit.MILLISECONDS.toNanos(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS)) {
360362
LOGGER.fine(() -> "rediscovering that " + node + " has been removed");
361363
return null;
362364
} else {
363-
LOGGER.fine(() -> node + " has been removed for a while, assuming it is not coming back");
364-
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
365+
LOGGER.fine(() -> "rediscovering that " + node + " has been removed and timeout has expired");
366+
listener().getLogger().println(node + " has been removed for " + Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS) + ", assuming it is not coming back");
367+
throw new FlowInterruptedException(Result.ABORTED, /* TODO false probably more appropriate */true, new ExecutorStepExecution.RemovedNodeCause());
365368
}
366369
}
367370
removedNodeDiscovered = 0; // something else; reset

src/main/java/org/jenkinsci/plugins/workflow/support/pickles/ComputerPickle.java

+3-1
Original file line numberDiff line numberDiff line change
@@ -31,12 +31,14 @@
3131
import hudson.model.Computer;
3232
import jenkins.model.Jenkins;
3333
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
34+
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
3435

3536
/**
3637
* Reference to {@link Computer}
3738
*
38-
* @author Kohsuke Kawaguchi
39+
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
3940
*/
41+
@Deprecated
4042
public class ComputerPickle extends Pickle {
4143
private final String slave;
4244

src/main/java/org/jenkinsci/plugins/workflow/support/pickles/ExecutorPickle.java

+6-13
Original file line numberDiff line numberDiff line change
@@ -26,9 +26,7 @@
2626

2727
import com.google.common.util.concurrent.ListenableFuture;
2828
import edu.umd.cs.findbugs.annotations.NonNull;
29-
import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;
3029
import hudson.Extension;
31-
import hudson.Main;
3230
import hudson.Util;
3331
import hudson.init.InitMilestone;
3432
import hudson.model.Executor;
@@ -41,35 +39,30 @@
4139
import hudson.model.queue.SubTask;
4240

4341
import java.util.concurrent.Future;
44-
import java.util.concurrent.TimeUnit;
4542
import java.util.logging.Level;
4643
import java.util.logging.Logger;
4744

4845
import jenkins.model.Jenkins;
4946
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
5047
import org.jenkinsci.plugins.workflow.pickles.Pickle;
5148
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException;
52-
import org.jenkinsci.plugins.workflow.steps.durable_task.Messages;
49+
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
5350
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
54-
import org.kohsuke.accmod.Restricted;
55-
import org.kohsuke.accmod.restrictions.NoExternalUse;
5651

5752
/**
5853
* Persists an {@link Executor} as the {@link hudson.model.Queue.Task} it was running.
5954
* That task can in turn have some way of producing a display name, a special {@link hudson.model.Queue.Executable} with a custom {@code executorCell.jelly}, and so on.
6055
* When rehydrated, the task is rescheduled, and when it starts executing the owning executor is produced.
6156
* Typically the {@link SubTask#getAssignedLabel} should be a {@link Node#getSelfLabel} so that the rehydrated executor is in fact on the same node.
57+
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
6258
*/
59+
@Deprecated
6360
public class ExecutorPickle extends Pickle {
6461

6562
private static final Logger LOGGER = Logger.getLogger(ExecutorPickle.class.getName());
6663

6764
private final Queue.Task task;
6865

69-
@SuppressFBWarnings(value = "MS_SHOULD_BE_FINAL", justification = "deliberately mutable")
70-
@Restricted(NoExternalUse.class)
71-
public static long TIMEOUT_WAITING_FOR_NODE_MILLIS = Main.isUnitTest ? /* fail faster */ TimeUnit.SECONDS.toMillis(15) : Long.getLong(ExecutorPickle.class.getName()+".timeoutForNodeMillis", TimeUnit.MINUTES.toMillis(5));
72-
7366
private ExecutorPickle(Executor executor) {
7467
if (executor instanceof OneOffExecutor) {
7568
throw new IllegalArgumentException("OneOffExecutor not currently supported");
@@ -105,7 +98,7 @@ protected Executor tryResolve() throws Exception {
10598
throw new IllegalStateException("queue refused " + task);
10699
}
107100
itemID = item.getId();
108-
endTimeNanos = System.nanoTime() + TIMEOUT_WAITING_FOR_NODE_MILLIS*1000000;
101+
endTimeNanos = System.nanoTime() + ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS*1000000;
109102
LOGGER.log(Level.FINE, "{0} scheduled {1}", new Object[] {ExecutorPickle.this, item});
110103
} else {
111104
item = Queue.getInstance().getItem(itemID);
@@ -129,7 +122,7 @@ protected Executor tryResolve() throws Exception {
129122
if (System.nanoTime() > endTimeNanos) {
130123
Queue.getInstance().cancel(item);
131124
owner.getListener().getLogger().printf("Killed %s after waiting for %s because we assume unknown agent %s is never going to appear%n",
132-
item.task.getDisplayName(), Util.getTimeSpanString(TIMEOUT_WAITING_FOR_NODE_MILLIS), placeholder.getAssignedLabel());
125+
item.task.getDisplayName(), Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS), placeholder.getAssignedLabel());
133126
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
134127
}
135128
}
@@ -159,7 +152,7 @@ protected Executor tryResolve() throws Exception {
159152
}
160153
@Override protected void printWaitingMessage(@NonNull TaskListener listener) {
161154
Queue.Item item = Queue.getInstance().getItem(itemID);
162-
String message = Messages.ExecutorPickle_waiting_to_resume(task.getFullDisplayName());
155+
String message = "Waiting to resume " + task.getFullDisplayName();
163156
if (item == null) { // ???
164157
listener.getLogger().println(message);
165158
return;

src/main/java/org/jenkinsci/plugins/workflow/support/pickles/FilePathPickle.java

+4
Original file line numberDiff line numberDiff line change
@@ -31,10 +31,14 @@
3131
import org.jenkinsci.plugins.workflow.FilePathUtils;
3232
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
3333
import org.jenkinsci.plugins.workflow.pickles.Pickle;
34+
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
35+
import org.jenkinsci.plugins.workflow.support.steps.FilePathDynamicContext;
3436

3537
/**
3638
* @author Kohsuke Kawaguchi
39+
* @deprecated Normally now done via {@link ExecutorStepDynamicContext} or {@link FilePathDynamicContext}.
3740
*/
41+
@Deprecated
3842
public class FilePathPickle extends Pickle {
3943

4044
private final String slave;

src/main/java/org/jenkinsci/plugins/workflow/support/pickles/WorkspaceListLeasePickle.java

+5
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,12 @@
3535
import org.jenkinsci.plugins.workflow.FilePathUtils;
3636
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
3737
import org.jenkinsci.plugins.workflow.pickles.Pickle;
38+
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
3839

40+
/**
41+
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
42+
*/
43+
@Deprecated
3944
public class WorkspaceListLeasePickle extends Pickle {
4045

4146
// Could perhaps just store the FilePath directly (thus using FilePathPickle implicitly), but we need the Computer anyway for its WorkspaceList:

0 commit comments

Comments
 (0)