[DO_NOT_MERGE] Dispatch activity cancellation to worker using Nexus#9233
[DO_NOT_MERGE] Dispatch activity cancellation to worker using Nexus#9233rkannan82 wants to merge 1 commit intokannan/activity-cancel/task-definitionfrom
Conversation
a91654f to
6a655f7
Compare
d402868 to
54b6d1a
Compare
13d8513 to
eb79c81
Compare
d103589 to
417606c
Compare
eb79c81 to
d1f2abf
Compare
417606c to
6246c4e
Compare
d1f2abf to
7489808
Compare
6246c4e to
fec3c41
Compare
37a51d2 to
11b1049
Compare
fec3c41 to
1dd975d
Compare
45ac313 to
e96dbe8
Compare
11c74b6 to
2cb3108
Compare
72fe398 to
37c2a1c
Compare
ac7c57e to
89d2cf5
Compare
5a9c85e to
d1572d7
Compare
bergundy
left a comment
There was a problem hiding this comment.
Can you confirm whether you're covering cancel requests and pause requests?
Do we also care about canceling activities that we know timed out or are we letting the worker take care of that?
When will you be adding support for standalone activities too?
service/history/cancel_activity.go
Outdated
| return nil | ||
| } | ||
|
|
||
| return t.dispatchCancelTaskToWorker(ctx, task.NamespaceID, task.WorkerControlTaskQueue, taskTokens) |
There was a problem hiding this comment.
You definitely want to release the lock before making this RPC.
There was a problem hiding this comment.
Why store the worker control task queue on the task? That seems redundant if you're already reading from mutable state, which IMHO should serve as the source of truth.
service/history/cancel_activity.go
Outdated
| Operation: nexus.CancelActivitiesOperation, | ||
| Payload: &commonpb.Payload{ | ||
| Metadata: map[string][]byte{ | ||
| "encoding": []byte("proto"), |
There was a problem hiding this comment.
Use the SDK's data converter here instead of inventing your own proto encoding scheme please.
common/nexus/constants.go
Outdated
| const ( | ||
| // WorkerControlService is the Nexus service name for worker control operations. | ||
| WorkerControlService = "temporal.api.worker.v1.WorkerService" | ||
| // CancelActivitiesOperation is the operation to cancel running activities. | ||
| CancelActivitiesOperation = "cancel-activities" | ||
| ) |
There was a problem hiding this comment.
These aren't nexus properties IMHO. It's part of the worker service, not generic for Nexus.
service/history/cancel_activity.go
Outdated
|
|
||
| // buildActivityTaskTokens builds task tokens for activities that need cancellation. | ||
| // Returns the serialized task tokens and any error. | ||
| func buildActivityTaskTokens( |
There was a problem hiding this comment.
Share this function everywhere where you construct task tokens instead of reimplementing this.
service/history/cancel_activity.go
Outdated
| return err | ||
| } | ||
|
|
||
| if resp.GetRequestTimeout() != nil { |
There was a problem hiding this comment.
What about async outcomes and operation failures?
service/history/cancel_activity.go
Outdated
| "google.golang.org/protobuf/proto" | ||
| ) | ||
|
|
||
| func (t *transferQueueActiveTaskExecutor) processCancelActivityTask( |
There was a problem hiding this comment.
I would put this on the outbound queue since it sync matches into workers and would have different latency characteristics than typical internal tasks.
There was a problem hiding this comment.
That will an exception re. how we divided queues IMO.
Not sure I follow the concern re. fail to sync match, if it fails the task can retry using a user error, which can be handled in the same way as namespace rate limit error today. I think this task is best effort and will be discarded after 5mins.
Maybe I missed something on the nexus sync match dispatch flow?
There was a problem hiding this comment.
IMHO this is still outbound because it blocks on a response from a worker. We would need to allow multiple seconds of timeout for this round trip.
There was a problem hiding this comment.
ah...we wait for the response! that's the part I missed. I thought DispatchNexusTask only do the dispatch part.
yeah agree if we need to allow multiple seconds for a task, we can't use transfer queue. cc @rkannan82 sorry I missed this when reviewing the design and pointed you in the wrong direction.
service/history/cancel_activity.go
Outdated
| return nil | ||
| } | ||
|
|
||
| ctx, cancel := context.WithTimeout(ctx, taskTimeout) |
There was a problem hiding this comment.
What's taskTimeout? You should be deliberate about setting a reasonable timeout for the whole dispatch to the worker. That's different from dispatching tasks internally within the cluster.
89d2cf5 to
79e4e3e
Compare
9ee5998 to
423bb54
Compare
79e4e3e to
7910d4e
Compare
5b2a3b9 to
a35281a
Compare
5b615d9 to
9231795
Compare
a35281a to
3d1b6bb
Compare
9231795 to
bc58c4f
Compare
3d1b6bb to
c868ae3
Compare
bc58c4f to
3eb028c
Compare
c868ae3 to
4d2f4d5
Compare
3eb028c to
92b1a81
Compare
4d2f4d5 to
597d390
Compare
92b1a81 to
b775ca6
Compare
34a1f8a to
d8f33bf
Compare
b775ca6 to
d6af322
Compare
- Add activityCommandTaskDispatcher to dispatch tasks to workers - Use typed Nexus service definition for worker communication - Add proper error handling for DispatchNexusTask response - Release workflow lock before making RPC call - Add functional test for dispatch flow
d8f33bf to
3327132
Compare
What changed?
Sends a Nexus task to the control queue on which the worker is listening for cancellation.
Main changes:
Called by transfer_queue_active_task_executor.go.
Controlled by dynamic config: EnableActivityCancellationNexusTask
Why?
To support activity cancellation without activity heartbeat.
Overall flow:
How did you test it?
Potential risks
No. The task dispatch is disabled by default.