[Bug] Workflow lifecycle logging is misleading #1537

Open
@mjameswh

Describe what you are trying to do

(From a user) We are trying to get logs that are good enough to alert on the following use cases:

  • A Workflow Task fails more than 3 times (possibly because of an implementation issue)
  • A Workflow fails (because of an ApplicationFailure, ActivityFailure, etc.)

Describe the bug

  • On Workflow Task failure, the lifecycle logger prints a message indicating that the Workflow failed. That is exactly the same message as on an actual Workflow failure, making it impossible to differentiate the two cases.

  • Similarly, one may see “Workflow started” printed multiple times for the same Workflow Execution, i.e. every time a Worker needs to rebuild (aka “replay”) the runtime state of that Workflow Execution from the very beginning.
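
To make the requested distinction concrete, here is a minimal standalone sketch of how a log line could separate the two failure scopes. It is a model only, not SDK code: `ThrownError`, `classifyThrown`, `failureLogLine`, `shouldAlertOnTaskFailures`, and the message wording are all hypothetical, and the set of error names is an assumption containing only the two names mentioned above.

```typescript
// Hypothetical model for illustration; none of these names come from the SDK.
interface ThrownError {
  name: string;
  message: string;
}

type FailureScope = "workflowExecution" | "workflowTask";

// Assumption: error names that fail the whole Workflow Execution.
// Only the two names mentioned in this issue are listed; a real list would be longer.
const EXECUTION_FAILURE_NAMES: ReadonlySet<string> = new Set([
  "ApplicationFailure",
  "ActivityFailure",
]);

// Anything not recognized as an execution-level failure is treated as a
// Workflow Task failure (e.g. a bug in the Workflow implementation).
function classifyThrown(err: ThrownError): FailureScope {
  return EXECUTION_FAILURE_NAMES.has(err.name) ? "workflowExecution" : "workflowTask";
}

// Produce a distinct message per scope so alerts can match on the prefix.
function failureLogLine(workflowId: string, err: ThrownError): string {
  return classifyThrown(err) === "workflowExecution"
    ? `Workflow Execution failed: ${err.message} (workflowId=${workflowId})`
    : `Workflow Task failed: ${err.message} (workflowId=${workflowId})`;
}

// First use case above: alert once a Workflow Task has failed more than 3 times.
function shouldAlertOnTaskFailures(consecutiveFailures: number, threshold = 3): boolean {
  return consecutiveFailures > threshold;
}
```

With messages like these, an alert rule could match on the `Workflow Task failed` prefix and count occurrences per `workflowId`, while treating `Workflow Execution failed` as a separate signal.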

Additional context

  • The root cause is that this lifecycle handler logs things from the perspective of “the Cached Workflow Instance” (i.e. the specific instance of that Workflow Execution in the cache of that specific Workflow Worker), rather than from the perspective of the actual Workflow Execution’s lifecycle.

  • We need to think of a more precise way of formulating those messages. For various reasons, no mention of “Workflow” or “Workflow Task” (starting, failing, completing…) would be 100% reliable at that precise place. For example, Workflow code may attempt to “Complete Workflow”, but the completion command times out or gets rejected by the server because of new incoming events, so what appears to be “Workflow completed” actually ends up being a Workflow Task Failure or Timeout.

  • Community Slack conversation: https://temporalio.slack.com/archives/C01DKSMU94L/p1727436127246899
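
One possible direction, sketched below under stated assumptions, is to word each message strictly from the cached instance's point of view and avoid claiming an outcome that only the server can confirm. The `InstanceEvent` type, the `describeInstanceEvent` function, and every message string are hypothetical illustrations, not existing SDK names or a proposed final wording.

```typescript
// Hypothetical events as seen by a cached Workflow instance on one Worker.
type InstanceEvent =
  | { kind: "instanceCreated"; replaying: boolean }
  | { kind: "completionCommandEmitted" }
  | { kind: "activationFailed"; reason: string };

// Describe what the instance observed, without asserting the final state
// of the Workflow Execution, which only the server knows.
function describeInstanceEvent(workflowId: string, event: InstanceEvent): string {
  const suffix = `(workflowId=${workflowId})`;
  switch (event.kind) {
    case "instanceCreated":
      return event.replaying
        ? `Workflow instance rebuilt in worker cache by replaying history ${suffix}`
        : `Workflow instance created in worker cache ${suffix}`;
    case "completionCommandEmitted":
      // The command may still time out or be rejected by the server.
      return `Workflow code returned; completion command emitted, not yet accepted by the server ${suffix}`;
    case "activationFailed":
      return `Workflow instance failed while processing a Workflow Task: ${event.reason} ${suffix}`;
  }
}
```

Worded this way, a rebuilt instance no longer reads as a second “Workflow started”, and a completion attempt is not reported as a confirmed “Workflow completed”.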
