-
Notifications
You must be signed in to change notification settings - Fork 218
feat: classify run failure error codes and improve error logging #1340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| --- | ||
| "@workflow/errors": patch | ||
| "@workflow/core": patch | ||
| "@workflow/web": patch | ||
| "@workflow/world-local": patch | ||
| "@workflow/world-vercel": patch | ||
| --- | ||
|
|
||
| Add error code classification (`USER_ERROR`, `RUNTIME_ERROR`) to `run_failed` events, improve queue and schema validation error logging |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| import { | ||
| RUN_ERROR_CODES, | ||
| WorkflowAPIError, | ||
| WorkflowRuntimeError, | ||
| } from '@workflow/errors'; | ||
| import { describe, expect, it } from 'vitest'; | ||
| import { classifyRunError } from './classify-error.js'; | ||
|
|
||
| describe('classifyRunError', () => { | ||
| it('classifies WorkflowRuntimeError as RUNTIME_ERROR', () => { | ||
| expect( | ||
| classifyRunError(new WorkflowRuntimeError('corrupted event log')) | ||
| ).toBe(RUN_ERROR_CODES.RUNTIME_ERROR); | ||
| }); | ||
|
|
||
| it('classifies plain Error as USER_ERROR', () => { | ||
| expect(classifyRunError(new Error('user code broke'))).toBe( | ||
| RUN_ERROR_CODES.USER_ERROR | ||
| ); | ||
| }); | ||
|
|
||
| it('classifies TypeError as USER_ERROR', () => { | ||
| expect(classifyRunError(new TypeError('cannot read property'))).toBe( | ||
| RUN_ERROR_CODES.USER_ERROR | ||
| ); | ||
| }); | ||
|
|
||
| it('classifies WorkflowAPIError as USER_ERROR (from user code fetch)', () => { | ||
| expect( | ||
| classifyRunError( | ||
| new WorkflowAPIError('Internal Server Error', { status: 500 }) | ||
| ) | ||
| ).toBe(RUN_ERROR_CODES.USER_ERROR); | ||
| }); | ||
|
|
||
| it('classifies string throw as USER_ERROR', () => { | ||
| expect(classifyRunError('string error')).toBe(RUN_ERROR_CODES.USER_ERROR); | ||
| }); | ||
|
|
||
| it('classifies null throw as USER_ERROR', () => { | ||
| expect(classifyRunError(null)).toBe(RUN_ERROR_CODES.USER_ERROR); | ||
| }); | ||
|
|
||
| it('classifies undefined throw as USER_ERROR', () => { | ||
| expect(classifyRunError(undefined)).toBe(RUN_ERROR_CODES.USER_ERROR); | ||
| }); | ||
| }); |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| import { | ||
| type RunErrorCode, | ||
| RUN_ERROR_CODES, | ||
| WorkflowRuntimeError, | ||
| } from '@workflow/errors'; | ||
|
|
||
| /** | ||
| * Classify an error that caused a workflow run to fail. | ||
| * | ||
| * After the structural separation of infrastructure vs user code error | ||
| * handling, the only errors that reach the `run_failed` try/catch are: | ||
| * - User code errors (throws from workflow functions, propagated step failures) | ||
| * - WorkflowRuntimeError (corrupted event log, missing timestamps, etc.) | ||
| */ | ||
| export function classifyRunError(err: unknown): RunErrorCode { | ||
| if (WorkflowRuntimeError.is(err)) { | ||
| return RUN_ERROR_CODES.RUNTIME_ERROR; | ||
| } | ||
| return RUN_ERROR_CODES.USER_ERROR; | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,9 @@ | ||
| import { WorkflowAPIError, WorkflowRuntimeError } from '@workflow/errors'; | ||
| import { | ||
| RUN_ERROR_CODES, | ||
| WorkflowAPIError, | ||
| WorkflowRuntimeError, | ||
| } from '@workflow/errors'; | ||
| import { classifyRunError } from './classify-error.js'; | ||
| import { parseWorkflowName } from '@workflow/utils/parse-name'; | ||
| import { | ||
| type Event, | ||
|
|
@@ -195,6 +200,7 @@ export function workflowEntrypoint( | |
| message: err.message, | ||
| stack: err.stack, | ||
| }, | ||
| errorCode: RUN_ERROR_CODES.RUNTIME_ERROR, | ||
| }, | ||
| }, | ||
| { requestId } | ||
|
|
@@ -368,8 +374,14 @@ export function workflowEntrypoint( | |
| ); | ||
| } | ||
|
|
||
| // Classify the error: WorkflowRuntimeError indicates an | ||
| // internal issue (corrupted event log, missing data); | ||
| // everything else is a user code error. | ||
| const errorCode = classifyRunError(err); | ||
|
|
||
| runtimeLogger.error('Error while running workflow', { | ||
| workflowRunId: runId, | ||
| errorCode, | ||
| errorName, | ||
| errorStack, | ||
| }); | ||
|
|
@@ -386,7 +398,7 @@ export function workflowEntrypoint( | |
| message: errorMessage, | ||
| stack: errorStack, | ||
| }, | ||
| // TODO: include error codes when we define them | ||
| errorCode, | ||
| }, | ||
| }, | ||
|
Comment on lines
377
to
403
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed — the setup failure path now includes |
||
| { requestId } | ||
|
|
@@ -404,6 +416,7 @@ export function workflowEntrypoint( | |
| } | ||
| ); | ||
| span?.setAttributes({ | ||
| ...Attribute.WorkflowErrorCode(errorCode), | ||
| ...Attribute.WorkflowErrorName(errorName), | ||
| ...Attribute.WorkflowErrorMessage(errorMessage), | ||
| ...Attribute.ErrorType(errorName), | ||
|
|
@@ -416,6 +429,7 @@ export function workflowEntrypoint( | |
|
|
||
| span?.setAttributes({ | ||
| ...Attribute.WorkflowRunStatus('failed'), | ||
| ...Attribute.WorkflowErrorCode(errorCode), | ||
| ...Attribute.WorkflowErrorName(errorName), | ||
| ...Attribute.WorkflowErrorMessage(errorMessage), | ||
| ...Attribute.ErrorType(errorName), | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| /** | ||
| * Error codes for classifying run failures. | ||
| * These are populated in the `errorCode` field of `run_failed` events | ||
| * and flow through to `StructuredError.code` on the run entity. | ||
| */ | ||
| export const RUN_ERROR_CODES = { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If these are also user-facing, should add to docs
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good call — will add docs in a follow-up. These are user-facing via
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added docs in #1445 — documents error codes in the errors & retries guide. |
||
| /** Error thrown in user workflow or step code */ | ||
| USER_ERROR: 'USER_ERROR', | ||
| /** Internal runtime error (corrupted event log, missing timestamps) */ | ||
| RUNTIME_ERROR: 'RUNTIME_ERROR', | ||
| } as const; | ||
|
|
||
| export type RunErrorCode = | ||
| (typeof RUN_ERROR_CODES)[keyof typeof RUN_ERROR_CODES]; | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add one e2e test that actually causes + asserts a
RUNTIME_ERRORThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Triggering
RUNTIME_ERRORin e2e requires corrupting the event log (e.g., injecting an unexpected event type for a step). This is fragile and environment-dependent. We have unit test coverage forWorkflowRuntimeErrorclassification inclassify-error.test.tsand for the error itself instep.test.ts. Could add an e2e in a follow-up with a dedicated fault injection mechanism.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
human: I tried this but not possible without chaos testing or a bigger change out of scope of this PR. tl;dr we can't easily inject failures into runs (we do on steps by injecting 500s into the world) but for workflow/run code - we would either need to expose things into the VM to allow it to inject that (i.e. changing runtime code just for a fault injection test) - or another ideas is we need to use a proxy in front of workflow-server and queue to inject these failures
at that point I'm thinking we just do this in the chaos testing @TooTallNate is setting up and we can have validation that it's working once that's up