
feat: [SVLS-9168] add aws.durable.operation_attempt tag to durable operations span #18191

Open
lym953 wants to merge 5 commits into main from yiming.luo/durable-retry-attempt

Conversation


@lym953 lym953 commented May 20, 2026

Description

Add aws.durable.operation_attempt to aws.durable.step and aws.durable.wait_for_condition spans. The value is the 1-indexed attempt number, matching the AWS UI's own attempt-count convention:

  • 1 — original attempt
  • 2 — first retry (second attempt)
  • 3 — second retry (third attempt)
  • etc.

The value is sourced from the Durable Execution SDK's StepDetails.attempt field in the operation checkpoint, which the AWS Lambda Durable service reports 1-indexed. When no checkpoint exists yet (the very first execution, before the START checkpoint), the tag defaults to 1.
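For reviewers, the resolution rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in patch.py: the `StepDetails` and `Operation` dataclasses are hypothetical stand-ins that model only the fields named in this description, and `resolve_operation_attempt` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the SDK's checkpoint types; only the fields
# mentioned in this description are modelled.
@dataclass
class StepDetails:
    attempt: int


@dataclass
class Operation:
    step_details: Optional[StepDetails] = None


def resolve_operation_attempt(operation: Optional[Operation]) -> int:
    """Return the 1-indexed attempt number to put on the span."""
    # No checkpoint yet (first execution, before START): original attempt.
    if operation is None or operation.step_details is None:
        return 1
    # Pass the reported value through, but never go below 1.
    return max(1, operation.step_details.attempt)
```

Under these assumptions, a missing checkpoint and a reported attempt of 1 both yield 1, and a reported attempt of 3 yields 3.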

This tag will be used by the UI to display the attempt count and to group attempts for the same operation.

Testing

Installed the tracer on a durable function and invoked it. The tag shows up on the aws.durable.step span. (link)

Why only step and wait_for_condition spans?

The SDK has six OperationExecutor subclasses; only two use the StepDetails.attempt retry mechanism:

| Executor | Detail field on checkpoint | Has retry? |
| --- | --- | --- |
| StepOperationExecutor | step_details | yes |
| WaitForConditionOperationExecutor | step_details (polling iterations) | yes |
| CallbackOperationExecutor | callback_details | no |
| InvokeOperationExecutor | chained_invoke_details | no |
| ChildOperationExecutor | | no |
| WaitOperationExecutor | wait_details | no |

wait_for_callback isn't its own executor — it's a helper that internally calls create_callback + step, so any retries appear on the inner aws.durable.step child span (already covered). map and parallel also have no executor; their work is decomposed into MAP_ITERATION / PARALLEL_BRANCH child operations, which _is_top_level_for_span filters out anyway.
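The gating this implies could look something like the sketch below. The review snippet later in this PR references `_RETRYABLE_OPERATIONS`, but its members are not shown there, so the string values here are assumptions derived from the two span names, and `should_tag_attempt` is a hypothetical helper.

```python
# Assumed members: the PR does not show the actual contents of this set.
_RETRYABLE_OPERATIONS = frozenset({"step", "wait_for_condition"})


def should_tag_attempt(operation_type: str) -> bool:
    """Only operations backed by step_details carry an attempt number."""
    return operation_type in _RETRYABLE_OPERATIONS
```

A frozenset keeps the membership check constant-time and makes the allow-list immutable, so other operation types (callback, invoke, child, wait) simply never receive the tag.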

Note on test snapshots vs. production

The SDK's aws_durable_execution_sdk_python_testing framework reports step_details.attempt 0-indexed (matching the SDK's documented semantics of "completed prior attempts"), while the production AWS Lambda Durable service reports it 1-indexed (matching the AWS UI). The code passes the value through unchanged, with max(1, …) guarding the case where the test framework yields 0. As a result, test snapshots show 1 for both the original attempt and the first retry: they verify that the tag is present, not that attempt numbers are distinguished. See the AIDEV-NOTE in patch.py. Production traces do distinguish attempts (verified end-to-end on a deployed Lambda).
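To make the snapshot behaviour concrete, here is a small standalone sketch of the clamp applied to both sources. The raw values are the ones described in this note (0 and 1 from the testing framework, 1 and 2 from production for an original attempt followed by one retry); `tag_value` is an illustrative name, not a function in patch.py.

```python
def tag_value(raw_attempt: int) -> int:
    """Apply the same clamp the description mentions: max(1, raw)."""
    return max(1, raw_attempt)


# Original attempt followed by one retry, as reported by each source.
test_framework_raw = [0, 1]  # 0-indexed: completed prior attempts
production_raw = [1, 2]      # 1-indexed: matches the AWS UI

test_tags = [tag_value(a) for a in test_framework_raw]
prod_tags = [tag_value(a) for a in production_raw]

print(test_tags)  # [1, 1] -> both attempts look identical in snapshots
print(prod_tags)  # [1, 2] -> attempts stay distinguishable in production
```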

Add aws.durable.operation_retry_attempt to aws.durable.step and
aws.durable.wait_for_condition spans, sourced from the SDK's
StepDetails.attempt checkpoint field. 0 = original attempt, N = Nth
retry. Set as a numeric metric so it supports range queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cit-pr-commenter-54b7da Bot commented May 20, 2026

Codeowners resolved as

ddtrace/contrib/internal/aws_durable_execution_sdk_python/patch.py      @DataDog/apm-core-python @DataDog/apm-idm-python
releasenotes/notes/aws-durable-operation-retry-attempt-tag-553311e75641ec9d.yaml  @DataDog/apm-python
tests/snapshots/tests.contrib.aws_durable_execution_sdk_python.test_aws_durable_execution_sdk_python.test_parallel_propagates_trace_context.json  @DataDog/apm-python
tests/snapshots/tests.contrib.aws_durable_execution_sdk_python.test_aws_durable_execution_sdk_python.test_step_with_retry.json  @DataDog/apm-python
tests/snapshots/tests.contrib.aws_durable_execution_sdk_python.test_aws_durable_execution_sdk_python.test_workflow_failed_status.json  @DataDog/apm-python


datadog-datadog-prod-us1 Bot commented May 20, 2026


⚠️ Warnings

🚦 1 Pipeline job failed

pr-name | pr_name_lint (GitHub Actions)

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. PR title linting failed. Errors: subject may not be empty, type may not be empty.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🔗 Commit SHA: c363d36


pr-commenter Bot commented May 20, 2026

Benchmarks

Benchmark execution time: 2026-05-20 21:44:25

Comparing candidate commit c363d36 in PR branch yiming.luo/durable-retry-attempt with baseline commit 0f6f41d in branch main.

Found 0 performance improvements and 4 performance regressions! Performance is the same for 372 metrics, 9 unstable metrics.

scenario:httppropagationinject-ids_only

  • 🟥 execution_time [+1.564µs; +1.731µs] or [+7.714%; +8.535%]

scenario:span-start

  • 🟥 execution_time [+1.467ms; +1.587ms] or [+9.456%; +10.229%]

scenario:telemetryaddmetric-1-count-metric-1-times

  • 🟥 execution_time [+254.276ns; +287.182ns] or [+12.270%; +13.858%]

scenario:tracer-small

  • 🟥 execution_time [+26.199µs; +27.811µs] or [+7.501%; +7.962%]

lym953 and others added 4 commits May 20, 2026 14:36
…try_attempt

Match the style of the other span-attribute setters in the same
subscriber. The wire result is identical — _set_attribute dispatches by
value type, so an int still lands in the metrics dict as a float.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-indexed in prod

The AWS Lambda Durable service reports `step_details.attempt` 1-indexed
(1 for the first attempt, 2 after the first retry), not 0-indexed like
the SDK's own documented semantic. Subtract 1 (clamped to 0) so the tag
emits the user-facing retry count: 0 for the original attempt, 1 for the
first retry, etc.

Updates the test_step_with_retry snapshot: the retry-success span's
value drops from 1 to 0 because the SDK testing framework already
returns step_details.attempt as the retry count directly. The test
framework and prod disagree on the semantics; the AIDEV-NOTE in
patch.py captures the discrepancy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ttempt

Rename `aws.durable.operation_retry_attempt` to `aws.durable.operation_attempt`.
Semantics remain 0-indexed: 0 = original attempt, 1 = first retry, etc.

- TAG_OPERATION_RETRY_ATTEMPT renamed to TAG_OPERATION_ATTEMPT
- AwsDurableOperationEvent field renamed accordingly
- Snapshot keys updated; values unchanged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…indexed

Match the AWS UI's attempt-count convention: 1 = original attempt,
2 = first retry, etc. Pass step_details.attempt through directly (it's
already 1-indexed in production); default to 1 when no checkpoint
exists yet, and clamp with max(1, …) to handle the SDK testing
framework's 0-indexed values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lym953 lym953 changed the title from "feat(aws_durable_execution_sdk_python): add operation_retry_attempt tag" to "[SVLS-9168] feat: add operation_attempt tag to durable operations span" May 20, 2026
@lym953 lym953 changed the title from "[SVLS-9168] feat: add operation_attempt tag to durable operations span" to "[SVLS-9168] feat: add aws.durable.operation_attempt tag to durable operations span" May 20, 2026
@lym953 lym953 marked this pull request as ready for review May 20, 2026 21:42
@lym953 lym953 requested review from a team as code owners May 20, 2026 21:42

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c363d36cfd

if isinstance(event, AwsDurableOperationEvent) and event.operation in _RETRYABLE_OPERATIONS:
    operation = checkpoint.operation
    if operation is not None and operation.step_details is not None:
        event.operation_attempt = max(1, operation.step_details.attempt)

P2: Convert retry attempt to 1-based before tagging

For retrying step/wait_for_condition operations, using max(1, operation.step_details.attempt) collapses attempts 0 and 1 to the same value (1) when the SDK reports attempts as 0-based, so the first retry is indistinguishable from the initial attempt. This directly undermines the new tag’s stated purpose of distinguishing retries (1, 2, 3, ...); in practice you can already see both attempts tagged 1 in the updated test_step_with_retry snapshot. Consider converting with attempt + 1 (or otherwise normalizing by source) instead of clamping with max.
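One way to read the "normalize by source" suggestion is sketched below. This is an illustration only: the `AttemptIndexing` enum and `normalize_attempt` function are hypothetical, and how the integration would determine which source it is talking to is left open, since nothing in this PR exposes that signal.

```python
from enum import Enum


class AttemptIndexing(Enum):
    """Hypothetical marker for how a source reports step_details.attempt."""
    ZERO_BASED = 0  # e.g. the SDK testing framework, per the PR description
    ONE_BASED = 1   # e.g. the production service, per the PR description


def normalize_attempt(raw: int, indexing: AttemptIndexing) -> int:
    """Map a raw attempt value onto the 1-indexed convention used by the tag."""
    if indexing is AttemptIndexing.ZERO_BASED:
        # Shift instead of clamping so that 0 and 1 stay distinct.
        return raw + 1
    # Already 1-indexed; the clamp only guards against unexpected values.
    return max(1, raw)
```

With this shape, a 0-based source reporting 0 then 1 would produce tags 1 then 2, and a 1-based source reporting 1 then 2 would produce the same tags. Applying `attempt + 1` unconditionally would instead shift production values to 2 and 3.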


@lym953 lym953 changed the title from "[SVLS-9168] feat: add aws.durable.operation_attempt tag to durable operations span" to "feat: [SVLS-9168] add aws.durable.operation_attempt tag to durable operations span" May 21, 2026