feat: [SVLS-9168] add aws.durable.operation_attempt tag to durable operations span by lym953 · Pull Request #18191 · DataDog/dd-trace-py

lym953 · 2026-05-20T17:19:13Z

Description

Add aws.durable.operation_attempt to aws.durable.step and aws.durable.wait_for_condition spans. The value is the 1-indexed attempt number, matching the AWS UI's own attempt-count convention:

1 — original attempt
2 — first retry (second attempt)
3 — second retry (third attempt)
etc.

The value is sourced from the Durable Execution SDK's StepDetails.attempt field in the operation checkpoint (which the AWS Lambda Durable service reports 1-indexed). When no checkpoint exists yet (the very first execution, before the START checkpoint), the tag defaults to 1.
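The sourcing rule above can be sketched as follows. This is a minimal illustration rather than the code in patch.py: the `StepDetails` and `Operation` dataclasses and the `resolve_operation_attempt` helper are hypothetical stand-ins modeled on the names used in this description, and only the `attempt` field is represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepDetails:
    """Hypothetical stand-in for the SDK's StepDetails; only `attempt` is modeled."""

    attempt: int


@dataclass
class Operation:
    """Hypothetical stand-in for a checkpointed operation."""

    step_details: Optional[StepDetails] = None


def resolve_operation_attempt(operation: Optional[Operation]) -> int:
    """Return the 1-indexed attempt number to put on the span."""
    # No checkpoint yet (first execution, before START): original attempt.
    if operation is None or operation.step_details is None:
        return 1
    # Pass the reported value through, but never go below 1.
    return max(1, operation.step_details.attempt)
```

Under these assumptions, a missing checkpoint and a reported value of 0 both resolve to 1, while any reported value of 1 or more is emitted unchanged.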

This tag will be used by the UI to display the attempt count and to group attempts for the same operation.

Testing

Installed the tracer on a durable function and invoked it. The tag shows up on the aws.durable.step span. (link)

Why only `step` and `wait_for_condition` spans?

The SDK has six OperationExecutor subclasses; only two use the StepDetails.attempt retry mechanism:

| Executor | Detail field on checkpoint | Has retry? |
| --- | --- | --- |
| `StepOperationExecutor` | `step_details` | yes |
| `WaitForConditionOperationExecutor` | `step_details` (polling iterations) | yes |
| `CallbackOperationExecutor` | `callback_details` | no |
| `InvokeOperationExecutor` | `chained_invoke_details` | no |
| `ChildOperationExecutor` | none | no |
| `WaitOperationExecutor` | `wait_details` | no |

wait_for_callback isn't its own executor — it's a helper that internally calls create_callback + step, so any retries appear on the inner aws.durable.step child span (already covered). map and parallel also have no executor; their work is decomposed into MAP_ITERATION / PARALLEL_BRANCH child operations, which _is_top_level_for_span filters out anyway.
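The executor breakdown above can be expressed as a small lookup. This is only a sketch: the `EXECUTORS` mapping and the `carries_attempt_tag` helper are hypothetical names introduced for illustration, with values transcribed from the table in this description. The actual integration filters on operation type (see `_RETRYABLE_OPERATIONS` in the diff), not on executor class names.

```python
from typing import Dict, Optional, Tuple

# executor name -> (detail field on checkpoint, uses StepDetails.attempt retries),
# transcribed from the table in the PR description.
EXECUTORS: Dict[str, Tuple[Optional[str], bool]] = {
    "StepOperationExecutor": ("step_details", True),
    "WaitForConditionOperationExecutor": ("step_details", True),
    "CallbackOperationExecutor": ("callback_details", False),
    "InvokeOperationExecutor": ("chained_invoke_details", False),
    "ChildOperationExecutor": (None, False),
    "WaitOperationExecutor": ("wait_details", False),
}


def carries_attempt_tag(executor_name: str) -> bool:
    """True when spans produced by this executor should get the attempt tag."""
    # Unknown executors are treated as non-retryable.
    detail_field, has_retry = EXECUTORS.get(executor_name, (None, False))
    return has_retry and detail_field == "step_details"


# Only the two step_details-based executors qualify.
tagged = sorted(name for name in EXECUTORS if carries_attempt_tag(name))
```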

Note on test snapshots vs. production

The SDK's aws_durable_execution_sdk_python_testing framework reports step_details.attempt 0-indexed (matching the SDK's documented semantic for "completed prior attempts"), while the production AWS Lambda Durable service reports it 1-indexed (matching the AWS UI). The code passes the value through directly, with max(1, …) guarding the case where the test framework yields 0. As a result, test snapshots show 1 for both the original attempt and the first retry — they verify tag presence, not the distinction between attempt numbers. See the AIDEV-NOTE in patch.py. Production traces correctly distinguish (verified end-to-end on a deployed Lambda).

Add aws.durable.operation_retry_attempt to aws.durable.step and aws.durable.wait_for_condition spans, sourced from the SDK's StepDetails.attempt checkpoint field. 0 = original attempt, N = Nth retry. Set as a numeric metric so it supports range queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cit-pr-commenter-54b7da · 2026-05-20T17:19:58Z

Codeowners resolved as

ddtrace/contrib/internal/aws_durable_execution_sdk_python/patch.py      @DataDog/apm-core-python @DataDog/apm-idm-python
releasenotes/notes/aws-durable-operation-retry-attempt-tag-553311e75641ec9d.yaml  @DataDog/apm-python
tests/snapshots/tests.contrib.aws_durable_execution_sdk_python.test_aws_durable_execution_sdk_python.test_parallel_propagates_trace_context.json  @DataDog/apm-python
tests/snapshots/tests.contrib.aws_durable_execution_sdk_python.test_aws_durable_execution_sdk_python.test_step_with_retry.json  @DataDog/apm-python
tests/snapshots/tests.contrib.aws_durable_execution_sdk_python.test_aws_durable_execution_sdk_python.test_workflow_failed_status.json  @DataDog/apm-python

datadog-datadog-prod-us1 · 2026-05-20T17:20:41Z

Tests


⚠️ Warnings

🚦 1 Pipeline job failed

pr-name | pr_name_lint

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration.
PR title linting failed. Errors: subject may not be empty, type may not be empty.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🔗 Commit SHA: c363d36

pr-commenter · 2026-05-20T17:48:12Z

Benchmarks

Benchmark execution time: 2026-05-20 21:44:25

Comparing candidate commit c363d36 in PR branch yiming.luo/durable-retry-attempt with baseline commit 0f6f41d in branch main.

Found 0 performance improvements and 4 performance regressions! Performance is the same for 372 metrics, 9 unstable metrics.

scenario:httppropagationinject-ids_only

🟥 execution_time [+1.564µs; +1.731µs] or [+7.714%; +8.535%]

scenario:span-start

🟥 execution_time [+1.467ms; +1.587ms] or [+9.456%; +10.229%]

scenario:telemetryaddmetric-1-count-metric-1-times

🟥 execution_time [+254.276ns; +287.182ns] or [+12.270%; +13.858%]

scenario:tracer-small

🟥 execution_time [+26.199µs; +27.811µs] or [+7.501%; +7.962%]

…try_attempt Match the style of the other span-attribute setters in the same subscriber. The wire result is identical — _set_attribute dispatches by value type, so an int still lands in the metrics dict as a float. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-indexed in prod The AWS Lambda Durable service reports `step_details.attempt` 1-indexed (1 for the first attempt, 2 after the first retry), not 0-indexed like the SDK's own documented semantic. Subtract 1 (clamped to 0) so the tag emits the user-facing retry count: 0 for the original attempt, 1 for the first retry, etc. Updates the test_step_with_retry snapshot: the retry-success span's value drops from 1 to 0 because the SDK testing framework already returns step_details.attempt as the retry count directly. The test framework and prod disagree on the semantics; the AIDEV-NOTE in patch.py captures the discrepancy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ttempt Rename `aws.durable.operation_retry_attempt` to `aws.durable.operation_attempt`. Semantics remain 0-indexed: 0 = original attempt, 1 = first retry, etc. - TAG_OPERATION_RETRY_ATTEMPT renamed to TAG_OPERATION_ATTEMPT - AwsDurableOperationEvent field renamed accordingly - Snapshot keys updated; values unchanged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…indexed Match the AWS UI's attempt-count convention: 1 = original attempt, 2 = first retry, etc. Pass step_details.attempt through directly (it's already 1-indexed in production); default to 1 when no checkpoint exists yet, and clamp with max(1, …) to handle the SDK testing framework's 0-indexed values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c363d36cfd


chatgpt-codex-connector · 2026-05-20T21:47:02Z

+            if isinstance(event, AwsDurableOperationEvent) and event.operation in _RETRYABLE_OPERATIONS:
+                operation = checkpoint.operation
+                if operation is not None and operation.step_details is not None:
+                    event.operation_attempt = max(1, operation.step_details.attempt)


Convert retry attempt to 1-based before tagging

For retrying step/wait_for_condition operations, using max(1, operation.step_details.attempt) collapses attempts 0 and 1 to the same value (1) when the SDK reports attempts as 0-based, so the first retry is indistinguishable from the initial attempt. This directly undermines the new tag’s stated purpose of distinguishing retries (1, 2, 3, ...); in practice you can already see both attempts tagged 1 in the updated test_step_with_retry snapshot. Consider converting with attempt + 1 (or otherwise normalizing by source) instead of clamping with max.
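The trade-off raised in this comment can be made concrete with the values the PR description gives for each source: the testing framework reports 0 and 1 for the original attempt and first retry, while production reports 1 and 2. The sketch below compares the clamp in the diff, the suggested `attempt + 1`, and a source-aware variant. The `normalize` function and its `zero_based` flag are hypothetical; nothing in this PR says how the integration would detect which source it is talking to.

```python
from typing import List, Sequence


def clamp(attempt: int) -> int:
    """Approach in the diff: pass the value through, with a floor of 1."""
    return max(1, attempt)


def plus_one(attempt: int) -> int:
    """Reviewer's suggestion: treat every reported value as 0-based."""
    return attempt + 1


def normalize(attempt: int, zero_based: bool) -> int:
    """Hypothetical source-aware variant: shift only 0-based sources."""
    return attempt + 1 if zero_based else max(1, attempt)


def apply(fn, values: Sequence[int]) -> List[int]:
    return [fn(v) for v in values]


# (original attempt, first retry) as each source reports them.
TEST_FRAMEWORK = (0, 1)  # 0-indexed, per the PR description
PRODUCTION = (1, 2)      # 1-indexed, per the PR description

clamp_test = apply(clamp, TEST_FRAMEWORK)        # both attempts collapse to 1
clamp_prod = apply(clamp, PRODUCTION)            # distinct values
plus_one_test = apply(plus_one, TEST_FRAMEWORK)  # distinct values
plus_one_prod = apply(plus_one, PRODUCTION)      # shifted one too high
norm_test = [normalize(v, zero_based=True) for v in TEST_FRAMEWORK]
norm_prod = [normalize(v, zero_based=False) for v in PRODUCTION]
```

With these inputs, the clamp is correct in production but collapses the test-framework values, `attempt + 1` fixes the test framework but over-counts in production, and only a source-aware conversion yields 1 and 2 in both cases.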


lym953 and others added 4 commits May 20, 2026 14:36

lym953 requested a review from pablomartinezbernardo May 20, 2026 21:40

lym953 changed the title ~~feat(aws_durable_execution_sdk_python): add operation_retry_attempt tag~~ [SVLS-9168] feat: add operation_attempt tag to durable operations span May 20, 2026

lym953 changed the title ~~[SVLS-9168] feat: add operation_attempt tag to durable operations span~~ [SVLS-9168] feat: add aws.durable.operation_attempt tag to durable operations span May 20, 2026

lym953 marked this pull request as ready for review May 20, 2026 21:42

lym953 requested review from a team as code owners May 20, 2026 21:42

lym953 requested review from brettlangdon, mabdinur and wantsui May 20, 2026 21:42

chatgpt-codex-connector Bot reviewed May 20, 2026


lym953 mentioned this pull request May 21, 2026

feat(aws-durable-execution-sdk-js): add aws.durable.operation_attempt tag DataDog/dd-trace-js#8595

Draft

lym953 changed the title ~~[SVLS-9168] feat: add aws.durable.operation_attempt tag to durable operations span~~ feat: [SVLS-9168] add aws.durable.operation_attempt tag to durable operations span May 21, 2026

lym953 wants to merge 5 commits into main from yiming.luo/durable-retry-attempt
