Skip to content

Implement retry state machine aligned with Kotlin/Swift#144

Open
MichaelGHSeg wants to merge 3 commits into
mainfrom
retry-state-machine
Open

Implement retry state machine aligned with Kotlin/Swift#144
MichaelGHSeg wants to merge 3 commits into
mainfrom
retry-state-machine

Conversation

@MichaelGHSeg
Copy link
Copy Markdown
Contributor

Summary

  • Ports the RetryStateMachine architecture from analytics-kotlin to align retry behavior across all three SDKs
  • Adds a pure, side-effect-free state machine that the EventPipeline consults before/after each upload — no in-loop retry; failed batches stay on disk and are retried on the next flush cycle
  • Supports CDN-driven httpConfig with rateLimitConfig and backoffConfig sub-configs (presence implies enabled unless explicitly false)
  • Both configs default to enabled: false (legacy mode) until CDN settings arrive

New retry behavior (when enabled):

  • Status code classification: 408/410/429/460 → retry; 501/505 → drop; other 4xx → drop; other 5xx → retry
  • Exponential backoff: base 0.5s, max 300s, 10% jitter, max 100 retries, max 12h total duration
  • Rate limiting (429): Retry-After header clamped to maxRetryInterval (300s default), pipeline-wide pause
  • X-Retry-Count header: sent on retry attempts for server observability
  • State persistence: retry state saved to storage, survives app restarts
  • Dynamic config: CDN httpConfig updates reconfigure the state machine without losing state

Files added:

  • Analytics-CSharp/Segment/Analytics/Retry/ — RetryStateMachine, RetryState, RetryConfig, RetryTypes, TimeProvider, RetryStateStorage, HttpConfigParser
  • Tests/Retry/ — 47 unit tests covering state machine logic, storage round-trips, and config parsing

Files modified:

  • EventPipeline.cs / SyncEventPipeline.cs — upload loop consults state machine
  • HTTPClient.cs — X-Retry-Count header, Retry-After parsing, UploadWithResponse method
  • SegmentDestination.cs — parses httpConfig from CDN settings
  • e2e-cli/ — retry-enabled pipeline provider, flush/poll loop for e2e testing

Test plan

  • 205 unit tests pass (including 47 new retry tests)
  • E2E basic suite: 2/2 passing
  • E2E retry suite: 46/46 passing
  • E2E retry-settings suite: 20/20 passing
  • Verify backward compatibility: legacy mode (both configs disabled) preserves existing keep-on-429/5xx, drop-on-4xx behavior
  • Manual test with real Segment CDN returning httpConfig

Port the RetryStateMachine architecture from analytics-kotlin to align
retry behavior across all three SDKs. The state machine classifies HTTP
status codes, tracks per-batch failure counts with exponential backoff,
handles 429 rate limiting with Retry-After support, and accepts dynamic
configuration from CDN settings (httpConfig).

Key changes:
- Add Retry/ module: RetryStateMachine, RetryState, RetryConfig,
  RetryTypes, TimeProvider, RetryStateStorage, HttpConfigParser
- Update EventPipeline and SyncEventPipeline upload loops to consult
  the state machine before/after each upload
- Add X-Retry-Count header and Retry-After parsing to HTTPClient
- Wire SegmentDestination to parse httpConfig from CDN settings
- Update e2e-cli with retry-enabled pipeline provider and poll loop
- Add 47 unit tests and enable retry + retry-settings e2e suites
  (68 e2e tests passing)
Copy link
Copy Markdown
Contributor Author

@MichaelGHSeg MichaelGHSeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

Self-Review
Posted by PR author (MichaelGHSeg) — self-reviews cannot apply the APPROVED label. Intended verdict: Manual Review Needed ⚠️. Ask a teammate to run plugin-review for the APPROVED label.

Code Review Personae Verdict: Manual Review Needed ⚠️

Strong port of the Kotlin RetryStateMachine — the state-machine logic (HandleResponse, ShouldUploadBatch, ShouldDeleteBatch, backoff math, 429/Retry-After handling, maxRetryCount semantics) matches the Kotlin reference and is correct. Two issues warrant changes before merge: (1) the hand-rolled JSON state serializer escapes string keys on write but never unescapes on read, so on Windows (where batch keys are absolute file paths with backslashes) persisted retry metadata never round-trips — orphaning backoff timers and the max-retry drop counter after restart; (2) the e2e harness can mask a genuinely dropped batch as a passing test.

Several medium IO/threading items are worth addressing for alignment with Kotlin.

Warning

Required Actions

  1. Unescape JSON string keys/values on read in RetryStateStorage (or use a real JSON codec) — Windows path keys currently fail to round-trip (Analytics-CSharp/Segment/Analytics/Retry/RetryStateStorage.cs:191)
  2. Stop clearing deliveryErrors mid-poll in e2e-cli — a dropped batch can be masked as success when other batches are still pending (e2e-cli/Program.cs:261)

📝 Deep Code Review (3 Passes) — 2 High, 5 Medium, 3 Low ❌

Three independent bug-finder passes plus a manual verification of the top claims against the repo and the Kotlin reference.

Consensus (found by 2+ passes):

  • RetryStateStorage escape/unescape round-trip — flagged High by passes 1 & 2, and verified: FileEventStream.Read() returns FileInfo.FullName (absolute paths), so on Windows batch keys contain backslashes. EscapeJsonString doubles them on write but ParseBatchMetadata/ParseJsonFields extract keys with raw Substring and never unescape — the loaded key no longer matches the live URL.
  • Per-batch SaveRetryState on the network dispatcher — flagged by passes 2 & 3. Kotlin wraps saveRetryState in withContext(fileIODispatcher) (EventPipeline.kt:273); the C# port writes the full prefs blob inline on NetworkIODispatcher, once per batch per flush.
  • Unsynchronized _retryStateMachine swap — flagged by passes 1 & 2. UpdateHttpConfig (AnalyticsDispatcher) reassigns the field while Upload() (NetworkIODispatcher) reads it; no volatile/lock.

Verified correct (no action): backoff overflow is capped (Math.Min vs Infinity → maxMs, no NaN), MaxTotalBackoffDuration*1000 does not overflow long, _retryState is correctly preserved across UpdateHttpConfig (matches Kotlin), Retry-After=0 yields immediate retry, exception-as-500 handling matches Kotlin.

Findings

  • 🔴 HIGH: [Found by 2/3 passes] State keys escaped on write but never unescaped on read — Windows path keys break round-trip (Analytics-CSharp/Segment/Analytics/Retry/RetryStateStorage.cs:191)
    Batch keys are full file paths: FileEventStream.Read() returns _directory.GetFiles().Select(f => f.FullName) (EventOutputStream.cs:222). On Windows these contain backslashes (e.g. C:\Users\x\...\segment.events.123.tmp). SerializeState correctly calls EscapeJsonString(kvp.Key) (line 61/76), turning \ into \\. But DeserializeState never reverses it: ParseBatchMetadata (line 191) and ParseJsonFields (line ~147) extract keys/values with raw json.Substring(...) and IndexOf('"'). So a key written from C:\Users\... is read back with doubled backslashes and no longer equals the live url passed to ShouldUploadBatch/GetRetryCount. After any save→load round-trip (process restart; the in-loop SaveRetryState then a fresh pipeline's LoadRetryState), per-batch FailureCount/NextRetryTime/FirstFailureTime are orphaned: backoff timers reset and the per-batch max-retry drop limit never trips, so a permanently-failing batch retries forever. Additionally, IndexOf('"') is not escape-aware, so a key containing an escaped quote terminates early and corrupts the rest of the parse. Kotlin avoids this by using kotlinx.serialization (a real JSON codec).
    Suggestion: Add an UnescapeJsonString helper (reverse of EscapeJsonString: \\\, \"") and apply it to every parsed key and string value; make the closing-quote scan skip the char after a backslash. Add a round-trip unit test with a Windows-style path key like C:\\a\\b.tmp.
  • 🔴 HIGH: deliveryErrors.Clear() in poll loop can mask a genuine batch drop (e2e-cli/Program.cs:261)
    When a batch is dropped (DropBatchDecision, or a 4xx/non-2xx with ShouldDeleteBatch==true), the pipeline calls ReportInternalErrorCapturingErrorHandler appends to deliveryErrors, and removes the file in the SAME cycle. The poll loop calls deliveryErrors.Clear() on every cycle where pending.Count > 0 (line 261). So if a batch is genuinely dropped while other batch files are still pending, the next poll sees pending>0, wipes the drop error, and once the survivors upload the harness reports success=true — masking a real dropped/rejected batch as a passing test. The single-batch case works only because pending hits 0 in the same cycle the error fires, skipping the clear.
    Suggestion: Don't clear deliveryErrors during the poll loop. Track a separate never-cleared dropped-batch count (or accumulate+dedupe errors) and treat any drop as failure regardless of whether other batches later succeed.
  • 🟡 MEDIUM: [Found by 2/3 passes] SaveRetryState rewrites the full prefs file per batch on the network thread (diverges from Kotlin) (Analytics-CSharp/Segment/Analytics/Utilities/EventPipeline.cs:242)
    RetryStateStorage.SaveRetryStatestorage.WritePrefsUserPrefs.Put serializes the entire prefs cache and rewrites the file (with a .bak copy). The Upload loop calls it once per batch — both the drop path (line 186) and the post-response path (line 242) — even on a plain 2xx success where nothing changed. For N pending batches that is N full-file writes per flush, and the e2e poll loop re-flushes every interval. Kotlin moves this off the network thread via withContext(fileIODispatcher) { storage.saveRetryState(...) } (EventPipeline.kt:273-274). Correctness is fine; this is an IO/throughput regression and a dispatcher divergence. Mirror exists in SyncEventPipeline.cs:267.
    Suggestion: Marshal SaveRetryState onto FileIODispatcher (Scope.WithContext) as Kotlin does, and/or save once at the end of the loop or only when _retryState actually changed (skip the legacy-mode success path).
  • 🟡 MEDIUM: [Found by 2/3 passes] Unsynchronized _retryStateMachine swap across dispatchers (Analytics-CSharp/Segment/Analytics/Utilities/EventPipeline.cs:81)
    _retryStateMachine and _retryState are plain mutable fields (no lock/volatile). UpdateHttpConfig (line 76-82) runs on AnalyticsDispatcher (SegmentDestination.Update is launched there) and swaps _retryStateMachine, while Upload() reads it repeatedly per batch on NetworkIODispatcher. Reference assignment is atomic on the CLR (no torn pointer/crash), but there is no happens-before barrier: the upload thread can keep using the stale (legacy) machine for an unbounded time, or switch machines mid-flush between batches, yielding inconsistent decisions in one cycle. Kotlin has the same var pattern, but C#'s distinct dispatchers make the race observable. Note: _retryState itself IS correctly preserved across UpdateHttpConfig (matches Kotlin).
    Suggestion: Mark _retryStateMachine volatile, or read it once into a local at the top of each upload iteration and use that local for all calls in the batch.
  • 🟡 MEDIUM: PipelineState persisted as raw ordinal int; fragile and silently degrades to Ready (Analytics-CSharp/Segment/Analytics/Retry/RetryStateStorage.cs:48)
    SerializeState writes (int)state.PipelineState (line 48) and DeserializeState only sets RateLimited when the parsed int == 1 (line 90). This couples the on-disk format to the enum declaration order (Ready=0, RateLimited=1). If a value is ever inserted before RateLimited, a previously-persisted RateLimited state deserializes as Ready, so a rate-limited pipeline resumes uploading immediately after restart and ignores the server's Retry-After window. Kotlin serializes the enum by name (order-independent). Low likelihood (enums are append-only by convention) but it controls rate-limit safety.
    Suggestion: Serialize/deserialize PipelineState by name ("Ready"/"RateLimited") rather than ordinal, matching Kotlin.
  • 🟡 MEDIUM: ReportInternalError fires on every routine non-2xx drop, surfacing benign rejections as errors (Analytics-CSharp/Segment/Analytics/Utilities/EventPipeline.cs:222)
    On any non-success response where ShouldDeleteBatch returns true (e.g. a plain 400/413, or a 3xx classified as Drop), the loop calls ReportInternalError(NetworkServerRejected, "HTTP {code}: batch rejected by server") (line 222-223). A batch the server legitimately rejects (the correct terminal outcome) is reported as an error. Combined with the e2e harness's new deliveryErrors-based failure gate, a final-cycle drop fails the run even though the SDK behaved correctly and no files remain. The old Upload() path also reported on 4xx, so this is not strictly new, but the interaction with the e2e gate is.
    Suggestion: In e2e-cli base success purely on remaining pending files (a dropped batch is a completed delivery decision). Optionally in the pipeline only report unexpected drops, not expected 4xx rejections.
  • 🟡 MEDIUM: Fast 200 success before first poll causes CLI to wait the full timeout (e2e-cli/Program.cs:243)
    everSeenPending only becomes true if PendingUploads() is non-empty during a poll. With the 2s init Thread.Sleep plus a low flushAt, a fast 200 upload can complete before the first poll (300ms in). everSeenPending stays false, the early-break (stableEmptyCount >= 2) is never reached, and the loop runs until the deadline (default 20s). The result is correct (remaining empty, no errors → success) but every fast-success test pays the full timeout, slowing the whole suite.
    Suggestion: Capture the initial pending count right after the explicit Flush() before polling, or allow the stable-empty early-break even when everSeenPending is false once a flush has been issued.
  • 🔵 LOW: CDN httpConfig update is ignored for custom IEventPipeline implementations (Analytics-CSharp/Segment/Analytics/Plugins/SegmentDestination.cs:92)
    Update() casts _pipeline to EventPipeline and SyncEventPipeline (lines 92/95) and calls UpdateHttpConfig on whichever matches. Any custom IEventPipeline supplied via Configuration.EventPipelineProvider silently ignores CDN httpConfig and stays in its constructed mode. The e2e-cli provider returns a concrete EventPipeline so it's unaffected, but third-party pipelines won't receive CDN-driven enable/disable. Correctness gap, not a crash.
    Suggestion: Add UpdateHttpConfig (or an httpConfig setter) to the IEventPipeline interface, or document that custom pipelines must consume httpConfig themselves.
  • 🔵 LOW: Legacy Upload() lost 3xx error reporting (now effectively dead in the pipeline) (Analytics-CSharp/Segment/Analytics/Utilities/HTTPClient.cs:99)
    The rewritten Upload() no longer reports NetworkUnexpectedHttpCode for 3xx (the old switch did); 3xx now falls through to return false (keep for retry). Upload() is public API but the pipeline now exclusively uses UploadWithResponse(), so this only affects external callers of Upload(). Minor behavior change.
    Suggestion: If Upload() must preserve old semantics for external callers, restore the 3xx ReportInternalError branch; otherwise consider marking Upload() obsolete.
  • 🔵 LOW: Programmatically-built RetryConfig path never calls Validated() (aligned with Kotlin) (Analytics-CSharp/Segment/Analytics/Retry/RetryConfig.cs:304)
    The CDN path (HttpConfigParser.Parse) calls Validated(), but the pipeline constructors build RetryConfig directly from HttpConfig.RateLimitConfig/BackoffConfig without re-validating, and e2e-cli constructs configs from raw test JSON (unclamped). A 0 or negative maxRetryCount makes ShouldUploadBatch drop every batch on the first attempt (GlobalRetryCount 0 >= MaxRetryCount <=0). This matches the Kotlin reference exactly (Kotlin also skips validated() in init/updateHttpConfig and builds e2e configs raw), so it is aligned behavior — noting only as a defensive-clamp opportunity at the input boundary.
    Suggestion: Optionally clamp maxRetries in e2e-cli (Math.Max(1, ...)) to avoid confusing test results; leaving the library aligned with Kotlin is fine.

📝 Policy Compliance Review — No violations (3 policies evaluated)

Evaluated 3 policies (POL-001-port-forward-to-prod-db, POL-002-prod-db-write-from-laptop, POL-003-credentials-in-pr) across all review surfaces (24 surfaces: PR body, commit messages, one code surface per changed file). No violations detected. No loader warnings.

Review Stages Executed

  • ✅ Deep Code Review (3 Passes)
  • ✅ Policy Compliance Review
    Generated by code-review-personae v0.8.1 | Deep Code Review (3 Passes), Policy Compliance Review

int keyEnd = json.IndexOf('"', keyStart + 1);
if (keyEnd < 0) break;
string batchFile = json.Substring(keyStart + 1, keyEnd - keyStart - 1);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

🔴 HIGH: [Found by 2/3 passes] State keys escaped on write but never unescaped on read — Windows path keys break round-trip

Batch keys are full file paths: FileEventStream.Read() returns _directory.GetFiles().Select(f => f.FullName) (EventOutputStream.cs:222). On Windows these contain backslashes (e.g. C:\Users\x\...\segment.events.123.tmp). SerializeState correctly calls EscapeJsonString(kvp.Key) (line 61/76), turning \ into \\. But DeserializeState never reverses it: ParseBatchMetadata (line 191) and ParseJsonFields (line ~147) extract keys/values with raw json.Substring(...) and IndexOf('"'). So a key written from C:\Users\... is read back with doubled backslashes and no longer equals the live url passed to ShouldUploadBatch/GetRetryCount. After any save→load round-trip (process restart; the in-loop SaveRetryState then a fresh pipeline's LoadRetryState), per-batch FailureCount/NextRetryTime/FirstFailureTime are orphaned: backoff timers reset and the per-batch max-retry drop limit never trips, so a permanently-failing batch retries forever. Additionally, IndexOf('"') is not escape-aware, so a key containing an escaped quote terminates early and corrupts the rest of the parse. Kotlin avoids this by using kotlinx.serialization (a real JSON codec).

Suggestion: Add an UnescapeJsonString helper (reverse of EscapeJsonString: \\\, \"") and apply it to every parsed key and string value; make the closing-quote scan skip the char after a backslash. Add a round-trip unit test with a Windows-style path key like C:\\a\\b.tmp.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed4ec1c. Added UnescapeJsonString (reverse of EscapeJsonString: \\\, \"") and an escape-aware FindClosingQuote that skips the char after a backslash, applied to all parsed keys and string values in both ParseJsonFields and ParseBatchMetadata. Windows path keys now round-trip exactly. Added round-trip tests RoundTrip_WindowsPathKeys and RoundTrip_KeyWithEscapedQuoteAndTrailingBatch.

Comment thread e2e-cli/Program.cs Outdated
everSeenPending = true;
stableEmptyCount = 0;
deliveryErrors.Clear();
// Trigger a new upload cycle for retry
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

🔴 HIGH: deliveryErrors.Clear() in poll loop can mask a genuine batch drop

When a batch is dropped (DropBatchDecision, or a 4xx/non-2xx with ShouldDeleteBatch==true), the pipeline calls ReportInternalErrorCapturingErrorHandler appends to deliveryErrors, and removes the file in the SAME cycle. The poll loop calls deliveryErrors.Clear() on every cycle where pending.Count > 0 (line 261). So if a batch is genuinely dropped while other batch files are still pending, the next poll sees pending>0, wipes the drop error, and once the survivors upload the harness reports success=true — masking a real dropped/rejected batch as a passing test. The single-batch case works only because pending hits 0 in the same cycle the error fires, skipping the clear.

Suggestion: Don't clear deliveryErrors during the poll loop. Track a separate never-cleared dropped-batch count (or accumulate+dedupe errors) and treat any drop as failure regardless of whether other batches later succeed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in ed4ec1c. Removed the deliveryErrors.Clear() from inside the poll loop. The pipeline only reports an error on a permanent drop/rejection (never on transient retries), so any captured error is now treated as a genuine failure and a real dropped batch can no longer be masked by other batches later succeeding.

);
_retryState = _retryStateMachine.HandleResponse(_retryState, responseInfo);
RetryStateStorage.SaveRetryState(_storage, _retryState);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

🟡 MEDIUM: [Found by 2/3 passes] SaveRetryState rewrites the full prefs file per batch on the network thread (diverges from Kotlin)

RetryStateStorage.SaveRetryStatestorage.WritePrefsUserPrefs.Put serializes the entire prefs cache and rewrites the file (with a .bak copy). The Upload loop calls it once per batch — both the drop path (line 186) and the post-response path (line 242) — even on a plain 2xx success where nothing changed. For N pending batches that is N full-file writes per flush, and the e2e poll loop re-flushes every interval. Kotlin moves this off the network thread via withContext(fileIODispatcher) { storage.saveRetryState(...) } (EventPipeline.kt:273-274). Correctness is fine; this is an IO/throughput regression and a dispatcher divergence. Mirror exists in SyncEventPipeline.cs:267.

Suggestion: Marshal SaveRetryState onto FileIODispatcher (Scope.WithContext) as Kotlin does, and/or save once at the end of the loop or only when _retryState actually changed (skip the legacy-mode success path).

? new RetryConfig(config.RateLimitConfig, config.BackoffConfig)
: new RetryConfig();
_retryStateMachine = new RetryStateMachine(retryConfig);
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

🟡 MEDIUM: [Found by 2/3 passes] Unsynchronized _retryStateMachine swap across dispatchers

_retryStateMachine and _retryState are plain mutable fields (no lock/volatile). UpdateHttpConfig (line 76-82) runs on AnalyticsDispatcher (SegmentDestination.Update is launched there) and swaps _retryStateMachine, while Upload() reads it repeatedly per batch on NetworkIODispatcher. Reference assignment is atomic on the CLR (no torn pointer/crash), but there is no happens-before barrier: the upload thread can keep using the stale (legacy) machine for an unbounded time, or switch machines mid-flush between batches, yielding inconsistent decisions in one cycle. Kotlin has the same var pattern, but C#'s distinct dispatchers make the race observable. Note: _retryState itself IS correctly preserved across UpdateHttpConfig (matches Kotlin).

Suggestion: Mark _retryStateMachine volatile, or read it once into a local at the top of each upload iteration and use that local for all calls in the batch.

var sb = new StringBuilder();
sb.Append("{");
sb.Append("\"pipelineState\":\"").Append((int)state.PipelineState).Append("\"");
if (state.WaitUntilTime.HasValue)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

🟡 MEDIUM: PipelineState persisted as raw ordinal int; fragile and silently degrades to Ready

SerializeState writes (int)state.PipelineState (line 48) and DeserializeState only sets RateLimited when the parsed int == 1 (line 90). This couples the on-disk format to the enum declaration order (Ready=0, RateLimited=1). If a value is ever inserted before RateLimited, a previously-persisted RateLimited state deserializes as Ready, so a rate-limited pipeline resumes uploading immediately after restart and ignores the server's Retry-After window. Kotlin serializes the enum by name (order-independent). Low likelihood (enums are append-only by convention) but it controls rate-limit safety.

Suggestion: Serialize/deserialize PipelineState by name ("Ready"/"RateLimited") rather than ordinal, matching Kotlin.

if (shouldCleanup)
{
_analytics.ReportInternalError(AnalyticsErrorType.NetworkServerRejected,
message: "HTTP " + statusCode + ": batch rejected by server");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

🟡 MEDIUM: ReportInternalError fires on every routine non-2xx drop, surfacing benign rejections as errors

On any non-success response where ShouldDeleteBatch returns true (e.g. a plain 400/413, or a 3xx classified as Drop), the loop calls ReportInternalError(NetworkServerRejected, "HTTP {code}: batch rejected by server") (line 222-223). A batch the server legitimately rejects (the correct terminal outcome) is reported as an error. Combined with the e2e harness's new deliveryErrors-based failure gate, a final-cycle drop fails the run even though the SDK behaved correctly and no files remain. The old Upload() path also reported on 4xx, so this is not strictly new, but the interaction with the e2e gate is.

Suggestion: In e2e-cli base success purely on remaining pending files (a dropped batch is a completed delivery decision). Optionally in the pipeline only report unexpected drops, not expected 4xx rejections.

Comment thread e2e-cli/Program.cs
// Poll until batch files are processed (uploaded or dropped).
long deadlineMs = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() + (timeoutSeconds * 1000L);
bool everSeenPending = false;
int pollInterval = 300;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

🟡 MEDIUM: Fast 200 success before first poll causes CLI to wait the full timeout

everSeenPending only becomes true if PendingUploads() is non-empty during a poll. With the 2s init Thread.Sleep plus a low flushAt, a fast 200 upload can complete before the first poll (300ms in). everSeenPending stays false, the early-break (stableEmptyCount >= 2) is never reached, and the loop runs until the deadline (default 20s). The result is correct (remaining empty, no errors → success) but every fast-success test pays the full timeout, slowing the whole suite.

Suggestion: Capture the initial pending count right after the explicit Flush() before polling, or allow the stable-empty early-break even when everSeenPending is false once a flush has been issued.

if (parsedConfig != null)
{
EventPipeline concretePipeline = _pipeline as EventPipeline;
concretePipeline?.UpdateHttpConfig(parsedConfig);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

🔵 LOW: CDN httpConfig update is ignored for custom IEventPipeline implementations

Update() casts _pipeline to EventPipeline and SyncEventPipeline (lines 92/95) and calls UpdateHttpConfig on whichever matches. Any custom IEventPipeline supplied via Configuration.EventPipelineProvider silently ignores CDN httpConfig and stays in its constructed mode. The e2e-cli provider returns a concrete EventPipeline so it's unaffected, but third-party pipelines won't receive CDN-driven enable/disable. Correctness gap, not a crash.

Suggestion: Add UpdateHttpConfig (or an httpConfig setter) to the IEventPipeline interface, or document that custom pipelines must consume httpConfig themselves.

RetryStateStorage: batch keys are absolute file paths and on Windows
contain backslashes. SerializeState escaped them but DeserializeState
never unescaped, so keys failed to round-trip and per-batch retry
metadata was orphaned after a restart. Add UnescapeJsonString and an
escape-aware closing-quote scan; apply to all parsed keys and string
values. Adds round-trip tests for Windows paths and embedded quotes.

e2e-cli: stop clearing deliveryErrors inside the poll loop. The pipeline
only reports an error on a permanent drop (never on transient retries),
so clearing while other batches are pending could mask a genuine drop as
a passing test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant