@msrathore-db msrathore-db commented Oct 27, 2025

Summary

Implements straggler download mitigation for CloudFetch to improve query performance by detecting and cancelling abnormally slow parallel downloads. The feature monitors active downloads and automatically retries stragglers when faster slots become available, with an optional fallback to sequential mode after a configurable threshold.

Changes

New Classes:

  • StragglerDownloadDetector (StragglerDetector.cs)

    • Core detection algorithm using median throughput analysis
    • Configurable multiplier (default 1.5x slower than median)
    • Minimum completion quantile (default 60%)
    • Straggler padding grace period (default 5 seconds)
    • Sequential fallback threshold tracking
    • Duplicate detection prevention via tracking dictionary
  • FileDownloadMetrics (FileDownloadMetrics.cs)

    • Tracks per-file download performance
    • Start time, completion time, and throughput calculation
    • Straggler cancellation flag
    • Thread-safe state management
  • CloudFetchStragglerMitigationConfig (CloudFetchStragglerMitigationConfig.cs)

    • Configuration management and validation
    • Connection property parsing
    • Default values management
    • Parameter range validation

CloudFetchDownloader Integration:

  • Background monitoring thread (runs every 5 seconds)
  • Per-file CancellationTokenSource for clean cancellation
  • Automatic retry mechanism for cancelled stragglers
  • Metrics tracking for all active downloads
  • Sequential fallback mode support
  • Thread-safe metrics and cancellation token management
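
As a rough illustration of the background monitoring described above, here is a minimal sketch of a periodic check loop that stops cleanly on shutdown. `MonitorSketch` and `checkOnce` are hypothetical names used only for this example; the PR's actual monitor lives inside CloudFetchDownloader and may be structured differently.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class MonitorSketch
{
    // Runs checkOnce after every interval until the shutdown token is cancelled.
    // checkOnce is a hypothetical callback standing in for one straggler check.
    public static async Task RunAsync(Action checkOnce, TimeSpan interval, CancellationToken shutdown)
    {
        try
        {
            while (!shutdown.IsCancellationRequested)
            {
                // Task.Delay throws OperationCanceledException if shutdown fires mid-wait.
                await Task.Delay(interval, shutdown).ConfigureAwait(false);
                checkOnce();
            }
        }
        catch (OperationCanceledException)
        {
            // Cancellation is the normal way for the loop to end.
        }
    }
}
```

With the interval from the description, a caller would pass `TimeSpan.FromSeconds(5)` and the downloader's shutdown token.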

Configuration Parameters:
All parameters use the adbc.databricks.cloudfetch. prefix:

  • straggler_mitigation_enabled (default: false) - Feature toggle
  • straggler_multiplier (default: 1.5) - Throughput multiplier for detection
  • straggler_quantile (default: 0.6) - Minimum completion percentage before detection
  • straggler_padding_seconds (default: 5) - Grace period before flagging as straggler
  • max_stragglers_per_query (default: 10) - Threshold to trigger sequential fallback
  • synchronous_fallback_enabled (default: true) - Enable automatic fallback to sequential mode
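
For illustration, the parameters above could be collected into a string-valued property dictionary like the one below. This sketch assumes the driver reads connection properties as string key/value pairs; `StragglerConfigExample` is a hypothetical helper, and the values shown are the documented defaults except for the feature toggle, which is switched on.

```csharp
using System.Collections.Generic;

public static class StragglerConfigExample
{
    // Builds connection properties that opt in to straggler mitigation.
    public static Dictionary<string, string> Build()
    {
        const string prefix = "adbc.databricks.cloudfetch.";
        return new Dictionary<string, string>
        {
            [prefix + "straggler_mitigation_enabled"] = "true",  // default is false
            [prefix + "straggler_multiplier"] = "1.5",
            [prefix + "straggler_quantile"] = "0.6",
            [prefix + "straggler_padding_seconds"] = "5",
            [prefix + "max_stragglers_per_query"] = "10",
            [prefix + "synchronous_fallback_enabled"] = "true",
        };
    }
}
```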

Benefits

  • Performance: Up to 50% improvement in queries with straggler downloads (based on Java JDBC driver results)
  • Robustness: Handles network variability and slow storage automatically
  • Safety: Opt-in feature with zero overhead when disabled
  • Flexibility: All parameters are tunable for different network conditions
  • Backward Compatible: No changes to existing behavior when disabled

Technical Details

Detection Algorithm:

  1. Wait until 60% of downloads complete (configurable quantile)
  2. Calculate median throughput from completed downloads
  3. Identify downloads running 50% slower than median (configurable multiplier)
  4. Apply 5-second grace period before flagging as straggler
  5. Cancel via per-file CancellationTokenSource
  6. Track cancelled downloads to prevent duplicate detection
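
Steps 1 to 4 can be sketched roughly as follows. This is not the PR's StragglerDownloadDetector: `DownloadSample` and `StragglerSketch` are hypothetical names, and the sketch assumes that "slower than median" means an active download's elapsed time exceeds the multiplier times the time expected at median throughput, plus the grace period.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for per-file metrics; not the PR's FileDownloadMetrics class.
public sealed class DownloadSample
{
    public DownloadSample(long offset, long sizeBytes, double elapsedSeconds, bool completed)
    {
        Offset = offset; SizeBytes = sizeBytes; ElapsedSeconds = elapsedSeconds; Completed = completed;
    }
    public long Offset { get; }
    public long SizeBytes { get; }
    public double ElapsedSeconds { get; }
    public bool Completed { get; }
}

public static class StragglerSketch
{
    // Median of a non-empty sequence: middle value for odd counts,
    // mean of the two middle values for even counts.
    public static double Median(IEnumerable<double> values)
    {
        double[] sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Returns the offsets of active downloads that look like stragglers.
    public static List<long> FindStragglers(
        IReadOnlyList<DownloadSample> samples,
        double multiplier = 1.5, double quantile = 0.6, double paddingSeconds = 5.0)
    {
        var stragglers = new List<long>();
        List<DownloadSample> completed =
            samples.Where(m => m.Completed && m.ElapsedSeconds > 0).ToList();

        // Step 1: do nothing until the completion quantile is reached.
        if (completed.Count == 0 || completed.Count < Math.Ceiling(samples.Count * quantile))
            return stragglers;

        // Step 2: median throughput (bytes per second) of completed downloads.
        double medianBytesPerSecond = Median(completed.Select(m => m.SizeBytes / m.ElapsedSeconds));

        // Steps 3 and 4: flag active downloads that exceed the allowed time.
        foreach (DownloadSample active in samples.Where(m => !m.Completed))
        {
            double allowedSeconds = multiplier * active.SizeBytes / medianBytesPerSecond + paddingSeconds;
            if (active.ElapsedSeconds > allowedSeconds)
                stragglers.Add(active.Offset);
        }
        return stragglers;
    }
}
```

Under these assumptions, if six 100-byte downloads each completed in 1 second, an active 100-byte download is allowed 1.5 × 1 + 5 = 6.5 seconds before it is flagged.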

Sequential Fallback:

  • Triggered after 10 stragglers detected in current batch (configurable)
  • Applies only to remaining files in current batch
  • Resets for next FetchResults call
  • Prevents thrashing in consistently slow network conditions
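
One way to picture the fallback is a second semaphore with capacity 1 that downloads acquire only while sequential mode is on, as the commit messages in this PR describe. The sketch below is a simplified assumption of that idea; `DownloadThrottle` and its members are hypothetical names, not the PR's code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class DownloadThrottle
{
    private readonly SemaphoreSlim _parallel;
    private readonly SemaphoreSlim _sequential = new SemaphoreSlim(1, 1);
    private volatile bool _isSequentialMode;

    public DownloadThrottle(int maxParallel)
    {
        _parallel = new SemaphoreSlim(maxParallel, maxParallel);
    }

    // Called once the straggler count crosses the fallback threshold.
    public void EnterSequentialMode() => _isSequentialMode = true;

    public async Task<T> RunAsync<T>(Func<Task<T>> download, CancellationToken ct = default)
    {
        await _parallel.WaitAsync(ct).ConfigureAwait(false);
        bool tookSequential = false;
        try
        {
            // Read the mode once and remember the decision, so the release
            // below always matches what was actually acquired.
            if (_isSequentialMode)
            {
                await _sequential.WaitAsync(ct).ConfigureAwait(false);
                tookSequential = true;
            }
            return await download().ConfigureAwait(false);
        }
        finally
        {
            // Release in reverse order: sequential first, then parallel.
            if (tookSequential) _sequential.Release();
            _parallel.Release();
        }
    }
}
```

Remembering whether the sequential semaphore was taken avoids releasing a semaphore that was never acquired if the mode flag flips while a download is in flight.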

Thread Safety:

  • ConcurrentDictionary for metrics and cancellation tokens
  • Atomic operations for counter increments
  • Proper cleanup in finally blocks
  • Linked cancellation tokens for cascading shutdown
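
A minimal sketch of how per-file cancellation tokens might be kept in a ConcurrentDictionary and linked to a shutdown token is shown below. `PerFileCancellation` and its method names are hypothetical and chosen only for this example; the real downloader's bookkeeping may differ.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class PerFileCancellation
{
    private readonly ConcurrentDictionary<long, CancellationTokenSource> _sources =
        new ConcurrentDictionary<long, CancellationTokenSource>();

    // Starts a new attempt for a file: swaps in a fresh source and disposes the stale one.
    // The update callback may run more than once under contention, which is tolerable
    // here because CancellationTokenSource.Dispose can safely be called repeatedly.
    public CancellationToken StartAttempt(long offset, CancellationToken shutdown)
    {
        var fresh = CancellationTokenSource.CreateLinkedTokenSource(shutdown);
        _sources.AddOrUpdate(offset, fresh, (_, old) => { old.Dispose(); return fresh; });
        return fresh.Token;
    }

    // Cancels only this file's download; the shutdown token is unaffected.
    public bool CancelStraggler(long offset)
    {
        if (!_sources.TryGetValue(offset, out var cts)) return false;
        try { cts.Cancel(); return true; }
        catch (ObjectDisposedException) { return false; } // lost a race with cleanup
    }

    // Intended to be called from a finally block when the download ends.
    public void Complete(long offset)
    {
        if (_sources.TryRemove(offset, out var cts)) cts.Dispose();
    }
}
```

Because each source is linked to the shutdown token, cancelling shutdown cascades to every active download, while cancelling one file's source leaves the others running.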

Testing

38 tests total, all passing:

Unit Tests (19):

  • FileDownloadMetrics throughput calculation (before/after completion)
  • FileDownloadMetrics straggler flag setting
  • StragglerDownloadDetector parameter validation (multiplier, quantile)
  • Median calculation correctness (odd/even counts)
  • Quantile threshold enforcement
  • Fallback threshold triggering
  • Empty metrics list handling
  • Cancelled downloads filtering
  • Duplicate detection prevention (tracking dictionary)
  • CancellationTokenSource atomic replacement
  • Cleanup behavior under exceptions
  • Shutdown cancellation respect
  • Concurrent CTS cleanup
  • Counter overflow protection (long type)
  • Concurrent modification safety

E2E Tests (19):

  • Slow download identification and cancellation
  • Fast downloads not marked as stragglers
  • Minimum completion quantile requirement
  • Sequential fallback activation after threshold
  • Sequential mode enforcement (one download at a time)
  • No stragglers detected in sequential mode
  • Sequential fallback applies only to current batch
  • Monitoring thread respects cancellation
  • Parallel mode respects max parallel downloads
  • Cancelled straggler retry logic
  • Mixed speed download scenarios
  • Clean shutdown during monitoring
  • Feature disabled by default verification
  • Configuration parameter definitions
  • Configuration parameter naming convention
  • StragglerDownloadDetector creation
  • FileDownloadMetrics creation
  • Counter increments atomically

Documentation

  • straggler-mitigation-design.md - Comprehensive design doc with algorithm details, implementation notes, configuration guide, and usage examples
  • Inline code comments - Detailed documentation in all new classes

🤖 Generated with Claude Code

@msrathore-db msrathore-db changed the title Added design docs for implementation of straggle download mitigation docs(csharp/src/Drivers/Databricks): Added design docs for implementation of straggle download mitigation Oct 27, 2025
end
```

### 3.2 Download with Straggler Detection
Contributor

Make sure we handle this corner case: if every download attempt is simply slow, straggler cancellation will make this chunk's download fail. We may need some protection against that:

  1. On the last retry, skip straggler cancellation.
  2. Or keep the original download running while the straggler retry starts, and take the result from whichever succeeds first.

Contributor Author

I've used the first approach and updated the doc accordingly.
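
The first approach (no straggler cancellation on the last retry) might look roughly like the sketch below. `RetrySketch`, `attempt`, and `exposeToMonitor` are hypothetical names for illustration only; the sketch assumes that a straggler cancellation consumes one retry attempt and that the final attempt is never handed to the monitor.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetrySketch
{
    // attempt: performs one download try with the given token.
    // exposeToMonitor: hands the per-attempt source to the straggler monitor.
    public static async Task<T> RunAsync<T>(
        int maxAttempts,
        Func<int, CancellationToken, Task<T>> attempt,
        Action<int, CancellationTokenSource> exposeToMonitor,
        CancellationToken shutdown = default)
    {
        for (int i = 0; i < maxAttempts; i++)
        {
            if (i == maxAttempts - 1)
            {
                // Last attempt: only shutdown can cancel it, so a slow but
                // working download is allowed to finish.
                return await attempt(i, shutdown).ConfigureAwait(false);
            }

            using var stragglerCts = CancellationTokenSource.CreateLinkedTokenSource(shutdown);
            exposeToMonitor(i, stragglerCts);
            try
            {
                return await attempt(i, stragglerCts.Token).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (!shutdown.IsCancellationRequested)
            {
                // Cancelled as a straggler: this counts as one attempt, so loop and retry.
            }
        }
        throw new ArgumentOutOfRangeException(nameof(maxAttempts), "maxAttempts must be at least 1");
    }
}
```

If shutdown itself is cancelled, the exception filter does not match and the cancellation propagates to the caller instead of triggering another retry.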

msrathore-db and others added 5 commits October 28, 2025 18:30
Add runtime straggler download detection based on median throughput analysis
with automatic cancellation and retry for CloudFetch operations.

Changes:
- Add 6 new configuration parameters in DatabricksParameters.cs
- Implement FileDownloadMetrics class for tracking download timing/throughput
- Implement StragglerDownloadDetector class for median-based detection algorithm
- Integrate straggler handling into CloudFetchDownloader retry loop
- Add background monitoring task for periodic straggler checks
- Add per-file CancellationTokenSource for granular download cancellation
- Implement edge case protection: last retry attempt cannot be cancelled

Key Features:
- Median throughput calculation for outlier resistance
- 60% quantile threshold before detection starts
- Retry integration: straggler cancellation counts as one retry attempt
- OpenTelemetry instrumentation for observability
- Disabled by default for conservative rollout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add minimal unit tests and E2E integration tests for straggler detection
feature, focusing on mistake-prone areas and configuration validation.

Unit Tests (12 tests):
- FileDownloadMetrics throughput calculation and state management
- StragglerDownloadDetector parameter validation
- Median calculation with odd/even counts
- Edge cases: empty lists, below threshold, cancelled downloads
- Fallback threshold trigger validation

E2E Tests (6 tests):
- Configuration parameter validation
- Default disabled behavior
- Parameter naming conventions
- Basic integration with default configuration
- Atomic counter operations

All 18 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ary semaphore

Add simple sequential download fallback that activates when too many stragglers
are detected, using a secondary semaphore approach for clean throttling.

Implementation:
- Add _sequentialSemaphore (1/1 capacity) and _isSequentialMode flag
- Set _isSequentialMode=true when fallback threshold exceeded
- Conditionally acquire sequential semaphore before downloads
- Release in reverse order (sequential then parallel)
- Dispose sequential semaphore in StopAsync

Key advantages:
- Uses semaphore's native throttling behavior
- Can switch back to parallel by flipping flag
- No task chaining complexity or lock contention
- Clean RAII-style acquire/release pattern
- Minimal code changes (~15 lines)

All 18 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Address 5 critical implementation issues identified in code review:

1. Semaphore leak: Wrap task creation in try/catch to release semaphores
   if exception occurs after acquisition but before task creation

2. Race condition: Add fileData == null check to straggler cancellation
   handler to prevent unnecessary retries when download completed just
   before cancellation

3. URL refresh null handling: Log warning when URL refresh fails instead
   of silently continuing with potentially expired URL

4. Memory leak prevention: Move cleanup to finally block to ensure
   per-file cancellation tokens are always disposed

5. Fire-and-forget exception handling: Wrap cleanup task in try/catch
   to prevent unobserved task exceptions

All 18 straggler mitigation tests pass after fixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
| Event Name | When Emitted | Key Tags |
|------------|-------------|----------|
| `cloudfetch.straggler_check` | When stragglers identified | `active_downloads`, `completed_downloads`, `stragglers_identified` |
| `cloudfetch.straggler_cancelling` | Before cancelling straggler | `offset` |
Contributor

I think we just need the cancelled tag.

}
```

#### StragglerDownloadDetectorTests
Contributor

The testing part seems too detailed. Can we show some skeleton code for the straggler monitor instead? For example: how does it monitor all download threads, how does it restart a download, and is the restart on the same straggler thread? I think the purpose of restarting is to establish a new HTTP connection with the cloud provider; does the design guarantee that?

msrathore-db and others added 6 commits October 29, 2025 09:54
Address 8 critical and important implementation issues:

P0 Critical Issues:
1. Sequential semaphore lifecycle: Remove readonly and disposal to support
   restart scenarios without ObjectDisposedException
2. Sequential semaphore TOCTOU race: Capture mode atomically at acquisition
   time to prevent semaphore count drift
3. Try/finally coverage: Move metrics initialization inside try block to
   ensure cleanup always runs and prevent memory leaks

P1 Important Issues:
4. Duplicate straggler detection: Add tracking dictionary to prevent counting
   same file multiple times across retry cycles
5. Counter overflow protection: Change from int to long (max ~9 quintillion)
   to prevent overflow in pathological scenarios

P2 Issues:
6. Cleanup task cancellation: Use cancellationToken in fire-and-forget cleanup
   to respect shutdown and remove immediately if cancelled
7. File size validation: Add constructor validation to reject zero or negative
   file sizes and prevent invalid throughput calculations
8. Stale CTS atomicity: Use AddOrUpdate to atomically replace cancellation
   token source and dispose old one, preventing race conditions

All 18 straggler mitigation tests pass after fixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add 25 comprehensive end-to-end tests covering all straggler mitigation
functionality with realistic scenarios.

Test Coverage:
**Straggler Detection (5 tests)**
- Fast/slow download detection with proper timing
- Quantile threshold validation
- All-fast downloads (no false positives)
- Already-cancelled exclusion
- Empty/null metrics handling

**Sequential Fallback (2 tests)**
- Fallback triggers when threshold exceeded
- Fallback does not trigger below threshold

**Duplicate Detection Prevention (2 tests)**
- With tracking dict: same file counted only once
- Without tracking dict: counts multiple times (control test)

**FileDownloadMetrics (4 tests)**
- Invalid size validation (zero and negative)
- Throughput calculation accuracy
- Throughput before completion returns null
- Straggler flag functionality

**Counter Overflow Protection (1 test)**
- Verifies long type usage

**Median Calculation (2 tests)**
- Odd count returns middle value
- Even count returns average of middle two

**Edge Cases (4 tests)**
- No completed downloads
- Empty metrics list
- Null metrics
- Very fast download (< 1ms) without division errors

**Concurrency (2 tests)**
- Parallel detection with thread-safe counter
- Parallel detection with tracking prevents duplicates

**Parameter Validation (4 tests)**
- Invalid multiplier
- Invalid quantile (too low/high)
- Negative padding
- Negative max stragglers

Key Testing Approach:
- Uses helper methods to create fast/slow downloads naturally
- Slow downloads created first, then aged via Thread.Sleep
- Fast downloads complete immediately after creation
- No reflection or mocking needed for timing
- All tests are deterministic and repeatable

Test Results:
✅ All 25 comprehensive E2E tests pass
✅ All 43 total straggler tests pass (unit + basic E2E + comprehensive)
✅ Total test time: ~8 seconds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Added 5 targeted tests validating code review fixes:
- Duplicate detection prevention (issue #5)
- Atomic CTS replacement (issue #9)
- Cleanup in finally block (issue #3)
- Cancellable cleanup tasks (issue #7)
- Concurrent CTS cleanup safety

All tests use real objects (ConcurrentDictionary, CancellationTokenSource)
without mocks, following existing CloudFetch test patterns.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…HTTP

Created 15 comprehensive E2E tests following CloudFetchDownloaderTest.cs pattern:

Passing tests (8):
- FastDownloadsNotMarkedAsStraggler
- RequiresMinimumCompletionQuantile
- MonitoringThreadRespectsCancellation
- ParallelModeRespectsMaxParallelDownloads
- SequentialModeEnforcesOneDownloadAtATime
- NoStragglersDetectedInSequentialMode
- CleanShutdownDuringMonitoring
- FeatureDisabledByDefault

Tests validate:
- Monitoring thread lifecycle and cancellation
- Semaphore behavior (parallel and sequential modes)
- Minimum completion quantile requirement
- Feature disabled without configuration
- Clean shutdown during operations

Known issues (4 tests failing):
- Difficulty mocking abstract/internal HiveServer2Connection
- Needs investigation for proper property configuration

Test coverage includes straggler detection, sequential fallback,
semaphore management, retry logic, and complex scenarios.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ocused tests

Changes:
- Moved tests from E2E to Unit (were unit tests, not E2E)
- Reduced from 30 redundant tests to 10 critical tests
- Removed obvious tests (parameter validation, basic getters/setters)

Unit tests now focus on:
- Duplicate detection prevention across monitoring cycles
- Atomic CTS replacement for retries
- Cleanup execution in finally blocks
- Cleanup cancellation during shutdown
- Concurrent cleanup safety
- Counter overflow protection (long vs int)
- Median calculation correctness (even/odd count)
- Empty metrics null safety
- Concurrent modification thread safety

All 10 tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Adds straggler download mitigation feature to improve CloudFetch performance
by detecting and cancelling abnormally slow parallel downloads.

Implementation:
- New StragglerDownloadDetector class for detecting slow downloads
- New FileDownloadMetrics class for tracking download performance
- New CloudFetchStragglerMitigationConfig for configuration management
- Integration into CloudFetchDownloader with background monitoring thread
- Automatic fallback to sequential downloads after threshold

Configuration Parameters:
- adbc.databricks.cloudfetch.straggler_mitigation_enabled (default: false)
- adbc.databricks.cloudfetch.straggler_multiplier (default: 1.5)
- adbc.databricks.cloudfetch.straggler_quantile (default: 0.6)
- adbc.databricks.cloudfetch.straggler_padding_seconds (default: 5)
- adbc.databricks.cloudfetch.max_stragglers_per_query (default: 10)
- adbc.databricks.cloudfetch.synchronous_fallback_enabled (default: true)

Tests:
- 19 comprehensive unit tests covering basic functionality and advanced scenarios
- 19 E2E tests with mocked HTTP responses validating real-world scenarios
- All tests pass successfully

Documentation:
- straggler-mitigation-design.md: comprehensive design documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@msrathore-db msrathore-db changed the title docs(csharp/src/Drivers/Databricks): Added design docs for implementation of straggle download mitigation feat(csharp/src/Drivers/Databricks): Implement straggler download mitigation for CloudFetch Nov 4, 2025
Replace string.Split(string) with string.Split(string[], StringSplitOptions)
as the single-string overload is not available in .NET Framework 4.7.2.

Fixes compilation errors in CloudFetchStragglerDownloaderE2ETests.cs.
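
For reference, a minimal sketch of the portable form of the call is shown below. `SplitCompat` is a hypothetical helper written for this example and is not part of the PR.

```csharp
using System;

public static class SplitCompat
{
    // Splits on a multi-character separator using the string[] overload,
    // which exists on .NET Framework as well as on newer runtimes.
    public static string[] SplitOn(string input, string separator)
    {
        return input.Split(new[] { separator }, StringSplitOptions.None);
    }
}
```

For example, `SplitCompat.SplitOn("a::b::c", "::")` yields the three parts `a`, `b`, and `c`.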
@msrathore-db msrathore-db marked this pull request as ready for review November 5, 2025 12:42
@github-actions github-actions bot added this to the ADBC Libraries 21 milestone Nov 5, 2025
private readonly object _errorLock = new object();

// Straggler mitigation fields
private readonly bool _isStragglerMitigationEnabled;
Contributor

Can we move this change and the related logic to a new class?

4 participants