[Feature] Develop LaunchBackfillIngestionJob to support complete backfill #16890
Signed-off-by: Hongkun Xu <[email protected]>
Codecov Report❌ Patch coverage is
Coverage Diff (master vs. #16890)
Metric | master | #16890 | +/- |
---|---|---|---|
Coverage | 63.49% | 63.53% | +0.03% |
Complexity | 1410 | 1411 | +1 |
Files | 3067 | 3071 | +4 |
Lines | 180097 | 180398 | +301 |
Branches | 27555 | 27591 | +36 |
Hits | 114355 | 114617 | +262 |
Misses | 56943 | 56957 | +14 |
Partials | 8799 | 8824 | +25 |
Pull Request Overview
This PR introduces a new backfill ingestion job command to address limitations in Pinot's current DataIngestionJob when performing backfill operations with fewer segments than the original ingestion.
Key changes:
- Introduces LaunchBackfillIngestionJobCommand that uses segment lineage functionality to ensure complete segment replacement
- Modifies LaunchDataIngestionJobCommand to make key fields protected for inheritance
- Adds a new SegmentApiClient utility for segment-related API operations
- Integrates SSL context support for the new API client
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
LaunchDataIngestionJobCommand.java | Changes field visibility from private to protected to enable inheritance |
LaunchBackfillIngestionJobCommand.java | New command that implements complete backfill workflow using segment lineage |
PinotAdministrator.java | Registers the new LaunchBackfillIngestionJobCommand |
TlsUtils.java | Installs default SSL context for SegmentApiClient |
SegmentApiClient.java | New utility class for segment API operations with SSL support |
…/LaunchBackfillIngestionJobCommand.java Co-authored-by: Copilot <[email protected]>
…/LaunchBackfillIngestionJobCommand.java Co-authored-by: Copilot <[email protected]>
…ntApiClient.java Co-authored-by: Copilot <[email protected]>
Description
Currently, Pinot’s DataIngestionJob has a limitation when performing backfill ingestion. The job assumes that the backfill run will generate the same number of segments (or more) compared to the original ingestion.
When the backfill input directory contains fewer files than the original run, the segment generation job will produce fewer segments. As a result, only part of the existing segments will be replaced, and the remaining old segments will continue to exist in the table, causing stale data issues.
Example
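Segments created by the original ingestion:
- airlineStats_2014-01-01_2014-01-01_0
- airlineStats_2014-01-01_2014-01-01_1
Segments generated by a backfill run with fewer input files:
- airlineStats_2014-01-01_2014-01-01_0
After the backfill, only segment _0 is replaced; segment _1 remains in the table with the old data.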
Impact
If raw data changes such that a given time bucket has fewer input files than the first ingestion run, backfill will fail to fully replace existing segments. This makes it difficult to rely on backfill for correcting historical data.
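The failure mode can be reproduced with a small in-memory model. This is only a sketch: `StaleSegmentDemo` and `pushByName` are hypothetical names, the table is modeled as a map from segment name to a data-version label, and no Pinot API is called. It assumes, as the description above states, that a pushed segment replaces an existing one only when the names match.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StaleSegmentDemo {

    // Hypothetical model of a name-based push: each new segment overwrites
    // the existing segment with the same name and touches nothing else.
    static Map<String, String> pushByName(Map<String, String> table,
                                          List<String> newSegments,
                                          String version) {
        Map<String, String> result = new LinkedHashMap<>(table);
        for (String name : newSegments) {
            result.put(name, version);
        }
        return result;
    }

    public static void main(String[] args) {
        // Original ingestion produced two segments for the day.
        Map<String, String> table = new LinkedHashMap<>();
        table.put("airlineStats_2014-01-01_2014-01-01_0", "original");
        table.put("airlineStats_2014-01-01_2014-01-01_1", "original");

        // The backfill run only produces segment _0.
        Map<String, String> after = pushByName(
            table, List.of("airlineStats_2014-01-01_2014-01-01_0"), "backfill");

        // Segment _0 now carries backfill data; segment _1 is still "original".
        after.forEach((name, version) -> System.out.println(name + " -> " + version));
    }
}
```

In this model the table still contains two segments after the backfill, one of which holds stale data, which matches the example above.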
Proposal
I have developed a new job command, LaunchBackfillIngestionJobCommand, which uses Pinot's segment lineage functionality as its core mechanism. The detailed workflow consists of the following steps:
fix: #16889
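As a rough illustration of why a lineage-style replacement avoids stale segments, the sketch below swaps an explicit set of old segments for a set of new ones in a single step. It is an in-memory model under stated assumptions, not this PR's implementation: `LineageSwapDemo` and `replaceSegments` are hypothetical names, and no Pinot controller API is invoked.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LineageSwapDemo {

    // Hypothetical model of a lineage-style replace: every segment listed in
    // segmentsFrom is removed and every segment in segmentsTo is added,
    // regardless of how many segments are on each side.
    static Map<String, String> replaceSegments(Map<String, String> table,
                                               Set<String> segmentsFrom,
                                               Map<String, String> segmentsTo) {
        if (!table.keySet().containsAll(segmentsFrom)) {
            throw new IllegalArgumentException("segmentsFrom contains unknown segments");
        }
        Map<String, String> result = new LinkedHashMap<>(table);
        result.keySet().removeAll(segmentsFrom); // drop all old segments first
        result.putAll(segmentsTo);               // then add the backfill segments
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("airlineStats_2014-01-01_2014-01-01_0", "original");
        table.put("airlineStats_2014-01-01_2014-01-01_1", "original");

        // Two old segments are replaced by a single backfill segment.
        Map<String, String> after = replaceSegments(
            table,
            Set.of("airlineStats_2014-01-01_2014-01-01_0",
                   "airlineStats_2014-01-01_2014-01-01_1"),
            Map.of("airlineStats_2014-01-01_2014-01-01_0", "backfill"));

        // Only the backfill segment remains; no stale segment _1 is left behind.
        System.out.println(after);
    }
}
```

The design point is that the set of segments to retire is stated explicitly rather than inferred from name collisions, so a backfill that produces fewer segments still removes all of the old ones.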