
@hongkunxu hongkunxu commented Sep 25, 2025

Description

Currently, Pinot’s DataIngestionJob has a limitation when performing backfill ingestion: it assumes that the backfill run will generate at least as many segments as the original ingestion.

When the backfill input directory contains fewer files than the original run, the segment generation job will produce fewer segments. As a result, only part of the existing segments will be replaced, and the remaining old segments will continue to exist in the table, causing stale data issues.

Example

  • Suppose table airlineStats has 2 segments for 2014-01-01:
    - airlineStats_2014-01-01_2014-01-01_0
    - airlineStats_2014-01-01_2014-01-01_1
  • The backfill input directory only contains 1 input file for the same date.
  • The segment generation job produces just 1 segment:
    - airlineStats_2014-01-01_2014-01-01_0
  • After pushing, only _0 gets replaced, while _1 from the original ingestion is still present, leading to incorrect/stale data.
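The example can be reproduced with a small set difference. This is only an illustrative sketch, assuming a pushed segment replaces an existing one solely when the names match; the class and method names are hypothetical and are not part of Pinot.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class StaleSegmentExample {
  /**
   * Returns the existing segments that are left untouched by a name-based push,
   * i.e. existing names for which no segment with the same name was pushed.
   */
  static Set<String> staleAfterPush(List<String> existing, List<String> pushed) {
    Set<String> stale = new LinkedHashSet<>(existing);
    stale.removeAll(pushed);
    return stale;
  }

  public static void main(String[] args) {
    List<String> existing = List.of(
        "airlineStats_2014-01-01_2014-01-01_0",
        "airlineStats_2014-01-01_2014-01-01_1");
    // The backfill run produced only one segment for the same date.
    List<String> pushed = List.of("airlineStats_2014-01-01_2014-01-01_0");

    // prints [airlineStats_2014-01-01_2014-01-01_1]
    System.out.println(staleAfterPush(existing, pushed));
  }
}
```

Any non-empty result means the table still serves data from the original ingestion for that time bucket.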

Impact

If raw data changes such that a given time bucket has fewer input files than the first ingestion run, backfill will fail to fully replace existing segments. This makes it difficult to rely on backfill for correcting historical data.

Proposal

I have added a new job command, BackfillIngestionJobCommand, which uses Pinot's segment lineage functionality as its core mechanism. The workflow consists of the following steps:

  1. Fetch Segments: Retrieve the existing segments that are scheduled for backfill and replacement.
  2. Generate Segments: Execute the segment generation job locally to produce new segments.
  3. Create Lineage: Establish a lineage entry to track the relationship between old and newly generated segments.
  4. Push & Finalize: Upload the new segments to Pinot and mark the lineage entry as completed to make them active.
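The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: SegmentClient and InMemoryClient are hypothetical stand-ins for a controller client, segment generation is assumed to have already produced the new segment names, every listed segment is assumed to fall inside the backfill range, and the new names are assumed to differ from the old ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

class BackfillLineageSketch {
  /** Hypothetical controller client; not the SegmentApiClient added in this PR. */
  interface SegmentClient {
    List<String> listSegments(String table);
    String startReplace(String table, List<String> from, List<String> to);
    void upload(String table, String segment);
    void endReplace(String table, String lineageEntryId);
  }

  /** In-memory fake: old segments stay active until the lineage entry is completed. */
  static class InMemoryClient implements SegmentClient {
    private final Set<String> active = new TreeSet<>();
    private final Set<String> uploaded = new TreeSet<>();
    private final Map<String, List<List<String>>> lineage = new HashMap<>();

    InMemoryClient(List<String> initial) {
      active.addAll(initial);
    }

    public List<String> listSegments(String table) {
      return new ArrayList<>(active);
    }

    public String startReplace(String table, List<String> from, List<String> to) {
      String id = UUID.randomUUID().toString();
      lineage.put(id, List.of(List.copyOf(from), List.copyOf(to)));
      return id;
    }

    public void upload(String table, String segment) {
      uploaded.add(segment);
    }

    public void endReplace(String table, String id) {
      List<List<String>> entry = lineage.remove(id);
      if (entry == null) {
        throw new IllegalArgumentException("Unknown lineage entry: " + id);
      }
      if (!uploaded.containsAll(entry.get(1))) {
        throw new IllegalStateException("Not all new segments were uploaded");
      }
      // Atomic swap in this fake: drop every old segment, activate every new one.
      active.removeAll(entry.get(0));
      active.addAll(entry.get(1));
    }
  }

  static List<String> backfill(SegmentClient client, String table, List<String> newSegments) {
    List<String> old = client.listSegments(table);             // 1. Fetch Segments
    // 2. Generate Segments: assumed done, result passed in as newSegments
    String id = client.startReplace(table, old, newSegments);  // 3. Create Lineage
    for (String segment : newSegments) {
      client.upload(table, segment);                           // 4. Push
    }
    client.endReplace(table, id);                              //    Finalize
    return client.listSegments(table);
  }

  public static void main(String[] args) {
    InMemoryClient client = new InMemoryClient(List.of("old_0", "old_1"));
    // Two old segments are replaced by a single new one; prints [new_0]
    System.out.println(backfill(client, "airlineStats", List.of("new_0")));
  }
}
```

Because the lineage entry records the full set of old segments, the replacement no longer depends on the new run producing the same number of segments as the original one.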

fix: #16889


codecov-commenter commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 6.17284% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.53%. Comparing base (4ba63e2) to head (9aca09a).
⚠️ Report is 17 commits behind head on master.

Files with missing lines | Patch % | Lines
...rg/apache/pinot/common/utils/SegmentApiClient.java | 5.00% | 76 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16890      +/-   ##
============================================
+ Coverage     63.49%   63.53%   +0.03%     
- Complexity     1410     1411       +1     
============================================
  Files          3067     3071       +4     
  Lines        180097   180398     +301     
  Branches      27555    27591      +36     
============================================
+ Hits         114355   114617     +262     
- Misses        56943    56957      +14     
- Partials       8799     8824      +25     
Flag | Coverage | Δ
custom-integration1 | 100.00% <ø> | (ø)
integration | 100.00% <ø> | (ø)
integration1 | 100.00% <ø> | (ø)
integration2 | 0.00% <ø> | (ø)
java-11 | 63.49% <6.17%> | (+0.01%) ⬆️
java-21 | 63.47% <6.17%> | (+0.01%) ⬆️
temurin | 63.53% <6.17%> | (+0.03%) ⬆️
unittests | 63.53% <6.17%> | (+0.03%) ⬆️
unittests1 | 56.39% <6.17%> | (+0.01%) ⬆️
unittests2 | 33.62% <0.00%> | (+0.05%) ⬆️



@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a new backfill ingestion job command to address limitations in Pinot's current DataIngestionJob when performing backfill operations with fewer segments than the original ingestion.

Key changes:

  • Introduces LaunchBackfillIngestionJobCommand that uses segment lineage functionality to ensure complete segment replacement
  • Modifies LaunchDataIngestionJobCommand to make key fields protected for inheritance
  • Adds a new SegmentApiClient utility for segment-related API operations
  • Integrates SSL context support for the new API client

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File | Description
LaunchDataIngestionJobCommand.java | Changes field visibility from private to protected to enable inheritance
LaunchBackfillIngestionJobCommand.java | New command that implements complete backfill workflow using segment lineage
PinotAdministrator.java | Registers the new LaunchBackfillIngestionJobCommand
TlsUtils.java | Installs default SSL context for SegmentApiClient
SegmentApiClient.java | New utility class for segment API operations with SSL support

hongkunxu and others added 3 commits September 26, 2025 09:58
…/LaunchBackfillIngestionJobCommand.java

Co-authored-by: Copilot <[email protected]>
…/LaunchBackfillIngestionJobCommand.java

Co-authored-by: Copilot <[email protected]>

Successfully merging this pull request may close these issues.

[Feature] BackfillIngestionJob: Ensure Stale Segments Are Removed
