[Feature] BackfillIngestionJob: Ensure Stale Segments Are Removed #16889

@hongkunxu

Description

Currently, Pinot’s DataIngestionJob has a limitation when performing backfill ingestion: it implicitly assumes that a backfill run will generate at least as many segments as the original ingestion did.

When the backfill input directory contains fewer files than the original run, the segment generation job produces fewer segments. As the example below shows, a pushed segment replaces only the existing segment with the same name, so only some of the existing segments are replaced. The remaining old segments stay in the table and cause stale data issues.

Example

  • Suppose table airlineStats has 2 segments for 2014-01-01:
    - airlineStats_2014-01-01_2014-01-01_0
    - airlineStats_2014-01-01_2014-01-01_1

  • The backfill input directory only contains 1 input file for the same date.

  • The segment generation job produces just 1 segment:
    - airlineStats_2014-01-01_2014-01-01_0

  • After pushing, only _0 gets replaced, while _1 from the original ingestion is still present, leading to incorrect/stale data.
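
To make the failure mode concrete, here is a minimal sketch that models the table as a name-keyed map. It is only a simulation of the behavior described above, not Pinot code: `push_by_name` and the `"original"`/`"backfill"` labels are hypothetical names introduced for illustration.

```python
def push_by_name(table_segments, new_segments):
    """Simulate a push in which a new segment overwrites only a same-named one.

    Both arguments map segment name -> origin label (hypothetical model).
    """
    result = dict(table_segments)
    result.update(new_segments)
    return result


# State of the table after the original ingestion for 2014-01-01.
existing = {
    "airlineStats_2014-01-01_2014-01-01_0": "original",
    "airlineStats_2014-01-01_2014-01-01_1": "original",
}

# The backfill run produced a single segment for the same date.
backfill = {"airlineStats_2014-01-01_2014-01-01_0": "backfill"}

after_push = push_by_name(existing, backfill)

# Any segment still labeled "original" was not replaced and is now stale.
stale = sorted(name for name, origin in after_push.items() if origin == "original")
print(stale)  # ['airlineStats_2014-01-01_2014-01-01_1']
```

In this model the `_1` segment survives the push untouched, which matches the stale-data symptom described in the example.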

Impact

If raw data changes such that a given time bucket has fewer input files than the first ingestion run, backfill will fail to fully replace existing segments. This makes it difficult to rely on backfill for correcting historical data.

Proposal

Introduce a new job, tentatively named BackfillIngestionJob, which is designed to correctly handle these edge cases. This job should:

  1. Ensure that all original segments in the target time range are replaced/removed.
  2. Guarantee that stale data from older segments does not persist after backfill.
  3. Provide a consistent and reliable workflow for batch backfill ingestion.
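
One possible way to meet requirements 1 and 2 is to select existing segments by time range rather than by name, then remove any that the backfill did not regenerate. The sketch below illustrates that idea only. `Segment`, `plan_backfill`, and the rule that a segment counts as "in range" when its interval lies entirely inside the backfill range are all assumptions made for this example, not part of Pinot or of this proposal.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Segment:
    """Hypothetical minimal view of a segment: its name and time interval."""
    name: str
    start: date
    end: date


def plan_backfill(existing, generated, range_start, range_end):
    """Return (overwrite, delete) segment names for a backfill of the range.

    Assumption: an existing segment belongs to the backfill if its whole
    interval lies inside [range_start, range_end].
    """
    in_range = {
        s.name for s in existing
        if range_start <= s.start and s.end <= range_end
    }
    new_names = {s.name for s in generated}
    overwrite = sorted(in_range & new_names)  # replaced by a same-named segment
    delete = sorted(in_range - new_names)     # would otherwise remain as stale data
    return overwrite, delete


day = date(2014, 1, 1)
existing = [
    Segment("airlineStats_2014-01-01_2014-01-01_0", day, day),
    Segment("airlineStats_2014-01-01_2014-01-01_1", day, day),
    # Hypothetical segment for another day; it must be left alone.
    Segment("airlineStats_2014-01-02_2014-01-02_0", date(2014, 1, 2), date(2014, 1, 2)),
]
generated = [Segment("airlineStats_2014-01-01_2014-01-01_0", day, day)]

overwrite, delete = plan_backfill(existing, generated, day, day)
print(overwrite)  # ['airlineStats_2014-01-01_2014-01-01_0']
print(delete)     # ['airlineStats_2014-01-01_2014-01-01_1']
```

A real implementation would also need to decide how to treat segments that only partially overlap the range, and to apply the replacement and removal together so that queries never see a mix of old and new data.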
