output.segment.dir.uri is only read from ClusterInfo configs and not from TaskConfig in MergeRollupTask #17014

@t0mpere

Description

Based on this discussion on Slack.

I've found a bug where output.segment.dir.uri is never read from the task config; it is always taken from the controller config. This leads to an edge case: if the deep store is not configured globally, it's impossible to run a METADATA push job.

The function getPushTaskConfig should prioritise taskConfig over the global controllerConfig.

I will open a PR to fix and refactor the function if we agree on the behaviour.
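
To make the proposed precedence concrete, here is a minimal sketch of the behaviour I have in mind. It is not the actual getPushTaskConfig code: the class name, method name, and parameters are hypothetical, and both the "TAR" fallback and the controller-dir-plus-table-name fallback are assumptions inferred from the Result shown below.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; this is NOT Pinot's actual getPushTaskConfig.
final class PushConfigPrecedenceSketch {
  static final String PUSH_MODE = "push.mode";
  static final String OUTPUT_SEGMENT_DIR_URI = "output.segment.dir.uri";

  // taskConfig: the table-level task config (e.g. the MergeRollupTask map).
  // controllerDataDir: value of controller.data.dir, may be null if not set.
  // rawTableName: table name without the type suffix, e.g. "LOADED_HOURLY".
  static Map<String, String> resolvePushConfig(
      Map<String, String> taskConfig, String controllerDataDir, String rawTableName) {
    Map<String, String> resolved = new HashMap<>();

    // 1. Prefer the value set explicitly in the task config.
    String taskOutputDir = taskConfig.get(OUTPUT_SEGMENT_DIR_URI);
    if (taskOutputDir != null && !taskOutputDir.isEmpty()) {
      resolved.put(OUTPUT_SEGMENT_DIR_URI, taskOutputDir);
    } else if (controllerDataDir != null && !controllerDataDir.isEmpty()) {
      // 2. Otherwise fall back to the global controller data dir (assumed layout).
      resolved.put(OUTPUT_SEGMENT_DIR_URI, controllerDataDir + "/" + rawTableName);
    }

    // Same rule for push.mode: the task config wins, "TAR" is an assumed default.
    String pushMode = taskConfig.get(PUSH_MODE);
    resolved.put(PUSH_MODE, pushMode != null ? pushMode : "TAR");
    return resolved;
  }
}
```

With the task config from the example below, this sketch would keep push.mode as METADATA and the gs:// output URI; with an empty task config it would fall back to the controller data dir, which matches what happens today.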

Example:

Controller config

controller.data.dir=/var/pinot/controller/data

Task config

"MergeRollupTask": {
          "1day.mergeType": "concat",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "1d",
          "1day.maxNumRecordsPerSegment": "100000",
          "1day.maxNumRecordsPerTask": "500000",
          "1day.maxNumParallelBuckets": "10",
          "minionInstanceTag": "merge",
          "push.mode": "METADATA",
          "output.segment.dir.uri": "gs://my-bucket/LOADED_HOURLY/merged",
          "schedule": "0 1 * * * ?"
        }

Result

{
  "configs": {
    "push.mode": "TAR",
    ...
    "output.segment.dir.uri": "/var/pinot/controller/data/LOADED_HOURLY",
    ...
  },
  "tableName": "LOADED_HOURLY_OFFLINE",
  "taskId": "Task_MergeRollupTask_5cd6364b-7012-4f60-8e8b-58ee4a2196c1_1759943074336_0",
  "taskType": "MergeRollupTask"
}

Expected

{
  "configs": {
    "push.mode": "METADATA",
    ...
    "output.segment.dir.uri": "gs://my-bucket/LOADED_HOURLY/merged",
    ...
  },
  "tableName": "LOADED_HOURLY_OFFLINE",
  "taskId": "Task_MergeRollupTask_5cd6364b-7012-4f60-8e8b-58ee4a2196c1_1759943074336_0",
  "taskType": "MergeRollupTask"
}

cc: @shounakmk219
