
NUTCH-1446 Port NUTCH-1444 to trunk (Indexing should not create tempo…) #905

Open

shishir-kuet wants to merge 3 commits into apache:master from shishir-kuet:NUTCH-1446-fix

Conversation

@shishir-kuet

This PR ports the fix from NUTCH-1444 to trunk.

Changes:

  • Updated IndexerOutputFormat.java to avoid creating temporary files during indexing.

Verification:

  • Project builds successfully using ant.
  • Verified indexing pipeline locally.

JIRA Issue:
https://issues.apache.org/jira/browse/NUTCH-1446

Copilot AI review requested due to automatic review settings (March 7, 2026 21:07)

Copilot AI left a comment


Pull request overview

Ports NUTCH-1444’s approach to avoid creating MapReduce temporary output files during indexing by replacing the file-based FileOutputFormat implementation with a non-file-writing OutputFormat and a no-op committer.

Changes:

  • Reworked IndexerOutputFormat to extend OutputFormat instead of FileOutputFormat, removing file output generation.
  • Updated writer initialization to no longer use FileOutputFormat-derived unique part file names.
  • Added no-op checkOutputSpecs and OutputCommitter implementations to align with “no output files” behavior.


-    String name = getUniqueFile(context, "part", "");
-    writers.open(conf, name);
+    // open writers (no temporary file output anymore)
+    writers.open(conf, "index");

Copilot AI Mar 7, 2026


writers.open(conf, "index") uses a constant name, while the previous implementation used a task-unique name via getUniqueFile(...). If any IndexWriter implementation uses this name to create per-task resources (files, temp dirs, IDs, etc.), a constant can cause collisions between reducers/attempts. Consider incorporating the task attempt ID (or task ID) into the name to preserve the prior uniqueness without relying on FileOutputFormat.

Suggested change

-    writers.open(conf, "index");
+    String indexName = "index-" + context.getTaskAttemptID().toString();
+    writers.open(conf, indexName);

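The uniqueness concern above boils down to plain string construction. A minimal stand-alone illustration (the uniqueIndexName helper and the attempt-ID strings are hypothetical, not part of the patch):

```java
public class IndexNameDemo {
    // Hypothetical helper mirroring the per-task uniqueness that
    // FileOutputFormat.getUniqueFile() used to provide: embedding the task
    // attempt ID guarantees that two reducers, or two attempts of the same
    // reducer, never open a writer under the same name.
    static String uniqueIndexName(String taskAttemptId) {
        return "index-" + taskAttemptId;
    }

    public static void main(String[] args) {
        // Two attempts of the same reduce task get distinct names.
        System.out.println(uniqueIndexName("attempt_1757000000000_0001_r_000002_0"));
        System.out.println(uniqueIndexName("attempt_1757000000000_0001_r_000002_1"));
    }
}
```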
Contributor


The method open(Configuration conf, String name) of the IndexWriter class is deprecated. Once it is finally removed, the indexName parameter will no longer be used, so this is not a real problem.

Comment on lines +75 to +97
@Override
public OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException {

  return new OutputCommitter() {

    @Override
    public void setupJob(JobContext jobContext) {}

    @Override
    public void setupTask(TaskAttemptContext taskContext) {}

    @Override
    public boolean needsTaskCommit(TaskAttemptContext taskContext) {
      return false;
    }

    @Override
    public void commitTask(TaskAttemptContext taskContext) {}

    @Override
    public void abortTask(TaskAttemptContext taskContext) {}
  };

Copilot AI Mar 7, 2026


The anonymous no-op OutputCommitter here largely duplicates Hadoop's built-in null output commit semantics (see NullOutputFormat usage elsewhere, e.g. CleaningJob). To reduce maintenance risk across Hadoop upgrades (API surface changes) and keep behavior consistent, consider extending org.apache.hadoop.mapreduce.lib.output.NullOutputFormat and only overriding getRecordWriter, or otherwise reuse a shared no-op committer implementation instead of defining a new anonymous one.
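Concretely, the NullOutputFormat route suggested here could look roughly like the following. This is a sketch only, not the actual patch: it assumes Nutch's IndexWriters and NutchDocument types, and the IndexWriters constructor and write calls shown are assumptions about that API.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.nutch.indexer.IndexWriters;
import org.apache.nutch.indexer.NutchDocument;

// Extending NullOutputFormat inherits its no-op checkOutputSpecs and no-op
// OutputCommitter, so only getRecordWriter needs overriding to route
// documents to the index back-ends instead of discarding them.
public class IndexerOutputFormat extends NullOutputFormat<Text, NutchDocument> {

  @Override
  public RecordWriter<Text, NutchDocument> getRecordWriter(TaskAttemptContext context)
      throws IOException {
    Configuration conf = context.getConfiguration();
    final IndexWriters writers = new IndexWriters(conf); // constructor is an assumption
    // Task-attempt-unique name, preserving the uniqueness discussed above.
    writers.open(conf, "index-" + context.getTaskAttemptID());

    return new RecordWriter<Text, NutchDocument>() {
      @Override
      public void write(Text key, NutchDocument doc) throws IOException {
        writers.write(doc);
      }

      @Override
      public void close(TaskAttemptContext ctx) throws IOException {
        writers.close();
      }
    };
  }
}
```

This keeps the "no output files" behavior in one well-tested Hadoop class rather than a hand-rolled anonymous committer.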

Member


@shishir-kuet can you address this issue? Thank you.

@sebastian-nagel
Contributor

sebastian-nagel commented Mar 8, 2026

Thanks, @shishir-kuet, for getting this really old and long-standing issue resolved!

Would you mind addressing the auto-reported comments?

  • The license header needs to be fixed. Also the RAT CI build has failed because of the erroneous license header.
  • Regarding the hint about Hadoop's NullOutputFormat: indeed it may simplify the solution. I know, the original patch in NUTCH-1444 has chosen a different way.

@shishir-kuet
Author

Thanks for the review.

I updated the code to include the task attempt ID in the index name
to avoid potential collisions between reducers.

Please let me know if further changes are required.

@shishir-kuet shishir-kuet reopened this Mar 8, 2026
@shishir-kuet
Author

Thanks for the review.

I have addressed the comments:

  • Restored the full ASF license header
  • Updated the index name to include the task attempt ID
  • Cleaned up formatting issues

Please let me know if any further adjustments are needed.

@sonarqubecloud

sonarqubecloud bot commented Mar 9, 2026

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Author

@shishir-kuet left a comment


Thanks for the feedback. The explanatory comment has been restored and the update pushed.
