NUTCH-1446 Port NUTCH-1444 to trunk (Indexing should not create tempo… #905
shishir-kuet wants to merge 3 commits into apache:master from …
Conversation
Pull request overview
Ports NUTCH-1444’s approach to avoid creating MapReduce temporary output files during indexing by replacing the file-based FileOutputFormat implementation with a non-file-writing OutputFormat and a no-op committer.
Changes:
- Reworked `IndexerOutputFormat` to extend `OutputFormat` instead of `FileOutputFormat`, removing file output generation.
- Updated writer initialization to no longer use `FileOutputFormat`-derived unique part file names.
- Added no-op `checkOutputSpecs` and `OutputCommitter` implementations to align with "no output files" behavior.
```diff
-    String name = getUniqueFile(context, "part", "");
-    writers.open(conf, name);
+    // open writers (no temporary file output anymore)
+    writers.open(conf, "index");
```
writers.open(conf, "index") uses a constant name, while the previous implementation used a task-unique name via getUniqueFile(...). If any IndexWriter implementation uses this name to create per-task resources (files, temp dirs, IDs, etc.), a constant can cause collisions between reducers/attempts. Consider incorporating the task attempt ID (or task ID) into the name to preserve the prior uniqueness without relying on FileOutputFormat.
```diff
-    writers.open(conf, "index");
+    String indexName = "index-" + context.getTaskAttemptID().toString();
+    writers.open(conf, indexName);
```
The method `open(Configuration conf, String name)` of the `IndexWriter` class is deprecated. Once it is finally removed, the `indexName` parameter will no longer be used at all, so this is not a real problem.
```java
@Override
public OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException {

  return new OutputCommitter() {

    @Override
    public void setupJob(JobContext jobContext) {}

    @Override
    public void setupTask(TaskAttemptContext taskContext) {}

    @Override
    public boolean needsTaskCommit(TaskAttemptContext taskContext) {
      return false;
    }

    @Override
    public void commitTask(TaskAttemptContext taskContext) {}

    @Override
    public void abortTask(TaskAttemptContext taskContext) {}
  };
}
```
The anonymous no-op OutputCommitter here largely duplicates Hadoop's built-in null output commit semantics (see NullOutputFormat usage elsewhere, e.g. CleaningJob). To reduce maintenance risk across Hadoop upgrades (API surface changes) and keep behavior consistent, consider extending org.apache.hadoop.mapreduce.lib.output.NullOutputFormat and only overriding getRecordWriter, or otherwise reuse a shared no-op committer implementation instead of defining a new anonymous one.
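For illustration, a minimal sketch of this suggestion. It shows only the shape of the change: extending `NullOutputFormat` inherits Hadoop's no-op `checkOutputSpecs` and null `OutputCommitter`, so only `getRecordWriter` needs overriding. The key/value types (`Text`, `NutchIndexAction`), the `IndexWriters` usage, and the `IndexerRecordWriter` name are assumptions about the Nutch side, not verified against trunk:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Sketch only: NullOutputFormat already supplies a no-op committer and
// output-spec check, so no anonymous OutputCommitter is needed here.
public class IndexerOutputFormat
    extends NullOutputFormat<Text, NutchIndexAction> {

  @Override
  public RecordWriter<Text, NutchIndexAction> getRecordWriter(
      TaskAttemptContext context) throws IOException, InterruptedException {
    // Open the pluggable index writers with a task-attempt-unique name,
    // as suggested in the earlier review comment. IndexWriters and
    // IndexerRecordWriter stand in for the actual Nutch classes; their
    // exact constructors/signatures are assumed for this sketch.
    IndexWriters writers = new IndexWriters(context.getConfiguration());
    writers.open(context.getConfiguration(),
        "index-" + context.getTaskAttemptID().toString());
    return new IndexerRecordWriter(writers);
  }
}
```

This keeps the "no output files" behavior in one place (Hadoop's own null-output semantics, as already used by CleaningJob) rather than duplicating it.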
@shishir-kuet can you address this issue? Thank you.
Thanks, @shishir-kuet, for getting this really old and long-standing issue resolved! Would you mind addressing the auto-reported comments?
Thanks for the review. I updated the code to include the task attempt ID in the index name. Please let me know if further changes are required.
Thanks for the review. I have addressed the comments. Please let me know if any further adjustments are needed.
shishir-kuet left a comment:

Thanks for the feedback. The explanatory comment has been restored and the update pushed.


This PR ports the fix from NUTCH-1444 to trunk.
Changes:
Verification:
JIRA Issue:
https://issues.apache.org/jira/browse/NUTCH-1446