NUTCH-1446 Port NUTCH-1444 to trunk (Indexing should not create tempo… #905
shishir-kuet wants to merge 3 commits into apache:master from …
Conversation
Pull request overview
Ports NUTCH-1444’s approach to avoid creating MapReduce temporary output files during indexing by replacing the file-based FileOutputFormat implementation with a non-file-writing OutputFormat and a no-op committer.
Changes:
- Reworked `IndexerOutputFormat` to extend `OutputFormat` instead of `FileOutputFormat`, removing file output generation.
- Updated writer initialization to no longer use `FileOutputFormat`-derived unique part file names.
- Added no-op `checkOutputSpecs` and `OutputCommitter` implementations to align with "no output files" behavior.
```diff
-    String name = getUniqueFile(context, "part", "");
-    writers.open(conf, name);
+    // open writers (no temporary file output anymore)
+    writers.open(conf, "index");
```
writers.open(conf, "index") uses a constant name, while the previous implementation used a task-unique name via getUniqueFile(...). If any IndexWriter implementation uses this name to create per-task resources (files, temp dirs, IDs, etc.), a constant can cause collisions between reducers/attempts. Consider incorporating the task attempt ID (or task ID) into the name to preserve the prior uniqueness without relying on FileOutputFormat.
```diff
-    writers.open(conf, "index");
+    String indexName = "index-" + context.getTaskAttemptID().toString();
+    writers.open(conf, indexName);
```
The method `open(Configuration conf, String name)` of the `IndexWriter` class is deprecated. Once it is finally removed, the `indexName` parameter will no longer be used at all, so this is not a real problem.
```java
@Override
public OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException {

  return new OutputCommitter() {

    @Override
    public void setupJob(JobContext jobContext) {}

    @Override
    public void setupTask(TaskAttemptContext taskContext) {}

    @Override
    public boolean needsTaskCommit(TaskAttemptContext taskContext) {
      return false;
    }

    @Override
    public void commitTask(TaskAttemptContext taskContext) {}

    @Override
    public void abortTask(TaskAttemptContext taskContext) {}
  };
}
```
The anonymous no-op OutputCommitter here largely duplicates Hadoop's built-in null output commit semantics (see NullOutputFormat usage elsewhere, e.g. CleaningJob). To reduce maintenance risk across Hadoop upgrades (API surface changes) and keep behavior consistent, consider extending org.apache.hadoop.mapreduce.lib.output.NullOutputFormat and only overriding getRecordWriter, or otherwise reuse a shared no-op committer implementation instead of defining a new anonymous one.
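For illustration, a minimal sketch of this suggestion. It shows only the shape of the change: extending `NullOutputFormat` inherits Hadoop's no-op `checkOutputSpecs` and null `OutputCommitter`, so only `getRecordWriter` needs overriding. The key/value types (`Text`, `NutchIndexAction`), the `IndexWriters` usage, and the `IndexerRecordWriter` name are assumptions about the Nutch side, not verified against trunk:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Sketch only: NullOutputFormat already supplies a no-op committer and
// output-spec check, so no anonymous OutputCommitter is needed here.
public class IndexerOutputFormat
    extends NullOutputFormat<Text, NutchIndexAction> {

  @Override
  public RecordWriter<Text, NutchIndexAction> getRecordWriter(
      TaskAttemptContext context) throws IOException, InterruptedException {
    // Open the pluggable index writers with a task-attempt-unique name,
    // as suggested in the earlier review comment. IndexWriters and
    // IndexerRecordWriter stand in for the actual Nutch classes; their
    // exact constructors/signatures are assumed for this sketch.
    IndexWriters writers = new IndexWriters(context.getConfiguration());
    writers.open(context.getConfiguration(),
        "index-" + context.getTaskAttemptID().toString());
    return new IndexerRecordWriter(writers);
  }
}
```

This keeps the "no output files" behavior in one place (Hadoop's own null-output semantics, as already used by CleaningJob) rather than duplicating it.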
@shishir-kuet can you address this issue? Thank you.
Thanks, @shishir-kuet, for getting this really old and long-standing issue resolved! Would you mind addressing the auto-reported comments?
Thanks for the review. I updated the code to include the task attempt ID in the index name. Please let me know if further changes are required.
Thanks for the review. I have addressed the comments. Please let me know if any further adjustments are needed.
shishir-kuet left a comment:

Thanks for the feedback. The explanatory comment has been restored and the update pushed.


This PR ports the fix from NUTCH-1444 to trunk.
Changes:
Verification:
JIRA Issue:
https://issues.apache.org/jira/browse/NUTCH-1446