Skip to content

[NUTCH-3160] Remove System.exit(..) from reusable code#903

Merged
lewismc merged 4 commits intoapache:masterfrom
lfoppiano:bugfix/NUTCH-3160
Feb 27, 2026
Merged

[NUTCH-3160] Remove System.exit(..) from reusable code#903
lewismc merged 4 commits intoapache:masterfrom
lfoppiano:bugfix/NUTCH-3160

Conversation

@lfoppiano
Copy link
Contributor

This small patch removes a System.exit() within the code, causing e.g. tests to be terminated without notice.

Copy link
Member

@lewismc lewismc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lfoppiano congratulations on your first pull request!
For each error message, we can now capture and increase a counter metric as well. See the ErrorTracker Javadoc for more details.
Thank you

@lfoppiano
Copy link
Contributor Author

@lewismc thank you for the review. I'm not sure I understood well, you want me to use the error tracker for this specific error?

Would something like the following be OK?

  /**
   * Configurable constructor
   * @param config A populated {@link CommonCrawlConfig}
   */
  public CommonCrawlDataDumper(CommonCrawlConfig config) {
    this(); 
    this.config = config;
  }

  /**
   * Constructor
   */
  public CommonCrawlDataDumper() {
    this.errorTracker = new ErrorTracker(NutchMetrics.ERROR_OTHER_TOTAL);
  }

and then something like this:

if (parts == null || parts.size() == 0) {
      LOG.error( "No segment directories found in {} ",
          segmentRootDir.getAbsolutePath());
      this.errorTracker.recordError(ErrorTracker.ErrorType.OTHER);
      return;
    }

I'm not sure which type of error I should use actually.

Thank you in advance

@lfoppiano
Copy link
Contributor Author

Another question, @lewismc , would you like me to apply this to all error messages or only to the one related to the initial change?

@sebastian-nagel
Copy link
Contributor

apply this to all error messages or only to the one related to the initial change?

@lfoppiano: I think it's sufficient to just track this error for now.

PRs should be focused on a single problem, otherwise they tend to take longer to be ready for merge.

We can open another issue to roll-out the error tracker entirely. What do you think, @lewismc ?

@lfoppiano lfoppiano requested a review from lewismc February 26, 2026 16:53
@lewismc
Copy link
Member

lewismc commented Feb 26, 2026

Hi @lfoppiano I sent you a small PR to address some issues.

This very small change has uncovered something really interesting about how metrics can/should be implemented in Nutch Tools Vs Nutch MapReduce jobs.

The core issue is that in Nutch Tools (such as CommonCrawlDataDumper) counters are never actually emitted!

CommonCrawlDataDumper runs as a NutchTool (CLI or REST). It does not run inside a MapReduce job, so there is no TaskInputOutputContext (mapper/reducer context).
How ErrorTracker gets counts out: In MapReduce, components either:

  • Use the two-arg constructor ErrorTracker(group, context), which caches counters from the context, then call incrementCounters(type) so each error is written straight to Hadoop counters, or
  • Use the one-arg constructor ErrorTracker(group), call recordError(type) (which only updates in-memory counts), and then in cleanup call emitCounters(context) to flush those counts to the job’s counters.

What the dumper does: It uses the one-arg constructor and calls recordError(ErrorType.OTHER) in this case when there are no segment directories. That only updates the tracker’s internal AtomicLongs. There is no mapper/reducer cleanup phase and no context, so emitCounters(context) is never called.

The result is that errors are counted in memory inside the dumper process, but they are never written to Hadoop’s counter system. They don’t show up in the job UI, in job history, or in any tool that reads job counters. The behavior can be considered as “record but don’t report.”... which for now is fine.

As we continue to evolve the Nutch metrics system we can design how these errors can be made visible in the same way as in MapReduce jobs. Off the top of my head, we would likely have to add another reporting path (e.g. logging, a file, or a different metrics backend) because the Hadoop counter path is not available without a MapReduce context. This is an architectural choice and functional limitation of being so closely tied to the Hadoop metrics systems.

@sonarqubecloud
Copy link

@lfoppiano
Copy link
Contributor Author

Thanks @lewismc for the extensive explanation.
I was indeed a bit confused earlier about the fact that this Dump class does not run inside Hadoop, and now is definitely more clear.

Probably a simpler approach would be to have a logger reading those metrics and output them either regularly or when they change, in the log console.

I did merged your PR in my branch, let me know if this PR is good to go.

Thanks!

@lewismc
Copy link
Member

lewismc commented Feb 27, 2026

Probably a simpler approach would be to have a logger reading those metrics and output them either regularly or when they change, in the log console.

I don't disagree 😄 I do think that we could sort that out in a different/follow-up issue though.

This PR is good. I'll merge.

Copy link
Member

@lewismc lewismc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 thanks for accommodating the suggestions @lfoppiano

@lewismc lewismc merged commit da55708 into apache:master Feb 27, 2026
9 checks passed
@lfoppiano lfoppiano deleted the bugfix/NUTCH-3160 branch February 27, 2026 04:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants