[NUTCH-3160] Remove System.exit(..) from reusable code by lfoppiano · Pull Request #903 · apache/nutch

lfoppiano · 2026-02-25T22:21:31Z

This small patch removes a System.exit() within the code, causing e.g. tests to be terminated without notice.

lewismc

Hi @lfoppiano congratulations on your first pull request!
For each error message, we can now capture and increase a counter metric as well. See the ErrorTracker Javadoc for more details.
Thank you

lfoppiano · 2026-02-26T07:17:50Z

@lewismc thank you for the review. I'm not sure I understood well, you want me to use the error tracker for this specific error?

Would something like the following be OK?

  /**
   * Configurable constructor
   * @param config A populated {@link CommonCrawlConfig}
   */
  public CommonCrawlDataDumper(CommonCrawlConfig config) {
    this(); 
    this.config = config;
  }

  /**
   * Constructor
   */
  public CommonCrawlDataDumper() {
    this.errorTracker = new ErrorTracker(NutchMetrics.ERROR_OTHER_TOTAL);
  }

and then something like this:

if (parts == null || parts.size() == 0) {
      LOG.error( "No segment directories found in {} ",
          segmentRootDir.getAbsolutePath());
      this.errorTracker.recordError(ErrorTracker.ErrorType.OTHER);
      return;
    }

I'm not sure which type of error I should use actually.

Thank you in advance

lfoppiano · 2026-02-26T07:19:13Z

Another question, @lewismc , would you like me to apply this to all error messages or only to the one related to the initial change?

sebastian-nagel · 2026-02-26T08:20:32Z

apply this to all error messages or only to the one related to the initial change?

@lfoppiano: I think it's sufficient to just track this error for now.

PRs should be focused on a single problem, otherwise they tend to take longer to be ready for merge.

We can open another issue to roll-out the error tracker entirely. What do you think, @lewismc ?

[NUTCH-3160] Remove System.exit(..) from reusable code

lewismc · 2026-02-26T17:00:23Z

Hi @lfoppiano I sent you a small PR to address some issues.

This very small change has uncovered something really interesting about how metrics can/should be implemented in Nutch Tools Vs Nutch MapReduce jobs.

The core issue is that in Nutch Tools (such as CommonCrawlDataDumper) counters are never actually emitted!

CommonCrawlDataDumper runs as a NutchTool (CLI or REST). It does not run inside a MapReduce job, so there is no TaskInputOutputContext (mapper/reducer context).
How ErrorTracker gets counts out: In MapReduce, components either:

Use the two-arg constructor ErrorTracker(group, context), which caches counters from the context, then call incrementCounters(type) so each error is written straight to Hadoop counters, or
Use the one-arg constructor ErrorTracker(group), call recordError(type) (which only updates in-memory counts), and then in cleanup call emitCounters(context) to flush those counts to the job’s counters.

What the dumper does: It uses the one-arg constructor and calls recordError(ErrorType.OTHER) in this case when there are no segment directories. That only updates the tracker’s internal AtomicLongs. There is no mapper/reducer cleanup phase and no context, so emitCounters(context) is never called.

The result is that errors are counted in memory inside the dumper process, but they are never written to Hadoop’s counter system. They don’t show up in the job UI, in job history, or in any tool that reads job counters. The behavior can be considered as “record but don’t report.”... which for now is fine.

As we continue to evolve the Nutch metrics system we can design how these errors can be made visible in the same way as in MapReduce jobs. Off the top of my head, we would likely have to add another reporting path (e.g. logging, a file, or a different metrics backend) because the Hadoop counter path is not available without a MapReduce context. This is an architectural choice and functional limitation of being so closely tied to the Hadoop metrics systems.

sonarqubecloud · 2026-02-26T17:27:01Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

lfoppiano · 2026-02-26T17:41:35Z

Thanks @lewismc for the extensive explanation.
I was indeed a bit confused earlier about the fact that this Dump class does not run inside Hadoop, and now is definitely more clear.

Probably a simpler approach would be to have a logger reading those metrics and output them either regularly or when they change, in the log console.

I did merged your PR in my branch, let me know if this PR is good to go.

Thanks!

lewismc · 2026-02-27T00:24:31Z

Probably a simpler approach would be to have a logger reading those metrics and output them either regularly or when they change, in the log console.

I don't disagree 😄 I do think that we could sort that out in a different/follow-up issue though.

This PR is good. I'll merge.

lewismc

+1 thanks for accommodating the suggestions @lfoppiano

NUTCH-3160 Remove System.exit(..) from reusable code

b5d9004

lfoppiano marked this pull request as ready for review February 25, 2026 22:31

This was referenced Feb 25, 2026

Enable report for running tests commoncrawl/nutch#45

Closed

Some tests are ignored when running through ant commoncrawl/nutch#43

Closed

Fix tests org.commoncrawl.* being ignored/skipped commoncrawl/nutch#44

Draft

lewismc self-requested a review February 26, 2026 03:35

lewismc requested changes Feb 26, 2026

View reviewed changes

lfoppiano and others added 3 commits February 26, 2026 09:30

NUTCH-3160 Log error in the ErrorTracker

4a2bb0b

[NUTCH-3160] Remove System.exit(..) from reusable code

35b1fbc

Merge pull request #1 from lewismc/bugfix/NUTCH-3160

ef28c58

[NUTCH-3160] Remove System.exit(..) from reusable code

lfoppiano requested a review from lewismc February 26, 2026 16:53

lewismc approved these changes Feb 27, 2026

View reviewed changes

lewismc merged commit da55708 into apache:master Feb 27, 2026
9 checks passed

lewismc assigned lfoppiano Feb 27, 2026

lfoppiano deleted the bugfix/NUTCH-3160 branch February 27, 2026 04:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NUTCH-3160] Remove System.exit(..) from reusable code#903

[NUTCH-3160] Remove System.exit(..) from reusable code#903
lewismc merged 4 commits intoapache:masterfrom
lfoppiano:bugfix/NUTCH-3160

lfoppiano commented Feb 25, 2026

Uh oh!

lewismc left a comment

Uh oh!

lfoppiano commented Feb 26, 2026

Uh oh!

lfoppiano commented Feb 26, 2026

Uh oh!

sebastian-nagel commented Feb 26, 2026

Uh oh!

lewismc commented Feb 26, 2026 •

edited

Loading

Uh oh!

sonarqubecloud bot commented Feb 26, 2026

Uh oh!

lfoppiano commented Feb 26, 2026

Uh oh!

lewismc commented Feb 27, 2026

Uh oh!

lewismc left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lfoppiano commented Feb 25, 2026

Uh oh!

lewismc left a comment

Choose a reason for hiding this comment

Uh oh!

lfoppiano commented Feb 26, 2026

Uh oh!

lfoppiano commented Feb 26, 2026

Uh oh!

sebastian-nagel commented Feb 26, 2026

Uh oh!

lewismc commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sonarqubecloud bot commented Feb 26, 2026

Quality Gate passed

Uh oh!

lfoppiano commented Feb 26, 2026

Uh oh!

lewismc commented Feb 27, 2026

Uh oh!

lewismc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lewismc commented Feb 26, 2026 •

edited

Loading