@lewismc lewismc commented Jan 9, 2026

See NUTCH-3142 for background.

This PR implements Missing Error Context (recommendation #8) from the Nutch Hadoop Metrics Analysis report. It introduces a centralized ErrorTracker utility that categorizes errors by type and emits structured Hadoop counters, replacing the previous approach of counting errors without categorization.

Changes

New Files

  • src/java/org/apache/nutch/metrics/ErrorTracker.java - Thread-safe error categorization utility that:

    • Defines 8 error categories: NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER
    • Automatically categorizes exceptions based on type and class name
    • Supports cached counters for performance in hot paths
    • Provides both local accumulation (recordError/emitCounters) and direct increment (incrementCounters) APIs
  • src/test/org/apache/nutch/metrics/TestErrorTracker.java - Comprehensive test suite with 26 tests covering:

    • Exception categorization for all error types
    • Nutch-specific exceptions (ProtocolException, ParseException, ScoringFilterException, etc.)
    • Cached counter initialization and usage
    • Thread safety
    • Nested cause chain handling
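To illustrate the kind of categorization described above (type checks first, a class-name fallback, and a walk down the cause chain), here is a minimal, self-contained sketch. It is not the actual ErrorTracker code: the class name `ErrorCategorySketch`, the specific matching rules, and the depth limit are all assumptions made for illustration. Only the eight category names come from this PR.

```java
import java.net.MalformedURLException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; the real ErrorTracker may use different rules.
class ErrorCategorySketch {

  /** The eight categories listed in this PR. */
  enum Category { NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER }

  /** Walks the cause chain and returns the first specific category found. */
  static Category categorize(Throwable t) {
    int depth = 0; // assumed guard against cyclic cause chains
    for (Throwable cur = t; cur != null && depth < 16; cur = cur.getCause(), depth++) {
      Category c = categorizeOne(cur);
      if (c != Category.OTHER) {
        return c;
      }
    }
    return Category.OTHER;
  }

  private static Category categorizeOne(Throwable t) {
    // Type-based checks first; timeouts are tested before other I/O types.
    if (t instanceof SocketTimeoutException || t instanceof TimeoutException) {
      return Category.TIMEOUT;
    }
    if (t instanceof MalformedURLException || t instanceof URISyntaxException) {
      return Category.URL;
    }
    if (t instanceof UnknownHostException || t instanceof SocketException) {
      return Category.NETWORK;
    }
    // Fallback on the simple class name, so component-specific exceptions
    // (e.g. a ParseException or ScoringFilterException) need no compile-time dependency.
    String name = t.getClass().getSimpleName();
    if (name.contains("Protocol")) return Category.PROTOCOL;
    if (name.contains("Parse")) return Category.PARSING;
    if (name.contains("Scoring")) return Category.SCORING;
    if (name.contains("Indexing")) return Category.INDEXING;
    return Category.OTHER;
  }

  public static void main(String[] args) {
    Throwable wrapped = new RuntimeException(new UnknownHostException("example.invalid"));
    System.out.println(categorize(wrapped));
  }
}
```

Matching on "Indexing" rather than "Index" is deliberate in this sketch, so that IndexOutOfBoundsException does not land in the INDEXING category.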

Modified Files

Metrics Constants (NutchMetrics.java)

  • Added standard error counter constants: ERROR_TOTAL, ERROR_NETWORK_TOTAL, ERROR_PROTOCOL_TOTAL, ERROR_PARSING_TOTAL, ERROR_URL_TOTAL, ERROR_SCORING_TOTAL, ERROR_INDEXING_TOTAL, ERROR_TIMEOUT_TOTAL, ERROR_OTHER_TOTAL
  • Removed redundant component-specific error counters (which I initially introduced in NUTCH-3132 "Standardize existing Nutch metrics naming and implementation", #871); these are now handled by ErrorTracker

Component Integrations

| Component | File | Changes |
|---|---|---|
| Fetcher | FetcherThread.java, Fetcher.java | Integrated ErrorTracker for fetch error categorization |
| Parser | ParseSegment.java | Added error tracking for parsing and scoring exceptions |
| Indexer | IndexerMapReduce.java | Replaced errorsScoringFilterCounter and errorsIndexingFilterCounter with ErrorTracker |
| Generator | Generator.java | Replaced URL filter and malformed URL counters with ErrorTracker |
| Injector | Injector.java | Added error tracking for URL processing exceptions |
| CrawlDb | CrawlDbReducer.java | Added error tracking for scoring filter exceptions |
| HostDb | UpdateHostDbMapper.java, ResolverThread.java | Replaced malformedUrlCounter with ErrorTracker; added DNS resolution error tracking |
| Sitemap | SitemapProcessor.java | Added error tracking for sitemap processing exceptions |
| WARC | WARCExporter.java | Replaced exceptionCounter and invalidUriCounter with ErrorTracker |

Dependencies (ivy/ivy.xml)

  • Added mockito-core and mockito-junit-jupiter (v5.18.0) as test dependencies. I had been thinking about doing this in some previous PRs but didn't want to introduce new dependencies to the project. In this case, they made for much cleaner, more intuitive tests.

Benefits

  1. Better Debugging: Errors are now categorized by type, making it easier to identify patterns
  2. Reduced Counter Cardinality: Uses a fixed set of error counters (one total plus one per category, ~10 counters) instead of an unbounded number of component-specific counters
  3. Consistent API: All components use the same error tracking mechanism
  4. Performance: Cached counters avoid repeated lookups in hot paths. This is consistent with NUTCH-3141 "Cache Hadoop Counter References in Hot Paths", #878
  5. Thread Safety: ConcurrentHashMap ensures safe concurrent access
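
The local accumulation pattern (record errors in memory, then emit them as counters in one pass) and the ConcurrentHashMap-based thread safety mentioned above can be sketched roughly as follows. This is an illustrative stand-in rather than the ErrorTracker implementation: the class `LocalErrorCountsSketch`, the counter name strings, and the `ObjLongConsumer` sink (used here in place of a Hadoop task context so the example runs without Hadoop on the classpath) are all assumptions.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.ObjLongConsumer;

// Hypothetical sketch; names and counter strings are not taken from Nutch.
class LocalErrorCountsSketch {

  enum Category { NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER }

  // LongAdder keeps increments cheap under contention from many threads.
  private final ConcurrentHashMap<Category, LongAdder> counts = new ConcurrentHashMap<>();
  private final LongAdder total = new LongAdder();

  /** Records one error locally; safe to call from multiple threads. */
  void recordError(Category category) {
    counts.computeIfAbsent(category, k -> new LongAdder()).increment();
    total.increment();
  }

  long getCount(Category category) {
    LongAdder adder = counts.get(category);
    return adder == null ? 0L : adder.sum();
  }

  long getTotal() {
    return total.sum();
  }

  /**
   * Emits the accumulated values to a sink. In a real job the sink would
   * increment Hadoop counters; here it is any (name, value) consumer.
   */
  void emitCounters(ObjLongConsumer<String> sink) {
    sink.accept("errors_total", total.sum());
    counts.forEach((category, adder) ->
        sink.accept("errors_" + category.name().toLowerCase(Locale.ROOT) + "_total",
            adder.sum()));
  }

  public static void main(String[] args) {
    LocalErrorCountsSketch tracker = new LocalErrorCountsSketch();
    tracker.recordError(Category.NETWORK);
    tracker.recordError(Category.URL);
    tracker.emitCounters((name, value) -> System.out.println(name + "=" + value));
  }
}
```

Only categories that were actually recorded are emitted in this sketch, which keeps the number of emitted counters bounded by the fixed category set.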

I've incorporated these new counters locally into the nutch-grafana-resources collector configuration and dashboards, and will push those updates entirely separately. This patch is best tested by inspecting the Hadoop counters in STDOUT/logging.
