NUTCH-3142 Add Error Context to Metrics #882
Open
See NUTCH-3142 for background.
This PR implements Missing Error Context (recommendation #8) from the Nutch Hadoop Metrics Analysis report. It introduces a centralized `ErrorTracker` utility that categorizes errors by type and emits structured Hadoop counters, replacing the previous approach of counting errors without categorization.

Changes
New Files
- `src/java/org/apache/nutch/metrics/ErrorTracker.java` - Thread-safe error categorization utility:
  - Error categories: `NETWORK`, `PROTOCOL`, `PARSING`, `URL`, `SCORING`, `INDEXING`, `TIMEOUT`, `OTHER`
  - Buffered (`recordError`/`emitCounters`) and direct increment (`incrementCounters`) APIs
- `src/test/org/apache/nutch/metrics/TestErrorTracker.java` - Comprehensive test suite with 26 tests

Modified Files
Metrics Constants (`NutchMetrics.java`)
- New constants for the counters emitted by `ErrorTracker`: `ERROR_TOTAL`, `ERROR_NETWORK_TOTAL`, `ERROR_PROTOCOL_TOTAL`, `ERROR_PARSING_TOTAL`, `ERROR_URL_TOTAL`, `ERROR_SCORING_TOTAL`, `ERROR_INDEXING_TOTAL`, `ERROR_TIMEOUT_TOTAL`, `ERROR_OTHER_TOTAL`

Component Integrations
- `FetcherThread.java`, `Fetcher.java`: uses `ErrorTracker` for fetch error categorization
- `ParseSegment.java`
- `IndexerMapReduce.java`: replaced `errorsScoringFilterCounter` and `errorsIndexingFilterCounter` with `ErrorTracker`
- `Generator.java`: uses `ErrorTracker`
- `Injector.java`
- `CrawlDbReducer.java`
- `UpdateHostDbMapper.java`, `ResolverThread.java`: replaced `malformedUrlCounter` with `ErrorTracker`; added DNS resolution error tracking
- `SitemapProcessor.java`
- `WARCExporter.java`: replaced `exceptionCounter` and `invalidUriCounter` with `ErrorTracker`

Dependencies (`ivy/ivy.xml`)
- Added `mockito-core` and `mockito-junit-jupiter` (v5.18.0) as test dependencies. I had been thinking about doing this in some previous PRs but didn't want to introduce new dependencies to the project. In this case, it made for much cleaner, more intuitive tests.

Benefits
- `ConcurrentHashMap` ensures safe concurrent access

I've incorporated these new counters locally into the nutch-grafana-resources collector configuration and dashboards and will push those updates separately. This patch is best tested by looking at the Hadoop counters in STDOUT/logging.
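
For reviewers who want a feel for the buffered pattern before reading the diff, here is a minimal, self-contained sketch of the idea. It is not the code in this PR. The class name `ErrorTrackerSketch`, the exception-to-category mapping, the counter-name strings, and the `ObjLongConsumer` sink standing in for a Hadoop task context are all assumptions made for illustration; only the category names, the `recordError`/`emitCounters` method names, and the use of `ConcurrentHashMap` come from the description above.

```java
import java.net.ConnectException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.ObjLongConsumer;

/** Illustrative sketch only; the real ErrorTracker in this PR may differ. */
final class ErrorTrackerSketch {

  /** Category names taken from the PR description. */
  enum ErrorType { NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER }

  // Per-category buffers; ConcurrentHashMap + LongAdder allow concurrent recording.
  private final Map<ErrorType, LongAdder> counts = new ConcurrentHashMap<>();

  /** Hypothetical mapping from exception class to category (an assumption). */
  static ErrorType categorize(Throwable t) {
    if (t instanceof SocketTimeoutException) return ErrorType.TIMEOUT;
    if (t instanceof MalformedURLException || t instanceof URISyntaxException) return ErrorType.URL;
    if (t instanceof UnknownHostException || t instanceof ConnectException) return ErrorType.NETWORK;
    return ErrorType.OTHER;
  }

  /** Hypothetical counter name; the actual string values in NutchMetrics are not shown here. */
  static String counterName(ErrorType type) {
    return "ERROR_" + type.name() + "_TOTAL";
  }

  /** Buffers one error under its category without touching any external counter. */
  void recordError(Throwable t) {
    counts.computeIfAbsent(categorize(t), k -> new LongAdder()).increment();
  }

  /** Flushes buffered counts to the sink (a stand-in for Hadoop counters) and resets them. */
  void emitCounters(ObjLongConsumer<String> sink) {
    long total = 0;
    for (ErrorType type : ErrorType.values()) {
      LongAdder adder = counts.get(type);
      long n = (adder == null) ? 0 : adder.sumThenReset();
      if (n > 0) {
        sink.accept(counterName(type), n);
        total += n;
      }
    }
    if (total > 0) {
      sink.accept("ERROR_TOTAL", total);
    }
  }

  public static void main(String[] args) {
    ErrorTrackerSketch tracker = new ErrorTrackerSketch();
    tracker.recordError(new SocketTimeoutException("read timed out"));
    tracker.recordError(new UnknownHostException("example.invalid"));
    tracker.recordError(new IllegalStateException("unexpected"));
    Map<String, Long> emitted = new TreeMap<>();
    tracker.emitCounters((name, n) -> emitted.merge(name, n, Long::sum));
    System.out.println(emitted);
  }
}
```

In a real job the sink would presumably wrap the task context's counter lookup and increment. Buffering keeps the hot path to a map lookup and an add, at the cost of needing an explicit flush (for example during task cleanup).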