@lewismc lewismc commented Sep 12, 2024

Work in Progress

This PR begins to address NUTCH-3064 by upgrading the com.maxmind.geoip2:geoip2 dependency to v4.2.0. It has not yet been tested in a distributed Nutch deployment. Although no additional dependencies have been added, I would still like to test a full deployment.

In addition to the proposed upgrade, I performed some refactoring which I consider to be improvements.

Refactoring/Improvements

  1. Establishes unit test(s). I have more work to do here to accommodate the change in logic for loading the MaxMind DB file(s) from the classpath.
  2. Removes duplication of configuration documentation, including it only in nutch-default.xml.
  3. Removes insightsService as the default value for the index.geoip.usage configuration property. The value is now empty.
  4. Introduces a new property, index.geoip.db.file, which facilitates specifying the MaxMind DB file packaged with the Nutch .job.
  5. Adds Javadoc to every class and method of the index-geoip plugin (more work to be done here).
  6. Uses the updated GeoIP Database guidance, specifically:
     • Using the try methods; "...If you are looking up many IPs that are not contained in the database, the try method will be slightly faster as they do not need to construct and throw an exception."
     • Using DB caching; "... Using this cache, lookup performance is significantly improved at the cost of a small (~2MB) memory overhead."
  7. Updates the set of fields available for each database, as new fields have been added to the Java API since I first wrote this plugin.
  8. Simplifies the values available for the index.geoip.usage configuration property. Available values are now anonymous, asn, city, connection, domain, insights or isp. THIS IS A BACKWARDS-INCOMPATIBLE BREAKING CHANGE which we would need to call out in the release notes. I decided to implement this change based on recent feedback, which I agree with.
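
To illustrate the "try" guidance quoted in the list above, here is a minimal sketch of the two lookup styles. It uses a plain map as a stand-in for a database, so the class `TryLookupSketch`, its methods, and the value "example-city" are all hypothetical placeholders and not part of the MaxMind API. In the GeoIP2 Java API itself, the try-style methods (for example `DatabaseReader.tryCity`) return an `Optional`.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a GeoIP database; not the MaxMind API.
class TryLookupSketch {
    private final Map<String, String> cityByIp;

    TryLookupSketch(Map<String, String> cityByIp) {
        this.cityByIp = cityByIp;
    }

    // Exception style: every miss constructs and throws an exception.
    String city(String ip) {
        String city = cityByIp.get(ip);
        if (city == null) {
            throw new IllegalArgumentException("address not found: " + ip);
        }
        return city;
    }

    // "try" style: a miss is an empty Optional, so no exception is built.
    Optional<String> tryCity(String ip) {
        return Optional.ofNullable(cityByIp.get(ip));
    }

    public static void main(String[] args) {
        // "example-city" is placeholder data for illustration only.
        TryLookupSketch db = new TryLookupSketch(Map.of("151.101.2.132", "example-city"));
        db.tryCity("151.101.2.132").ifPresent(c -> System.out.println("city=" + c));
        // An address that is not in the map yields an empty Optional.
        System.out.println(db.tryCity("192.0.2.1").isPresent());
    }
}
```

When many crawled hosts are missing from a database, the Optional-returning style avoids the cost of building a stack trace for each miss.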

Future work

I can anticipate a use case where multiple MaxMind DBs and/or web service lookups are chained together, with the results aggregated within one NutchDocument. I did not wish to complicate this PR any further, so any implementation will first be described in another Jira ticket.

@lewismc lewismc marked this pull request as draft September 12, 2024 23:56
@lewismc lewismc self-assigned this Sep 13, 2024
@lewismc lewismc changed the title from "WIP NUTCH-3064 Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index to v4.2.0" to "NUTCH-3064 Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index" Jan 12, 2026
@lewismc lewismc marked this pull request as ready for review January 12, 2026 00:32
@lewismc lewismc changed the title from "NUTCH-3064 Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index" to "NUTCH-3064 Upgrade index-geoip to GeoIP2 5.0.2" Jan 12, 2026

lewismc commented Jan 12, 2026

This PR now upgrades the index-geoip plugin to use MaxMind GeoIP2 Java API 5.0.2, with significant architectural improvements including support for multiple database types and in-memory caching.

Changes

Dependency Updates

  • geoip2: upgraded to 5.0.2
  • maxmind-db: upgraded to 4.0.2
  • jackson-datatype-jsr310: added 2.20.1 (new transitive dependency)

Performance Improvement — CHMCache

Database readers now use CHMCache (ConcurrentHashMap Cache) from the maxmind-db library for improved lookup performance:

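import com.maxmind.db.CHMCache;
import com.maxmind.geoip2.DatabaseReader;
// db is the File (or InputStream) of the .mmdb database to load
DatabaseReader reader = new DatabaseReader.Builder(db)
    .withCache(new CHMCache())
    .build();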

This caches parsed database nodes in memory, reducing disk I/O and improving throughput when the same IP prefixes are queried repeatedly during indexing.
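
As a rough illustration of why such a cache helps, the sketch below shows the general idea of a ConcurrentHashMap-backed cache: a record is decoded once per offset and later requests are served from memory. This is not the maxmind-db CHMCache source; the class `DecodeCacheSketch` and its decoder function are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Simplified illustration of a ConcurrentHashMap-backed decode cache.
// NOT the maxmind-db CHMCache implementation; all names are hypothetical.
class DecodeCacheSketch {
    private final ConcurrentHashMap<Integer, String> cache = new ConcurrentHashMap<>();
    private final Function<Integer, String> decoder;

    DecodeCacheSketch(Function<Integer, String> decoder) {
        this.decoder = decoder;
    }

    // Decode the record at this offset once; repeat calls are served from the map.
    String get(int offset) {
        return cache.computeIfAbsent(offset, decoder);
    }

    public static void main(String[] args) {
        AtomicInteger decodes = new AtomicInteger();
        DecodeCacheSketch cache = new DecodeCacheSketch(offset -> {
            decodes.incrementAndGet(); // count how often real decoding happens
            return "record@" + offset;
        });
        cache.get(42);
        cache.get(42); // second call hits the cache
        System.out.println(decodes.get()); // prints 1
    }
}
```

The trade-off is the one quoted earlier in this PR: faster repeated lookups in exchange for a small amount of extra memory.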

New Configuration Options in conf/nutch-default.xml

The plugin now supports multiple database types simultaneously. Configure each by setting its file path:

| Property | Description |
| --- | --- |
| index.geoip.db.anonymous | Anonymous IP database: identifies VPNs, proxies, Tor exit nodes |
| index.geoip.db.asn | ASN database: autonomous system number and organization |
| index.geoip.db.city | City database: city, subdivision, country, continent, coordinates |
| index.geoip.db.connection | Connection Type database: Cable/DSL, Cellular, Corporate, Satellite |
| index.geoip.db.domain | Domain database: second-level domain for the IP |
| index.geoip.db.isp | ISP database: ISP name, organization, ASN |
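
A minimal sketch of how these per-database properties might be resolved is shown below. It assumes that a blank or missing value means the database is disabled, and it uses `java.util.Properties` as a stand-in for the Nutch/Hadoop configuration object. The class `GeoIpConfigSketch` and its enum are hypothetical and are not taken from the plugin source.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

class GeoIpConfigSketch {
    // Hypothetical enum mirroring the property suffixes listed above.
    enum DatabaseType { ANONYMOUS, ASN, CITY, CONNECTION, DOMAIN, ISP }

    // Collect only the database types whose property holds a non-blank path.
    // Assumption: an empty or missing value disables that database.
    static Map<DatabaseType, String> configuredDatabases(Properties conf) {
        Map<DatabaseType, String> paths = new EnumMap<>(DatabaseType.class);
        for (DatabaseType type : DatabaseType.values()) {
            String key = "index.geoip.db." + type.name().toLowerCase(Locale.ROOT);
            String path = conf.getProperty(key, "").trim();
            if (!path.isEmpty()) {
                paths.put(type, path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("index.geoip.db.asn", "GeoLite2-ASN.mmdb");
        conf.setProperty("index.geoip.db.city", "GeoLite2-City.mmdb");
        System.out.println(configuredDatabases(conf).keySet()); // prints [ASN, CITY]
    }
}
```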

MaxMind Insights Web Service Support

| Property | Description |
| --- | --- |
| index.geoip.insights.userid | User ID for MaxMind Precision Insights API |
| index.geoip.insights.licensekey | License key for the Insights API |

Architecture Improvements

  • Refactored to support multiple databases via EnumMap<DatabaseType, DatabaseReader>
  • Each database type is loaded independently and queried in sequence
  • Proper resource cleanup via Closeable implementation
  • Graceful error handling per-database (one failure doesn't block others)
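
The points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's code: the nested `Reader` interface stands in for MaxMind's `DatabaseReader`, and the class `MultiDbSketch` plus the enum values are hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical names throughout; the real plugin works with DatabaseReader.
class MultiDbSketch implements Closeable {
    enum DatabaseType { ASN, CITY, COUNTRY }

    // Minimal stand-in for a database reader.
    interface Reader extends Closeable {
        Map<String, String> lookup(String ip) throws IOException;

        @Override
        default void close() throws IOException { }
    }

    private final Map<DatabaseType, Reader> readers = new EnumMap<>(DatabaseType.class);

    void register(DatabaseType type, Reader reader) {
        readers.put(type, reader);
    }

    // Query each database in enum order; a failure in one is logged and skipped.
    Map<String, String> lookupAll(String ip) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<DatabaseType, Reader> e : readers.entrySet()) {
            try {
                fields.putAll(e.getValue().lookup(ip));
            } catch (IOException | RuntimeException ex) {
                System.err.println("Lookup failed for " + e.getKey() + ": " + ex.getMessage());
            }
        }
        return fields;
    }

    // Close every reader, remembering the first failure so none is skipped.
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (Reader r : readers.values()) {
            try {
                r.close();
            } catch (IOException ex) {
                if (first == null) {
                    first = ex;
                }
            }
        }
        readers.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

Keying the readers by an enum keeps the query order deterministic, and catching failures per database lets a corrupt or missing file degrade the output rather than abort indexing.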

Files Modified

  • src/plugin/index-geoip/ — plugin source, tests, dependencies, and config
  • build.xml — root build configuration
  • conf/nutch-default.xml — new GeoIP configuration properties
  • src/plugin/build.xml — plugin build configuration
  • src/plugin/indexer-solr/schema.xml — Solr schema field definitions


lewismc commented Jan 21, 2026

The most recent updates address a field duplication issue which could occur when chaining multiple GeoIP databases.
Here is an example of running indexchecker:

./runtime/local/bin/nutch indexchecker https://nutch.apache.org
...
accuracyRadius :	1000
isPublicProxy :	false
countryIsoCode :	US
cityNetworkAddress :	151.101.0.0/21
countryNetworkAddress :	151.101.0.0/21
countryGeoNameId :	6252001
autonomousSystemNumber :	54113
title :	Apache Nutch™
content :	Apache Nutch™
Apache Nutch™
Apache Nutch™
Community
Development
Docs
Download
News
The Apache Softwa
isHostingProvider :	false
isTorExitNode :	false
digest :	09f55cdd88bb9a668023f96143ec9605
host :	nutch.apache.org
id :	https://nutch.apache.org
isAnycast :	false
continentCode :	NA
isLegitimateProxy :	false
ip :	151.101.2.132
timeZone :	America/Chicago
isAnonymousVpn :	false
isResidentialProxy :	false
autonomousSystemOrganization :	FASTLY
url :	https://nutch.apache.org
isAnonymous :	false
tstamp :	Tue Jan 20 20:21:34 PST 2026
latLon :	37.751,-97.822
countryInEuropeanUnion :	false
continentGeoNameId :	6255149
countryName :	United States
continentName :	North America
asnNetworkAddress :	151.101.0.0/16
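
One possible way to avoid duplicated values when several databases report the same field is sketched below. This is an assumption about the general technique, not a description of what this PR implements. The class `FieldDedupSketch` is a hypothetical stand-in for a multi-valued index document and does not use the NutchDocument API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldDedupSketch {
    // Stand-in for a multi-valued index document; not the NutchDocument API.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Append the value unless the field already holds an identical one.
    void addIfAbsent(String name, String value) {
        List<String> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (!values.contains(value)) {
            values.add(value);
        }
    }

    List<String> get(String name) {
        return fields.getOrDefault(name, List.of());
    }

    public static void main(String[] args) {
        FieldDedupSketch doc = new FieldDedupSketch();
        doc.addIfAbsent("countryIsoCode", "US"); // e.g. from a City lookup
        doc.addIfAbsent("countryIsoCode", "US"); // e.g. from a Country lookup
        // Database-specific field names keep distinct values apart.
        doc.addIfAbsent("cityNetworkAddress", "151.101.0.0/21");
        doc.addIfAbsent("asnNetworkAddress", "151.101.0.0/16");
        System.out.println(doc.get("countryIsoCode")); // prints [US]
    }
}
```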

Required configuration

<property>
  <name>store.ip.address</name>
  <value>true</value>
  <description>Enables us to capture the specific IP address
  (InetSocketAddress) of the host which we connect to via the given
  protocol. Currently supported by: protocol-ftp, protocol-http,
  protocol-okhttp, protocol-htmlunit, protocol-selenium.  Note that
  the IP address is required by the plugin index-geoip and when
  writing WARC files.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|geoip)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
  <name>index.geoip.db.asn</name>
  <value>GeoLite2-ASN.mmdb</value>
  <description>
  GeoIP2/GeoLite2 ASN database file (MMDB format).
  Provides autonomous system number and organization information.
  </description>
</property>

<property>
  <name>index.geoip.db.city</name>
  <value>GeoLite2-City.mmdb</value>
  <description>
  GeoIP2/GeoLite2 City database file (MMDB format).
  Provides city, subdivision, country, continent, and location data.
  </description>
</property>

<property>
  <name>index.geoip.db.country</name>
  <value>GeoLite2-Country.mmdb</value>
  <description>
  GeoIP2/GeoLite2 Country database file (MMDB format).
  Provides country, continent, and represented country information.
  This is a lighter-weight alternative to the City database when only
  country-level information is needed.
  </description>
</property>

@lewismc lewismc closed this Jan 21, 2026
@lewismc lewismc reopened this Jan 21, 2026