Skip to content

Trie implementation #117

@jurraca

Description

@jurraca

While looking at the cov command recently, I realized that it performs extremely poorly with a large number of addresses (5000+) provided in the IP list argument. For example, passing one's rawaddrman addresses (40k+ addrs) would take upwards of 10 minutes to get a coverage check. Not good.

The reason is simple, and is fundamental to kartograf: looking up an IP address in a list of network prefixes n is of complexity O(n). Thus for looking up m number of IPs, the worst case would be m * n runtime.

Turns out the canonical way of looking up IP addresses in networks is via a trie. It's a binary tree, where each node represent a bit in the IP address. So you can "walk" down the bits in an IP address until you find the longest prefix matching, and see if it has an associated ASN. In short, for any IP lookup within a list of IP networks, we want to use a trie because the lookup is constant time. Building the trie remains O(m * n): you pay the upfront cost of building it, to search it for "free".

Benchmarking cov, I got the following results on my machine (32GB mem),
using some local data I had, with a fixed map file of 692_583 lines, and the corresponding with IP list length:

ip_list_len current_cov cov_with_trie
437 6s 8.5s
5407 1m14s 8.9s
65049 10m + 9.8s

I have three PRs in the works related to this:

There will be some implementation details to sort out as the behavior may change a bit, so I wanted to open this issue for discussion and visibility. Overall we will get better performance and nicer code.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions