While looking at the cov command recently, I realized that it performs extremely poorly with a large number of addresses (5000+) provided in the IP list argument. For example, passing one's rawaddrman addresses (40k+ addrs) would take upwards of 10 minutes to get a coverage check. Not good.
The reason is simple, and is fundamental to kartograf: looking up an IP address in a list of network prefixes n is of complexity O(n). Thus for looking up m number of IPs, the worst case would be m * n runtime.
Turns out the canonical way of looking up IP addresses in networks is via a trie. It's a binary tree, where each node represent a bit in the IP address. So you can "walk" down the bits in an IP address until you find the longest prefix matching, and see if it has an associated ASN. In short, for any IP lookup within a list of IP networks, we want to use a trie because the lookup is constant time. Building the trie remains O(m * n): you pay the upfront cost of building it, to search it for "free".
Benchmarking cov, I got the following results on my machine (32GB mem),
using some local data I had, with a fixed map file of 692_583 lines, and the corresponding with IP list length:
| ip_list_len |
current_cov |
cov_with_trie |
| 437 |
6s |
8.5s |
| 5407 |
1m14s |
8.9s |
| 65049 |
10m + |
9.8s |
I have three PRs in the works related to this:
There will be some implementation details to sort out as the behavior may change a bit, so I wanted to open this issue for discussion and visibility. Overall we will get better performance and nicer code.
While looking at the
covcommand recently, I realized that it performs extremely poorly with a large number of addresses (5000+) provided in the IP list argument. For example, passing one'srawaddrmanaddresses (40k+ addrs) would take upwards of 10 minutes to get a coverage check. Not good.The reason is simple, and is fundamental to kartograf: looking up an IP address in a list of network prefixes n is of complexity O(n). Thus for looking up m number of IPs, the worst case would be m * n runtime.
Turns out the canonical way of looking up IP addresses in networks is via a trie. It's a binary tree, where each node represent a bit in the IP address. So you can "walk" down the bits in an IP address until you find the longest prefix matching, and see if it has an associated ASN. In short, for any IP lookup within a list of IP networks, we want to use a trie because the lookup is constant time. Building the trie remains O(m * n): you pay the upfront cost of building it, to search it for "free".
Benchmarking
cov, I got the following results on my machine (32GB mem),using some local data I had, with a fixed map file of 692_583 lines, and the corresponding with IP list length:
I have three PRs in the works related to this:
covcommand Use trie in cov command #116mergecommand Use trie in merge process #123There will be some implementation details to sort out as the behavior may change a bit, so I wanted to open this issue for discussion and visibility. Overall we will get better performance and nicer code.