Skip to content

Implicit max. value of URI cost and precedence (?) should raise warning if exceeded #502

@cgr71ii

Description

@cgr71ii

Hi!

I've been doing some observations in the code of Heritrix related to the URI precedence, what led me to the URI cost, and I think that is something I'm missing or there should be some kind of warning for the situation I observed. I've observed that with the default URI precedence policy, CostUriPrecedencePolicy, the precedence of the URI is assigned to the cost of the URI. I guess that the precedence of a URI is the way a URI is prioritized in the assigned queue when a URI is gathered to crawl, and the cost is the necessary "effort" for crawling that URI (used for things like rotate among the different queues). So, in CostUriPrecedencePolicy:

public void uriScheduled(CrawlURI curi) {
    curi.setPrecedence(curi.getHolderCost()); 
}

Now, the only way in the code that I've observed that getPrecedence() from class CrawlURI is used is in calculateInsertKey(CrawlURI) from class DbdMultipleWorkQueues. There, the URI precedence is used for building a binary value where some bytes of it is the URI precedence, but only 8 bits, 1 signed byte, from the URI precedence is used (in the code comment it says 1 signed byte, but I'm not sure where the sign bit is being used, and I think it's not being used at all):

// FROM THE COMMENT OF THE FUNCTION: "NOTE: Dangers here are:  (1) priorities or precedences over 2^7 (signed byte comparison)"
long precedence = Math.min(curi.getPrecedence(), 127);

On the other hand, when I was observing the code, I was not very sure that class DbdMultipleWorkQueues is the class that is always used for prioritize the URIs in the queues based on they precedence value. Because of this, I looked if the cost was the actual value being used for what I understand that is the precedence value (i.e. the metric used for prioritizing which URIs should be crawled next), and I found that getHolderCost() was only being used in processFinish(CrawlURI) from class WorkQueueFrontier, and with that method name doesn't sound like something that is being used all the time for prioritizing, but in class CrawlURI I fount out:

/** spot for an integer cost to be placed by external facility (frontier).
 *  cost is truncated to 8 bits at times, so should not exceed 255 */
protected int holderCost = UNCALCULATED;

The cost of the URI, according to the comment of the variable, should not exceed 255. Well, the only place where this value is used for something is in processFinish(CrawlURI) from class WorkQueueFrontier:

// ...
int holderCost = curi.getHolderCost();
// ...
wq.expend(holderCost); // all retries but DEFERRED cost
// ...
holderCost = 0; // no charge for disregarded URIs
// ...
wq.expend(holderCost); // successes & failures charge cost to queue

So, in this above code snippet I don't see any reason why the URI cost should not exceed 255, what leads to my guesses (of course, I might be wrong since I haven't dived as deep as might be necessary in the code in order to understand these concepts, that's why I detailed that much all the issue, sorry :/):

  1. The URI cost might exceed 255 without any problem, but that would affect the default, which is not the only one, URI precedence policy (i.e. class CostUriPrecedencePolicy). This makes that the comment of "not exceed 255" is a comment which makes the URI cost being dependent of a specific URI precedence policy. If this is true, is just that the comment, I think, shouldn't be there, but in class CostUriPrecedencePolicy since is the one using the URI cost in order to assign the precedence.
  2. I understand that the value 255 of the comment is making reference to 8 bits, but it seems that it's talking about an unsigned int value, but if this comment is based on the limit of 8 bits from class DbdMultipleWorkQueues, which is what I think, then this is a signed int value. I mean, it seems that the cost might be set with a value higher than 127, but if this is done, then the value 128 would be the value 0 if we use a signed int value, 129 would be -1, and so on... But still I see that Math.min is used in class DbdMultipleWorkQueues with a max. value of 127, so I'm not pretty sure about this point, since it seems that is discarding the sign bit.
  3. I think there should be a warning somewhere if the precedence is higher than the maximum value of the URI precedence value, since if a different cost URI assignment policy is being used (not class UnitCostAssignmentPolicy), if the cost URI policy is the default (i.e. class CostUriPrecedencePolicy), the cost of the cost URI assignment should not exceed 127 if I'm not wrong.

Have I missed something or all the explained is true and some kind of explicit warning is necessary if a cost or precedence exceeds the implicit maximum values? I know, many guesses, it might all be something weird in my head and nothing all of this makes any sense at all. If that's the case, I'm truly sorry :(

If something of this is right, I've implemented a custom cost URI assignment policy, whose maximum cost value is 10000, is this value being clipped to 255 or 127 or -254? :'(

Thank you!

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions