Alex Osborne edited this page Jul 4, 2018 · 3 revisions

Logs

Each crawl job has its own set of log files.

Logs are found in the "logs" directory under the directory of a specific job. The locations of specific log files are listed in the "Configuration-referenced paths" section of the job page.

Logging Properties

Logging properties can be set by modifying the logging.properties file that is located under the ./conf directory.  For information on using logging properties, visit http://logging.apache.org/log4j/.

Log Files
alerts.log

This log contains alerts that indicate problems with a crawl.

crawl.log

Heritrix writes one line to the crawl.log file for each URI it attempts to fetch. Below is a two-line extract from the log.

2011-06-23T17:12:08.802Z   200       1299 http://content-5.powells.com/robots.txt LREP http://content-5.powells.com/cgi-bin/imageDB.cgi?isbn=9780385518635 text/plain #014 20110623171208574+225 sha1:YIUOKDGOLGI5JYHDTXRFFQ5FF4N2EJRV - -
2011-06-23T17:12:09.591Z   200      15829 http://www.identitytheory.com/etexts/poetics.html L http://www.identitytheory.com/ text/html #025 20110623171208546+922 sha1:7AJUMSDTOMT4FN7MBFGGNJU3Z56MLCMW - -

Field Name

Description

Timestamp

The timestamp in ISO8601 format, to millisecond resolution.  The time is the instant of logging.

Fetch Status Code

Usually this is the HTTP response code but it can also be a negative number if URI processing was unexpectedly terminated.

Document Size

The size of the downloaded document in bytes.  For HTTP, this is the size of content only.  The size excludes the HTTP response headers.  For DNS, the size field is the total size for the DNS response.

Downloaded URI

The URI of the document downloaded.

Discovery Path

The breadcrumb codes (discovery path) showing the trail of downloads that led to the downloaded URI. As of Heritrix 3.1, the discovery path is limited to the last 50 hop types, prefixed by the number of earlier hops that were omitted. For example, a 62-hop path might now appear as "12+LLRLLLRELLLLRLLLRELLLLRLLLRELLLLRLLLRELLLLRLLLRELE". This limit reduces the size of the log and bounds memory usage.

The breadcrumb codes are as follows.

R: Redirect
E: Embed
X: Speculative embed (aggressive/JavaScript link extraction)
L: Link
P: Prerequisite (as for DNS or robots.txt before another URI)

Referrer

The referrer, that is, the URI that immediately preceded the downloaded URI. Both the discovery path and the referrer will be empty for seed URIs.

Mime Type

The downloaded document mime type.

Worker Thread ID

The id of the worker thread that downloaded the document.

Fetch Timestamp

The timestamp in RFC2550/ARC condensed digits-only format indicating when the network fetch was started. If appropriate, the duration of the fetch in milliseconds is appended to the timestamp, with a "+" character as separator.

SHA1 Digest

The SHA1 digest of the content only (headers are not digested).

Source Tag

The source tag inherited by the URI, if source tagging is enabled.

Annotations

If an annotation has been set, it is displayed here. Possible annotations include the number of times the URI was tried, the literal "lenTrunc" if the download was truncated because it exceeded the configured size limit, the literal "timeTrunc" if the download was truncated because it exceeded the configured time limit, and the literal "midFetchTrunc" if a mid-fetch filter determined that the download should be truncated.

warc

The name of the WARC/ARC file to which the crawled content is written.  This value will only be written if the logExtraInfo property of the loggerModule bean is set to true.  This logged information will be written in JSON format.
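The field descriptions above can be turned into a small parser. The following is a minimal Python sketch, not part of Heritrix. It assumes that fields are separated by runs of whitespace and contain no embedded whitespace, that an empty field is written as "-" (as the source tag and annotations are in the extract above), and that fetch timestamps are in UTC. The function and dictionary key names are made up for illustration.

```python
import json
from datetime import datetime, timezone

# Field order as described in the table above; the key names are illustrative.
FIELDS = [
    "timestamp", "status", "size", "uri", "discovery_path", "referrer",
    "mime_type", "thread", "fetch_timestamp", "digest", "source_tag",
    "annotations",
]

HOP_NAMES = {
    "R": "redirect", "E": "embed", "X": "speculative embed",
    "L": "link", "P": "prerequisite",
}


def parse_discovery_path(path):
    """Return (total_hops, hop_names) for a path such as 'LREP' or '12+LLR'."""
    if path == "-":                      # assumed marker for an empty path (seeds)
        return 0, []
    omitted = 0
    if "+" in path:                      # truncated path: '<omitted hops>+<last hops>'
        count, path = path.split("+", 1)
        omitted = int(count)
    return omitted + len(path), [HOP_NAMES.get(code, code) for code in path]


def parse_fetch_timestamp(value):
    """Split '20110623171208574+225' into (start datetime, duration in ms or None)."""
    if value == "-":
        return None, None
    stamp, _, duration = value.partition("+")
    start = datetime.strptime(stamp, "%Y%m%d%H%M%S%f")
    start = start.replace(tzinfo=timezone.utc)   # assumption: timestamps are UTC
    return start, (int(duration) if duration else None)


def parse_crawl_line(line):
    """Parse one crawl.log line into a dictionary."""
    # Split at most len(FIELDS) times so that a trailing JSON field stays whole.
    parts = line.strip().split(None, len(FIELDS))
    record = dict(zip(FIELDS, parts))
    record["status"] = int(record["status"])
    record["size"] = None if record["size"] == "-" else int(record["size"])
    record["hops"], record["hop_types"] = parse_discovery_path(record["discovery_path"])
    record["fetch_start"], record["fetch_ms"] = parse_fetch_timestamp(record["fetch_timestamp"])
    if len(parts) > len(FIELDS):         # extra JSON field, present only with logExtraInfo
        record["extra"] = json.loads(parts[len(FIELDS)])
    return record
```

Calling parse_crawl_line on the first line of the extract gives a status of 200, a size of 1299, and a four-hop discovery path (link, redirect, embed, prerequisite). The keys inside the optional JSON field are not interpreted here because they are not described on this page.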

progress-statistics.log

This log is written by the StatisticsTracker bean.  At configurable intervals, a log line detailing the progress of the crawl is written to this file.

Field Name

Description

timestamp

Timestamp in ISO8601 format indicating when the log line was written.

discovered

Number of URIs discovered to date.

queued

Number of URIs currently queued.

downloaded

Number of URIs downloaded to date.

doc/s(avg)

Number of documents downloaded per second since the last snapshot. The value in parentheses is measured since the crawl began.

KB/s(avg)

Amount in kilobytes downloaded per second since the last snapshot. The value in parentheses is measured since the crawl began.

dl-failures

Number of URIs that Heritrix has failed to download.

busy-thread

Number of toe threads busy processing a URI.

mem-use-KB

Amount of memory, in kilobytes, in use by the Java Virtual Machine.

heap-size-KB

The current heap size of the Java Virtual Machine, in kilobytes.

congestion

The congestion ratio is a rough estimate of how much initial capacity, as a multiple of current capacity, would be necessary to crawl the current workload at the maximum rate available given politeness settings.  This value is calculated by comparing the number of internal queues that are progressing against those that are waiting for a thread to become available.

max-depth

The size of the Frontier queue with the largest number of queued URIs.

avg-depth

The average size of all the Frontier queues.
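A progress line can be read into named columns in the same way. The Python sketch below is illustrative only. It assumes that the columns appear in the order listed above and are separated by whitespace, and that the doc/s(avg) and KB/s(avg) columns are written as the recent value immediately followed by the since-start value in parentheses. The function names are hypothetical.

```python
# Column names in the order given in the table above.
COLUMNS = [
    "timestamp", "discovered", "queued", "downloaded", "doc/s(avg)",
    "KB/s(avg)", "dl-failures", "busy-thread", "mem-use-KB", "heap-size-KB",
    "congestion", "max-depth", "avg-depth",
]
RATE_COLUMNS = {"doc/s(avg)", "KB/s(avg)"}


def parse_rate(text):
    """Split a 'recent(average)' value into a (recent, average) pair of floats."""
    recent, _, average = text.rstrip(")").partition("(")
    return float(recent), float(average)


def parse_progress_line(line):
    """Parse one progress line into a dictionary keyed by column name."""
    values = line.split()
    if len(values) != len(COLUMNS):
        # Anything that is not a thirteen-column data line is rejected.
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(values)))
    record = {}
    for name, value in zip(COLUMNS, values):
        if name == "timestamp":
            record[name] = value
        elif name in RATE_COLUMNS:
            record[name] = parse_rate(value)
        else:
            # Floats are used throughout because some columns may be fractional.
            record[name] = float(value)
    return record


# The values in this sample line are made up for illustration;
# they are not taken from a real crawl.
sample = "2011-06-23T17:12:28Z 120 80 40 2(1.5) 35(28) 0 5 51200 102400 1.5 30 8"
record = parse_progress_line(sample)
```

Any line that does not have exactly thirteen columns raises ValueError, so a caller that reads a whole file may want to catch that error and skip such lines.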

runtime-errors.log

This log captures unexpected exceptions and errors that occur during the crawl. Some may be due to hardware limitations (out of memory, although that error may occur without being written to this log), but most are probably due to software bugs, either in Heritrix's core or, more likely, in one of its pluggable classes.

uri-errors.log

This log stores errors that resulted from attempted URI fetches.  Usually the cause is non-existent URIs.  This log is usually only of interest to advanced users trying to explain unexpected crawl behavior.

frontier.recover.gz

The frontier.recover.gz file is a gzipped journal of Frontier events. It can be used to restore the Frontier after a crash.
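Because the journal is gzipped, it can be inspected with standard tools before it is used for recovery. The Python sketch below counts journal lines by their first whitespace-delimited token. It assumes that the journal is line-oriented text in which that first token identifies the kind of event, which this page does not state. The function name is hypothetical, and the sketch does not perform any recovery.

```python
import gzip
from collections import Counter


def count_journal_events(path):
    """Count lines in a gzipped journal, grouped by their first token."""
    counts = Counter()
    # errors="replace" keeps the scan going if a line is not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as journal:
        for line in journal:
            tokens = line.split(None, 1)
            if tokens:                   # skip blank lines
                counts[tokens[0]] += 1
    return counts
```

The file is read as a stream, so memory use depends on the number of distinct tokens and not on the size of the journal.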
