-
Notifications
You must be signed in to change notification settings - Fork 772
How To Crawl
- Requirements
- Work Area and Files
- Configuration
- Run the Crawl
- Check Crawl Status
- WARC Output Files
- Add WARCs to the Archive
- Troubleshooting
Crawl before you walk. – C. Darwin
To run a web crawl with Heritrix, you'll need the code (Java class files), config files, and adequate system resources:
- at least 4 GB of RAM
- enough local disk space for all the output WARC files
- as few competing processes as possible
This examples in this document are based on crawls run in a VM
named blub-dev
, with files named blub_
...something. Use your own
naming conventions.
blub@blub-dev:/1$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 4.0K 2.0G 1% /dev
tmpfs 396M 5.1M 391M 2% /run
/dev/vda1 28G 5.8G 21G 22% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 2.0G 0 2.0G 0% /run/shm
none 100M 0 100M 0% /run/user
/dev/vdb1 126G 41G 86G 33% /1
nfs-home:/home 7.1T 6.1T 990G 87% /home
The /1
directory has lots of space. We created a directory
called /1/crawling
and copied Heritrix code and config files,
originally created by John Lekashman for crawls of Cuban sites.
# cd /1
# mkdir crawling
# chown blub.users crawling
^D
$ cd /1/crawling
$ cp /home/lekash/blub.crap/* .
$ tar xvf cuba00000.tar
$ mv cuba0000 blub_test
We need a standard repo or other source of these code and config files.
The README
is not required, but is useful to document details of the
crawl that are not specified in configuration files.
$ cd blub_test
$ vi README
Now let's modify our configuration for this crawl.
$ mv cuba test_config
$ cd test_config
First we need a seed text file of URLs that will be the crawl origins.
In this case, we created the file mcloud_seeds.txt
from the RSS
file https://core.mediacloud.org/static/mc_20160317.rss:
$ head mcloud.seeds.txt
http://feeds.washingtonpost.com/c/34656/f/636663/s/4e55a97b/sc/7/l/0L0Swashingtonpost0N0Cblogs0Cright0Eturn0Cwp0C20A160C0A30C170Cwe0Ecant0Edefend0Edemocracy0Eby0Eelecting0Ea0Ethug0C/story01.htm
http://feeds.washingtonpost.com/c/34656/f/636663/s/4e56fd0f/sc/7/l/0L0Swashingtonpost0N0Cblogs0Cright0Eturn0Cwp0C20A160C0A30C170Ca0Elot0Eof0Erepublicans0Eare0Ein0Elindsey0Egrahams0Eshoes0C/story01.htm
http://www.news.com.au/sport/nrl/five-things-you-need-to-know-about-new-nrl-ceo-todd-greenberg/story-fndujljl-1227793059847?from=public_rss
http://www.abc.es/opinion/abci-secta-201603180423_noticia.html
...
Now we edit the config values:
$ mv cuba.cxml blub_test.cxml
$ vi blub_test.cxml
The Heritrix configuration is a complex combination of XML syntax and Java bean names. The choices depend on how wide and deep your crawl should be, whether to check duplicate files, and so on. This example is a fairly constrained crawl.
Edit the values shown below, and leave the other lines as is:
# (editable settings in blub_test.cxml)
# these will be displayed on the crawl dashboard:
metadata.jobName = blub_test
metadata.description = mediacloud_crawl
# output WARC file name prefix, by convention <= 6 chars, ALL CAPS:
warcwriter.prefix = MCLOUD
# Java write threads:
warcwriter.poolMaxActive = 6
# the "seed" file of URLs:
<property name="path" value="mcloud.seeds.txt">
# include this to enable deduplication:
<import resource="hq-beans.xml">
# set the hop count (how far to crawl):
<bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
<property name="maxHops" value="1" />
</bean>
# SURTs (normalized URLs) to avoid (this was one of the copied files):
<bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
<property name="surtsSourceFile" value="negative-surts.txt">
Rename and edit this file:
$ mv hq-beans.xml.0 hq-beans.xml
$ vi hq-beans.xml
Comment out the uriUniqFilter, hqclient, and hqpersist beans:
<!--
<bean id="uriUniqFilter"
class="org.archive.modules.hq.OnlineCrawlMapper">
<property name="nodeNo" value="0"/>
<property name="totalNodes" value="4"/>
<property name="discoveredBatchSize" value="500"/>
<property name="feedBatchSize" value="80"/>
<property name="localUniqFilter">
<bean class="org.archive.crawler.util.BloomUriUniqFilter">
<property name="bloomFilter">
<bean class="org.archive.util.BloomFilter64bit">
<constructor-arg value="250000000"/>
<constructor-arg value="30"/>
</bean>
</property>
</bean>
</property>
</bean>
-->
<!-- headquarter client adapter used by HttpPersistProcessor and OnlineCrawlMapper -->
<!--
<bean id="hqclient" class="org.archive.modules.hq.HttpHeadquarterAdapter">
</bean>
-->
<!-- finished submitter to be included in disposition chain -->
<!--
<bean id="hqpersist" class="org.archive.modules.hq.HttpPersistProcessor">
</bean>
-->
If all goes well, this will work:
$ ./startup.sh
A web dashboard will now be accessible at port 9443 on the same host
(get the auth name and password from ps -ef | grep java
):
{height="250"}
Click on the test_config link, then the blue build button at
the top. It should say Job Is Ready. Then click the blue launch
button. It should display a sequence of status strings after Job Is
Active: NASCENT
, then PREPARING
, then PAUSED
. Click the blue
unpause
button to (finally!) start the crawl.
The pause function depends on configuration settings that default to
True
:
<!-- <property name="pauseAtStart" value="true" /> -->
<!-- <property name="pauseAtFinish" value="true" /> -->
Refresh your browser to update the dashboard:
{height="250"}
This shows how many URLs have been processed and how many are pending, URL processing rates, thread status, and other data.
WARC files are written by default to the warcs
subdirectory. Each WARC
is stuffed with about a gigabyte of data before another is created:
$ ls -l warcs
total 41005888
-rw-r--r-- 1 blub users 1161204614 Jun 9 21:37 MCLOUD-20160609210332357-00000-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1064202832 Jun 9 21:56 MCLOUD-20160609210332359-00001-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1520977725 Jun 9 21:52 MCLOUD-20160609210332360-00002-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1011777223 Jun 9 21:24 MCLOUD-20160609210332361-00003-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000574144 Jun 9 21:52 MCLOUD-20160609210332362-00004-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000005845 Jun 9 21:51 MCLOUD-20160609210332363-00005-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000015034 Jun 9 22:15 MCLOUD-20160609212455312-00006-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000020459 Jun 9 22:36 MCLOUD-20160609213753231-00007-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1399930055 Jun 9 22:46 MCLOUD-20160609215107987-00008-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1292349285 Jun 9 23:37 MCLOUD-20160609215207090-00009-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000039653 Jun 9 23:32 MCLOUD-20160609215226776-00010-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000014926 Jun 9 22:29 MCLOUD-20160609215649483-00011-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000021659 Jun 10 01:31 MCLOUD-20160609221555291-00012-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000055761 Jun 10 02:23 MCLOUD-20160609222931584-00013-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000005175 Jun 10 02:18 MCLOUD-20160609223645636-00014-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 2119870356 Jun 10 00:56 MCLOUD-20160609224628152-00015-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1462329740 Jun 10 03:29 MCLOUD-20160609233227602-00016-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1551420349 Jun 10 01:45 MCLOUD-20160609233722737-00017-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000022245 Jun 10 05:31 MCLOUD-20160610005626706-00018-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1704015803 Jun 10 05:38 MCLOUD-20160610013124232-00019-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1511938554 Jun 10 02:58 MCLOUD-20160610014557756-00020-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1364297055 Jun 10 08:02 MCLOUD-20160610021836348-00021-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1013434840 Jun 10 08:21 MCLOUD-20160610022319581-00022-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1815742049 Jun 10 06:20 MCLOUD-20160610025853439-00023-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000058523 Jun 10 05:40 MCLOUD-20160610032921930-00024-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1096104506 Jun 10 10:18 MCLOUD-20160610053126093-00025-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1325952334 Jun 10 09:10 MCLOUD-20160610053804044-00026-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1004554334 Jun 10 15:06 MCLOUD-20160610054041702-00027-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1006686502 Jun 10 07:13 MCLOUD-20160610062044379-00028-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1262064822 Jun 10 09:47 MCLOUD-20160610071347444-00029-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1346247746 Jun 10 12:31 MCLOUD-20160610080214994-00030-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 1000015074 Jun 10 18:00 MCLOUD-20160610082131840-00031-5421~blub-dev.us.archive.org~9443.warc.gz
-rw-r--r-- 1 blub users 936592729 Jun 10 18:11 MCLOUD-20160610091017844-00032-5421~blub-dev.us.archive.org~9443.warc.gz.open
-rw-r--r-- 1 blub users 832670771 Jun 10 18:11 MCLOUD-20160610094801606-00033-5421~blub-dev.us.archive.org~9443.warc.gz.open
-rw-r--r-- 1 blub users 670126053 Jun 10 18:11 MCLOUD-20160610101843380-00034-5421~blub-dev.us.archive.org~9443.warc.gz.open
-rw-r--r-- 1 blub users 337385272 Jun 10 18:11 MCLOUD-20160610123108719-00035-5421~blub-dev.us.archive.org~9443.warc.gz.open
-rw-r--r-- 1 blub users 162329199 Jun 10 18:11 MCLOUD-20160610150641246-00036-5421~blub-dev.us.archive.org~9443.warc.gz.open
-rw-r--r-- 1 blub users 14747939 Jun 10 18:11 MCLOUD-20160610180030443-00037-5421~blub-dev.us.archive.org~9443.warc.gz.open
First, you need a collection to house your WARCs. If you need a new one,
ask someone (currently Jeff or Alexis) to make one for you. Then edit
dtmon.cfg
with values for (at least) crawljob
, job_dir
,
xfer_dir
, description
, operator
, and collections
:
# dtmon.cfg - draintasker "simple" config file
#
# * params must begin in col1, so preceding by anything
# effectively causes the parameter to be ignored
# * this file is read everytime it is needed, so your
# changes effect existing processes
# * please be verbose in your description_effort
# required keywords: CRAWLHOST, CRAWLJOB, START_DATE, END_DATE
#
# siznax 2010
crawljob: blub_test
job_dir: warcs
xfer_dir: sink
sleep_time: 300
max_size: 10
# WARC_naming 1 # {TLA}-{timestamp}-{serial}-{fqdn}.warc.gz
WARC_naming: 2 # {TLA}-{timestamp}-{serial}-{PID}~{fqdn}~{port}.warc.gz
block_delay: 120 # blocking
max_block_count: 120 # total blocking time = delay x count
retry_delay: 2400 # non-blocking
# ALL CAPS strings in the description below are substituted:
description: Internet Archive crawldata from Media Cloud, captured by CRAWLHOST:CRAWLJOB from START_DATE to END_DATE.
# Operator's email
operator: [email protected]
# These collections need to be precreated:
collections: # may have multiple, each on its own line
- mediacloud
title_prefix: Webwide Crawldata
creator: Internet Archive
sponsor: Internet Archive
contributor: Internet Archive
scanningcenter: sanfrancisco
derive: 1
compact_names: 1# Same as the parent directory:
Now run draintasker
to get your output WARCs into petabox collections.
There is some confusing documentation on draintasker:
Draintasker
Draintasker
Definitions
Draintasker
Use
CasesDraintasker
ConfigDraintasker
ExecutionDraintasker
OperationsDraintasker
ProcessingDraintasker
Output
Some of those docs plus trial and error revealed the following steps.
First, create a ~/.ias3cfg
file. Get the values of access_key
and
secret_key
from your ~/.config/internetarchve.yml
file. (Don't use
quotes around the key string values).
access_key = ...
secret_key = ...
Next:
$ cd /1/crawling/blub_test/site
$ touch warcs/DRAINME
$ ../draintasker/dtmon.py -L dtmon.log dtmon.cfg &
Files are taken from the warc
directory and processed into the sink
directory.
Input Files
blub_test.cxml
-- heritrix crawl config file
dtmod.cfg
– draintasker config file
hq-beans.xml
– heritrix ?
negative-surts.txt
– surts to avoid crawling, for one reason or
another
seeds.txt
– URL seed file
Output Files
negative-surts.dump
--
reports/frontier-summary-report.txt
--
warcs/*.warc.gz.open
--
warcs/*warc.gz
--
Log Files
job.log
--
logs/alerts.log
--
logs/crawl.log
--
logs/nonfatal-errors.log
--
logs/progress-statistics.log
--
logs/runtime-errors.log
--
logs/uri-errors.log
--
Screen Shot
2016-06-10 at 10.57.46 AM.png
(image/png)
Screen Shot
2016-06-10 at 11.08.21 AM.png
(image/png)
Structured Guides:
User Guide
- Introduction
- New Features in 3.0 and 3.1
- Your First Crawl
- Checkpointing
- Main Console Page
- Profiles
- Heritrix Output
- Common Heritrix Use Cases
- Jobs
- Configuring Jobs and Profiles
- Processing Chains
- Credentials
- Creating Jobs and Profiles
- Outside the User Interface
- A Quick Guide to Creating a Profile
- Job Page
- Frontier
- Spring Framework
- Multiple Machine Crawling
- Heritrix3 on Mac OS X
- Heritrix3 on Windows
- Responsible Crawling
- Politeness parameters
- BeanShell Script For Downloading Video
- crawl manifest
- JVM Options
- Frontier queue budgets
- BeanShell User Notes
- Facebook and Twitter Scroll-down
- Deduping (Duplication Reduction)
- Force speculative embed URIs into single queue.
- Heritrix3 Useful Scripts
- How-To Feed URLs in bulk to a crawler
- MatchesListRegexDecideRule vs NotMatchesListRegexDecideRule
- WARC (Web ARChive)
- When taking a snapshot Heritrix renames crawl.log
- YouTube
- H3 Dev Notes for Crawl Operators
- Development Notes
- Spring Crawl Configuration
- Potential Cleanup-Refactorings
- Future Directions Brainstorming
- Documentation Wishlist
- Web Spam Detection for Heritrix
- Style Guide
- HOWTO Ship a Heritrix Release
- Heritrix in Eclipse