StormCrawler can filter web pages and write them to WARC archives, as follows:
WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt().withFileNameFormat(fileNameFormat);
TopologyBuilder builder = new TopologyBuilder();
builder.setBolt("warc", warcbolt, numWorkers)
    .localOrShuffleGrouping("parse", WarcStreamName)
    .localOrShuffleGrouping("tika", WarcStreamName);
Would it be possible to create a CDX index (or JCDX index) for the WARC archives at the same time?