@mmisiewicz

Hello! I've been using ArchiveSpark with the CommonCrawl files stored on S3. I found a few items that needed small fixes, so I thought I'd send in a PR. I wouldn't say this is 100% ready to
merge, since I'm not sure whether there are any automated tests to run, but I have been using the code with
these modifications for a few weeks without issues.

Commit message follows:
This change makes a few modifications to the HDFS utils. Importantly,
the `FileSystem` objects from the Hadoop libraries are now retrieved from
the URI of the files. This allows accessing CommonCrawl WARC files
on filesystems other than the one currently configured in the HadoopConf.

Additionally, there is a small fix for the occasionally corrupted WARC records
encountered in the output from CommonCrawl.
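
To illustrate the idea behind the first change, here is a minimal sketch of choosing a filesystem by a path's own URI scheme and falling back to the configured default only when the path has none. It uses only `java.net.URI` and is not the code from this PR. The class `FsSchemeSketch`, the helper `schemeFor`, the bucket name, and the namenode address are all hypothetical and exist only for illustration.

```java
import java.net.URI;

public class FsSchemeSketch {

    // Hypothetical helper: returns the URI scheme that decides which
    // filesystem should serve `path`. A path without a scheme falls back
    // to the scheme of the configured default filesystem.
    static String schemeFor(String path, String defaultFs) {
        String scheme = URI.create(path).getScheme();
        if (scheme != null) {
            return scheme;
        }
        return URI.create(defaultFs).getScheme();
    }

    public static void main(String[] args) {
        // Assumed default filesystem, standing in for the cluster configuration.
        String defaultFs = "hdfs://namenode:8020";

        // A fully qualified path carries its own scheme.
        System.out.println(schemeFor("s3a://example-bucket/segment/file.warc.gz", defaultFs));

        // A bare path has no scheme, so the default applies.
        System.out.println(schemeFor("/data/segment/file.warc.gz", defaultFs));
    }
}
```

With a lookup that always uses the default, the first path would be resolved against HDFS and fail. Resolving per path is what lets a single job read WARC files from S3 while the cluster default stays on HDFS.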