Description
I'm using Python 3.10.4 and warcio 1.7.4.
Using a piece of code based on https://github.com/webrecorder/warcio#writing-warc-records, I'm getting the following error:
for record in ArchiveIterator(writer.get_stream()):
AttributeError: 'WARCWriter' object has no attribute 'get_stream'. Did you mean: '_iter_stream'?
import os.path
import hashlib

from warcio.capture_http import capture_http
from warcio.archiveiterator import ArchiveIterator
import requests  # https://github.com/webrecorder/warcio#writing-warc-records
from bs4 import BeautifulSoup

# https://gist.github.com/edsu/62bc39890806ffd19b597186a3619419

OUTPUT_PATH = 'output/'


def cache_and_return_bs(url):
    if url_already_retrieved(url):
        raise Exception(url + ' already there')
    with capture_http(get_output_filename(url), warc_version='1.1') as writer:
        # TODO: do we want to try to append to a single file?
        requests.get(url)
        for record in ArchiveIterator(writer.get_stream()):
            if record.rec_type == 'response':
                return BeautifulSoup(record.raw_stream)


def get_output_filename(url):
    return OUTPUT_PATH + hashlib.sha256(url.encode()).hexdigest()


def url_already_retrieved(url):
    return os.path.isfile(get_output_filename(url))


if __name__ == '__main__':
    print(cache_and_return_bs('https://example.org'))
I narrowed this down to passing a filename to capture_http: if I don't pass one, the writer object has the get_stream method.