Documentation: Clarify that capture_http writer with filename has no get_stream method #143

@voltagex

I'm using Python 3.10.4 and warcio 1.7.4.

Using a piece of code based on https://github.com/webrecorder/warcio#writing-warc-records, I get the following error:

    for record in ArchiveIterator(writer.get_stream()):
AttributeError: 'WARCWriter' object has no attribute 'get_stream'. Did you mean: '_iter_stream'?

import os.path
import hashlib

from warcio.capture_http import capture_http
from warcio.archiveiterator import ArchiveIterator
import requests #https://github.com/webrecorder/warcio#writing-warc-records
from bs4 import BeautifulSoup

#https://gist.github.com/edsu/62bc39890806ffd19b597186a3619419

OUTPUT_PATH = 'output/'

def cache_and_return_bs(url):
    if url_already_retrieved(url):
        raise Exception(url + ' already there')
    with capture_http(get_output_filename(url),warc_version='1.1') as writer:
        #TODO: do we want to try to append to a single file?
        requests.get(url)
        for record in ArchiveIterator(writer.get_stream()):
            if record.rec_type == 'response':
                return BeautifulSoup(record.raw_stream)

def get_output_filename(url):
    return OUTPUT_PATH + hashlib.sha256(url.encode()).hexdigest()

def url_already_retrieved(url):
    return os.path.isfile(get_output_filename(url))


if __name__ == '__main__':
    print(cache_and_return_bs('https://example.org'))

I narrowed this down to passing a filename to capture_http: if I don't pass one, the writer object has a get_stream method.
