
Document memory-efficient use of capture_http #187

@jcushman

Description

Since requests.get loads the whole response into RAM, capture_http is inefficient for large files:

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

# this uses 10GB of RAM:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file')

Simply calling requests.get with stream=True doesn't archive the file; only the headers are written:

# this doesn't work:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file', stream=True)
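
One way to confirm that only the headers were recorded is to read the output back with warcio's ArchiveIterator (a quick sketch; the filename matches the example above):

from warcio.archiveiterator import ArchiveIterator

# list the records written by the stream=True example above;
# per the behavior described here, the response body is missing
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        print(record.rec_type,
              record.rec_headers.get_header('WARC-Target-URI'),
              record.rec_headers.get_header('Content-Length'))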

Fetching and throwing away the data does work:

with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    # consume the stream in fixed-size chunks and discard it
    for _ in response.iter_content(chunk_size=2**16):
        pass

I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?

Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.
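
For the docs, that consume-and-discard pattern could be wrapped in a small helper, sketched below (archive_url is a hypothetical name, not part of warcio):

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

def archive_url(url, warc_path, chunk_size=2**16):
    # hypothetical helper: stream the response and discard the body so
    # capture_http can record the full payload without buffering it in RAM
    with capture_http(warc_path):
        response = requests.get(url, stream=True)
        for _ in response.iter_content(chunk_size=chunk_size):
            pass

archive_url('https://example.com/#some_10GB_file', 'example.warc.gz')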
