Since requests.get loads the whole response into RAM, capture_http is inefficient for large files:
from warcio.capture_http import capture_http
import requests  # must be imported after capture_http so its patching applies

# this uses 10GB of RAM:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file')
Simply calling requests.get with stream=True doesn't archive the file; only the headers get written:
# this doesn't work:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file', stream=True)
Fetching and throwing away the data does work:
with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    for _ in response.iter_content(chunk_size=2**16):
        pass
I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?
Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.
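For what it's worth, a docs example could be something like the sketch below; the archive_url helper and the 64 KB chunk size are just illustrative choices, not part of warcio's API.

from warcio.capture_http import capture_http
import requests  # must be imported after capture_http so its patching applies


def archive_url(url, warc_path, chunk_size=2**16):
    """Fetch url inside capture_http, consuming the streamed body so the
    full response body (not just the headers) ends up in warc_path."""
    with capture_http(warc_path):
        response = requests.get(url, stream=True)
        # iterate purely for the side effect of reading (and discarding) the body
        for _ in response.iter_content(chunk_size=chunk_size):
            pass
        return response.status_code


archive_url('https://example.com/#some_10GB_file', 'example.warc.gz')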