Open
Description
Currently we connect to parserdb through SSH tunnels and run export.py on our local machines to export the dataset. This is usable to some extent; exporting the whole dataset usually takes less than 30 minutes.
We can improve this with the following scheme:
- Export newly scraped content on a daily basis. We export datasets in CSV and JSON Lines formats, which are easy to stream and merge.
- We can send the exports to S3-like storage, or just get more Linodes to serve these files publicly. If I recall correctly, the last export had ~160k entries and took up about 450 MB. That's about 2 months of data, or roughly 2.7 GB per year at the same rate, so I think we can safely assume that a year's worth of exports will be less than 10 GB.
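To make the daily-export idea concrete, here is a rough sketch of what one day's export and a later merge could look like. It is not how export.py works today: the `scraped_at` field, the `export_day` and `merge_jsonl` helpers, and the `<date>.csv` / `<date>.jsonl` file naming are all assumptions made for illustration, and the rows are assumed to be already fetched from parserdb as plain dicts.

```python
import csv
import json
from datetime import date, datetime
from pathlib import Path


def export_day(rows, day, out_dir):
    """Write the rows scraped on `day` to <day>.csv and <day>.jsonl in out_dir.

    `rows` is an iterable of dicts; each dict is assumed (hypothetically) to
    carry an ISO 8601 `scraped_at` timestamp.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Keep only the rows whose timestamp falls on the requested day.
    selected = [
        r for r in rows
        if datetime.fromisoformat(r["scraped_at"]).date() == day
    ]

    csv_path = out_dir / f"{day.isoformat()}.csv"
    jsonl_path = out_dir / f"{day.isoformat()}.jsonl"

    # Use the union of keys so the CSV header covers every selected row.
    fieldnames = sorted({key for r in selected for key in r})
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(selected)

    # One JSON object per line, so files can be streamed and concatenated.
    with jsonl_path.open("w", encoding="utf-8") as f:
        for r in selected:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

    return csv_path, jsonl_path


def merge_jsonl(paths, merged_path):
    """Concatenate daily JSON Lines files in filename order, without parsing."""
    with open(merged_path, "w", encoding="utf-8") as out:
        for p in sorted(paths):
            with open(p, encoding="utf-8") as f:
                for line in f:
                    out.write(line)
```

Because each JSON Lines file is just one record per line, merging a year of daily files is plain concatenation. The CSV files would need their header rows skipped when merged, which is one reason to publish both formats.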