Come up with a better export process #6

@pm5

Description

Currently we connect to parserdb through SSH tunnels and run export.py on our local machines to export the dataset. This is usable to some extent; exporting the whole dataset usually takes less than 30 minutes.

We can improve this with the following scheme:

  • Export newly scraped contents on a daily basis. We export the datasets in CSV and JSONLines formats, both of which are easy to stream and merge (see the sketch after this list).
  • Send the exports to S3-like storage, or just get more Linodes to serve these files publicly. If I recall correctly, the last export had ~160k entries and was about 450 MB; that's roughly 2 months of data, so I think we can safely assume a year's worth of exports will be less than 10 GB.
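
A minimal sketch of what such a daily job could look like. This is only an illustration under assumptions I'm making up here: it treats parserdb as PostgreSQL, the `entries` table and `scraped_at` column are placeholders for the real schema, and the bucket name and endpoint URL are hypothetical. It would run next to the database (e.g. from cron) instead of over an SSH tunnel:

```python
#!/usr/bin/env python3
"""Daily incremental export sketch.

Assumptions (placeholders, not the real setup): parserdb is PostgreSQL, the
table is `entries` with a `scraped_at` timestamp column, and the S3-compatible
endpoint/bucket names below are hypothetical.
"""
import csv
import datetime
import json

import boto3     # works with any S3-compatible store (e.g. Linode Object Storage)
import psycopg2  # assumes parserdb is PostgreSQL; swap for the real driver

S3_ENDPOINT = "https://us-east-1.linodeobjects.com"  # hypothetical endpoint
S3_BUCKET = "parserdb-exports"                       # hypothetical bucket


def rows_since(conn, since):
    """Yield newly scraped rows as dicts (table/column names are placeholders)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, url, title, content, scraped_at "
            "FROM entries WHERE scraped_at >= %s ORDER BY scraped_at",
            (since,),
        )
        cols = [d[0] for d in cur.description]
        for row in cur:
            yield dict(zip(cols, row))


def main():
    today = datetime.date.today()
    since = today - datetime.timedelta(days=1)
    jsonl_path = f"export-{today.isoformat()}.jsonl"
    csv_path = f"export-{today.isoformat()}.csv"

    # Run next to the database (e.g. from cron on the DB host), no SSH tunnel.
    conn = psycopg2.connect("dbname=parserdb")
    rows = list(rows_since(conn, since))
    conn.close()

    # JSONLines: one JSON object per line, trivially concatenated across days.
    written = [jsonl_path]
    with open(jsonl_path, "w", encoding="utf-8") as jf:
        for row in rows:
            jf.write(json.dumps(row, ensure_ascii=False, default=str) + "\n")

    # CSV needs at least one row to know its header.
    if rows:
        with open(csv_path, "w", newline="", encoding="utf-8") as cf:
            writer = csv.DictWriter(cf, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(csv_path)

    # Upload the daily slices to the S3-like bucket for public serving.
    s3 = boto3.client("s3", endpoint_url=S3_ENDPOINT)
    for path in written:
        s3.upload_file(path, S3_BUCKET, f"daily/{path}")


if __name__ == "__main__":
    main()
```

Consumers could then rebuild the full dataset by concatenating the daily JSONLines files (or appending the CSVs minus their headers), instead of re-exporting everything through a tunnel.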
