sample: add time-series sampling options #2589

Open
jqnatividad opened this issue Mar 9, 2025 · 0 comments
Labels
datapusher+ for Datapusher+ DRUF for Data Resource Upload First workflow enhancement New feature or request. Once marked with this label, its in the backlog. timeseries time series related

Comments

@jqnatividad
Collaborator

A lot of the data hosted in data catalogs is time-series data.

Since we're focusing on just compiling high-quality, high-resolution metadata in the data catalog, we compile summary stats and frequency tables using the "complete" dataset, but only want to host a representative sample in the catalog while pointing to the source where the "complete" dataset is available.

We don't want the catalog to double as a central datastore with its attendant high-capacity requirements, so we only store a sample preview.

However, if we just take the first N rows of a time-series dataset, the sample will most likely always be the same rows, since time-series datasets are often sorted.

Add several time-series sampling options to make the sample more dynamic:

  • time-based systematic sampling - ability to specify sampling windows/intervals (hourly, daily, weekly, etc.)
  • enhanced starting point - ability to start from the end (most recent observations)
  • adaptive sampling/seasonal awareness - e.g. sample more frequently during business days/hours, weekends, seasons, or months
  • aggregation options - aggregate data within each interval to downsample high-frequency observations
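
To make the first, second, and fourth options concrete, here is a minimal Python sketch of the intended behavior. It is only an illustration, not a proposed implementation for `sample`: the function names `sample_by_interval` and `mean_by_interval`, the `from_end` parameter, and the `(timestamp, value)` row shape are all hypothetical, and the rows are assumed to be sorted ascending by timestamp.

```python
from datetime import datetime, timedelta


def sample_by_interval(rows, interval, from_end=False):
    """Keep one row per time window of length `interval` (hypothetical helper).

    rows: list of (timestamp, value) tuples, assumed sorted ascending.
    from_end: if True, anchor the windows at the most recent observation
              and return the rows newest-first.
    """
    if not rows:
        return []
    ordered = list(reversed(rows)) if from_end else list(rows)
    anchor = ordered[0][0]
    picked = {}
    for ts, value in ordered:
        # Number of whole intervals between this row and the anchor.
        bucket = abs(ts - anchor) // interval
        # Keep only the first row seen in each window.
        picked.setdefault(bucket, (ts, value))
    return list(picked.values())


def mean_by_interval(rows, interval):
    """Downsample by averaging the values in each window (hypothetical helper)."""
    if not rows:
        return []
    anchor = rows[0][0]
    sums = {}
    for ts, value in rows:
        bucket = (ts - anchor) // interval
        total, count = sums.get(bucket, (0.0, 0))
        sums[bucket] = (total + value, count + 1)
    # Label each window with its start time.
    return [
        (anchor + bucket * interval, total / count)
        for bucket, (total, count) in sorted(sums.items())
    ]


if __name__ == "__main__":
    start = datetime(2025, 3, 9)
    # Twelve observations, one every 15 minutes.
    rows = [(start + timedelta(minutes=15 * i), float(i)) for i in range(12)]
    hour = timedelta(hours=1)
    print(sample_by_interval(rows, hour))
    print(sample_by_interval(rows, hour, from_end=True))
    print(mean_by_interval(rows, hour))
```

With hourly windows, the default call keeps the first observation of each hour, `from_end=True` keeps the latest observation of each hour counted back from the newest row, and `mean_by_interval` collapses each hour into a single averaged row. Adaptive/seasonal sampling is left out of the sketch because it would need a policy for weighting windows, which this issue does not specify.
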
@jqnatividad jqnatividad added enhancement New feature or request. Once marked with this label, its in the backlog. datapusher+ for Datapusher+ DRUF for Data Resource Upload First workflow timeseries time series related labels Mar 9, 2025
@jqnatividad jqnatividad changed the title sample: add time-series sampling handling options sample: add time-series sampling options Mar 10, 2025