[RFC/PROPOSAL]: Expanding Create-Workload with Synthetic Data Generation #759

IanHoang · 2025-02-25T20:45:30Z

Summary

Create-workload, a feature in OpenSearch Benchmark (OSB), currently extracts data from existing OpenSearch clusters and generates simplistic custom workloads. This proposal is intended to enhance create-workload to provide users flexibility and control on how data corpora are generated in custom workloads. The team plans to add more options for data extraction and also provide a mechanism for users to synthetically generate data based on index mappings. This will enable users to build larger custom representative workloads to benchmark their use-cases.

To avoid overcrowding this RFC, a separate RFC will focus on adding more data extraction methods while this one focuses on adding synthetic data generation to OSB.

Motivation and Stakeholders
See this RFC's What are we doing? section for motivation and stakeholders

Background on Create-Workload
See this RFC's Current Design of Create-Workload section for background context.

Problem Statement
OpenSearch users and developers are experiencing interconnected pain points when creating custom workloads:

Representation: OpenSearch users and developers face challenges in building large workloads because the create-workload feature requires pre-existing large volumes of data. Also, many times users want smaller representative workloads.
Scalability: Even if users have pre-existing large volumes of data that can be used in a custom workload, extracting that data is arduous because of how create-workload only has a single extraction method that is prone to failure. Users are also limited by the size of the workloads and are unable to create workloads on the order of terabytes. This is because there is a lack of tools to be able to generate such large workloads synthetically.
Privacy: Many OpenSearch users are hesitant in using actual production data in custom workloads because it contains sensitive or proprietary information.

All these pain points contribute to a central issue, which is that users are unable to build realistic custom workloads that accurately model production metric patterns because they lack the tools and guidance to do so.

Proposal

To resolve the problems above, the OSB team proposes addressing the data aspect of creating custom workloads by incorporating a synthetic data generator into OSB. Additionally, the team plans to add new methods of extracting existing data from OpenSearch clusters in create-workload.

These new features would give users more control on how data corpora is obtained when building custom workloads.

A synthetic data generator in OSB will allow users to generate production-like data without compromising sensitive information. It will also enable users to create large-scale workloads without relying on pre-existing cluster data.
More extraction methods in create-workload would make the process of extracting data more efficient and less cumbersome.

Synthetic Data Generator will be added as a new module in OSB and its workflow can be invoked independently, work with create-workload, and integrate with future OSB initiatives (such as anonymization and streaming with real-time data generation).

The additional extraction methods will added after synthetic data generation’s implementation and will extend from create-workload’s existing architecture. These extraction methods are explored in a separate RFC.

With these new features, users will be empowered with tools and guidance necessary to build custom scalable workloads that model production environments, which are needed to improve benchmarking capabilities and gain insight into cluster reliability.

User Stories

As an OpenSearch user and developer, I want to generate data based on index mappings so that I do not need to use actual data from my production environment when building custom workloads
As an Managed Service Operator, I want to be able to generate data corpora for my custom workload based off of index mappings I provide so that I can reproduce production issues.
As an OpenSearch developer, I would like to generate large-scale workloads with data corpora on the order of terabytes so that I can run benchmarks at scale
As an OpenSearch user, I would like to create custom workloads tailored to my use-case so I can answer performance questions related to my cluster’s configurations and catch regressions before they happen.

Assumptions

Synthetic Data Generation will work with existing create-workload logic but can also be used separately from create-workload
Synthetic Data Generation will produce data corpora structured similarly to those in pre-packaged workloads. This means that the workload’s data should be configured as a list of JSON documents.
Synthetic Data Generation should afford users some modicum of control on how specific fields are populated.

Synthetic Data Generation Module High-Level Design

Figure 1: Closer look at new Synthetic Data Generation module.

Users must provide an index mapping or a template document, or both, to Synthetic Data Generator (SDG). Additional optional inputs users can provide is a cluster profile (outputted by a new API) and custom data generators. These inputs will be parsed and validated and will be relayed to the SDG which will fetch the appropriate Data Generators. SDG will provision a number of worker processes based on the number of cores in the LG host and these processes will be responsible using the collected data generators to generate documents.

By default, generated documents will be placed in a queue and a single thread will be responsible for writing these documents to the output file. A single thread writer can help prevent race conditions and conflicts when writing documents. To avoid memory bottlenecks and if there's preference for speed over order of documents, users also will have the option to have a writer for each worker process. During this entire process, the user is informed through the CLI on how the document generation process is progressing. SDG is periodically updating the checkpoint file with information so that if the generation process ever gets interrupted, users can restart where they left off.

Components in this module: Components from the above diagram are further described in the following bullets.

Related to User Input:
- Index Mappings and / or Template Document: Users must provide either an index mappings or template document. Index mappings would inform SDG which fields correspond to which mapping field types while the template document allows users to label which specific data generators should be used for which fields. The latter offers more granular control for how data should be generated. Having both index mappings and template document enables the program to gain a more comprehensive understanding of how data should be generated.
- Workload Profile: Output from an OpenSearch or OSB API that captures stats and characteristics from a cluster’s workload. This would be similar to how a query is profiled in order to help the user understand query performance and understand how a specific query is executed. This tool could sample a number of documents and gives information that should guide SDG on how to build the data corpora or give guidance on how users can replicate the workload. This output could include index shard sizes and distribution of values seen across index fields.
- Mapping Parser and Template Parser: These parsers will parse index mappings and the template document. It will send the orchestrator necessary information it needs to assemble the right components to generate the documents.
Related to Data Generation:
- Synthetic Data Generator (SDG): Acts as the orchestrator since it takes in the user’s inputs, manages the state of data generation. It instantiates other components — such as data generators that are related to the supplied index mappings and/or template document, worker processes based on the LG host’s cores, FileWriter, ProgressMonitor, and queue.
- Data Generators: Core building blocks that generate random data for basic mapping field types in OpenSearch or for common OpenSearch use-cases. OSB will come with a wide range of pre-packaged data generators. However, users can supply their own defined data generators if none of the pre-packaged data generators fit their use-case.
- FileChunkWriter: By default, we can use a single file writer (using a single thread). A single file writer will be responsible for taking documents from the queue and writing them to the output file. Once a certain file size is reached, the SDG can compress the existing file with OSB’s compression module while the FileWriter writes to a new file. This is to ensure that data is easily downloadable and meets the download limits of services like S3 and CloudFront, both of which OSB uses for storing and downloading its pre-packaged workloads’ corpora. Since we are using a single thread to collect documents from a queue and write documents, we are avoiding write conflicts and preserving order of documents written. However, if users do not care about order of documents, SDG can provision a writer for each worker process that writes to their own respective files. These files can be post-processed and later combined.
- Progress Monitor: Informs users of the progress and rate at which data is being generated
- Checkpoint File: This file will be periodically updated in case the process of generating or writing data ever fails, users can resume where they left off. It keeps track of the number of docs that have been created, the size of the corpora generated thus far, and original target number of docs or target size of data corpora. The checkpoint file will be updated whenever the writer has successfully written a batch of documents to the output file.
- Output: Synthetic data generated documents will be written to JSON file(s), which is the standard way OSB has its data corpora structured in pre-packaged workloads. These files are chunks or “parts” of corpora, and when combined, can represent a single index.

For an example walkthrough, see Walkthrough for Synthetic Data Generation Process in the appendix.

For example user inputs, see Example Synthetic Data Generation User Inputs in the appendix.

Questions:

Multiprocessing should allow us to effectively generate random data. However, some users have documents containing fields like dates and timestamps and require documents be written in a specific order. How can we ensure that this happens with multiple processes?
- With this modular approach, we can offer users with three different choices on how data is written to the disk.
  - Default Queue: All workers write to one queue and one writer thread is responsible for taking documents from the queue and writing it to disk. We can recommend this approach to users who are generating small to medium data corpora.
  - For users who need to preserve the order of docs with specific fields like timestamps or dates, there will be an option to swap the default queue with a priority queue. Again, this is suitable for small to medium data corpora. For larger data corpora, users will need to understand that there are limitations in speed and can have impacts on memory. Might need to look into disk-based queue.
  - In cases where order does not matter and users prefer speed when generating large data corpora, users have the option to remove the queue and have each worker use their own writer to write to separate files. Post processing can be done to combine chunks of files. For large data corpora, we should use this approach.
- This implementation allows for generating large volumes of data efficiently, with the flexibility to choose between different strategies based on the specific requirements (order vs. speed).
Is this scalable beyond a single machine?
- We will eventually try to support multiple LG hosts when generating data. This can be done with a library like Dask. However, we will need to restrict the users to only use this if document order does not matter
- To make it simple, checkpointing files will be made for each machine

Pros and Cons with this Approach

Pros:

Scalability: handles both small (10GB) and large-scale (10+TB) data tasks as well as various use-cases
Reliability: checkpoint system allows users to resume where they left off if process ever gets interrupted
Flexibility: Users have diverse choices and control of how randomized data is generated. They also canchange the way the data is written to improve performance
Simplicity: Quick setup and results
Maintainability: modular design allows for easy maintenance, testing, and can evolve over time.
Lower learning curve and can be easily integrated into OSB and other features
Can leverage parallel-computing libraries like Dask that are well-suited for general-purpose distributed computing task. This will also allow us to go beyond generating on a single machine to speed up generation of data. This has been explored in a POC where Dask outperformed the other libraries at generating Big5 documents.

Cons

Limited data transformation capabilities that data pipelines have
Not suitable for complex workflows or multi-step processes (that data pipelines usually are used for).

Appendix

Walkthrough for Synthetic Data Generation Process

Here are two example walkthroughs. The first is a walkthrough of when a user provides only an index mapping to synthetic data generator and the second is a walkthrough of when a user provides both an index mapping and a template document.

When users only provide index mapping:
Let’s say a user provides the following index mapping to Synthetic Data Generator

# Index Mappings Example
{
  "mappings": {
    "properties": {
      "dropoff_datetime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "trip_type": {
        "type": "keyword"
      },
      "passenger_count": {
        "type": "integer"
      },
      "pickup_datetime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "tolls_amount": {
        "scaling_factor": 100,
        "type": "scaled_float"
      },
      "tip_amount": {
        "type": "half_float"
      },
      "extra": {
        "scaling_factor": 100,
        "type": "scaled_float"
      },
      },
      "dropoff_location": {
        "type": "geo_point"
      },
      "vendor_name": {
        "type": "text"
      },
      "total_amount": {
        "scaling_factor": 100,
        "type": "scaled_float"
      },
      "currency": {
        "type": "text"
      }
    },
    "dynamic": "strict"
  }
}

Synthetic Data Generator will analyze which fields have which mapping field types. Synthetic Data Generator will use this information to grab basic data generators such as StringGenerator, NumericGenerator, DateGenerator, etc. These basic data generators will be based off of thesupported field typeslisted in the OpenSearch documentation. These Data Generators also come with specific parameters which allow users to customize the randomness. For example, if the user wants to limit the DateGenerator to a specific range of dates to randomly pick from, they can provide that limitation as a parameter.

Each worker process will use the collected data generators to populate the document structure. These generated documents will be added to a queue, where a single a thread will write them to the output files. If the output file reaches a certain size, the writer thread will stop adding documents to that file and move to a new output file. This is to ensure that the output file(s) can be compressed to a certain size and meet download standards of services like CloudFront and S3 (which is where OSB’s pre-packaged workloads’ data corpora are currently stored).

When users provide index mapping and template document:

Building off the previous example, say a user provides the same index mapping and a template document that includes a blurb that specifies that currency field must use a specific data generator, called CURRENCY, that comes with OSB:

# Template Document Example
# In addition to the basic data generators that are for[ supported field types ](https://opensearch.org/docs/latest/field-types/supported-field-types/index/)in OpenSearch, 
# OSB comes with specific data generators that are for common use-cases. Currency is one of those data generators.
      "currency": {{ CURRENCY('USD') }}

Synthetic Data Generator will parse both index mappings and search for any specific data generators and their parameters in the template document. It will validate that the return type of requested data generator matches that of the index mappings. If the data generator’s return type matches that of the index mappings, it will grab that specific data generator for that field. In this example, it detects that the user specified CURRENCY data generator. Synthetic Data Generator confirms that the return type of the CURRENCY data generator is a string, which matches the mapping field type currency field in the index mappings. For the rest of the fields that do not include a specific data generator specified in {{}}, SDG will automatically use basic data generators based on the field types. The rest of the process is identical to the previous section of When users only provide index mappings.

Example Synthetic Data Generation User Inputs

Here are example details of inputs that SDG can take in:

# Index mapping example
{
  "mappings": {
    "properties": {
      "pickup_datetime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "trip_type": {
        "type": "keyword"
      },
      "passenger_count": {
        "type": "integer"
      },
      "tip_amount": {
        "type": "half_float"
      },
      "dropoff_location": {
        "type": "geo_point"
      },
      "vendor_name": {
        "type": "text"
      },
      "total_amount": {
        "scaling_factor": 100,
        "type": "scaled_float"
      },
      "currency": {
        "type": "text"
      }
    },
    "dynamic": "strict"
  }
}

If users only provide the index mappings, SDG will grab basic data generators (such as NumericGenerator, FloatGenerator, and TextGenerator) to generate random data without restrictions.

# Template Document
{
  "pickup_datetime": {{ DATETIME(["2023-09-01", "2023-12-01"], ["09:00:00", "17:00:00"], "ISO-8601") }}
  "trip_type": "city-limits",
  "passenger_count": 1,
  "tip_amount": {{ NUMERIC(1.0, 22.0,'float') }}
  "dropoff_location": [
    -73.91363525390625,
    40.76552200317383
  ],
  "vendor_name": "New York's Finest Cabs",
  "total_amount": 79.94,
  "currency": {{ CURRENCY('USD') }}
}

The template document above is one that could be provided to SDG. SDG would recognize that there are three specific data generators that user wants to use:
- DATETIME: User has specified that the random values generated should be between 2023-09-01 and 2023-12-01 and should be between the times of 9AM to 5PM. The output should also be in ISO-8601 format.
- NUMERIC: User has specified that the tip_amount field should have random values from 1.0 to 22.0 and should be returned as a float.
- CURRENCY: User has specified that all of the values in currency field should be “USD”.
For all other fields, SDG will do its best to detect the mapping field type and assign a basic data generators based on them.
If a user also provided an index mapping, SDG will use that index mapping to validate the return type of the data generator the user has specified and determine the other mapping field types.

# Workload Profile Output
{
    "indices": [
        {
            "name": "nyc_taxis",
            "primary_shard_size": "19GB",
            "mappings": {
                "properties": {
                  "pickup_datetime": {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss"
                  },
                  "trip_type": {
                    "type": "keyword"
                  },
                  "passenger_count": {
                    "type": "integer"
                  },
                  "tip_amount": {
                    "type": "half_float"
                  },
                  "dropoff_location": {
                    "type": "geo_point"
                  },
                  "vendor_name": {
                    "type": "text"
                  },
                  "total_amount": {
                    "scaling_factor": 100,
                    "type": "scaled_float"
                  },
                  "currency": {
                    "type": "text"
                  }
                },
                "dynamic": "strict"
            },
            "field_distributions": {
                "sample_size": 10000,
                "pickup_datetime": {
                  "type": "date",
                  "min": "2023-01-01 00:00:00",
                  "max": "2023-12-31 23:59:59",
                  "distribution": [
                    {"range": "2023-01-01 to 2023-03-31", "count": 2500, "percentage": 25},
                    {"range": "2023-04-01 to 2023-06-30", "count": 3000, "percentage": 30},
                    {"range": "2023-07-01 to 2023-09-30", "count": 2800, "percentage": 28},
                    {"range": "2023-10-01 to 2023-12-31", "count": 1700, "percentage": 17}
                  ]
                },
                "trip_type": {
                  "type": "keyword",
                  "disctinct_value_count": 3,
                  "avg_value_length": 7,
                  "distribution": [
                    {"value": "standard", "count": 7000, "percentage": 70},
                    {"value": "premium", "count": 2500, "percentage": 25},
                    {"value": "shared", "count": 500, "percentage": 5}
                  ]
                },
                "passenger_count": {
                  "type": "integer",
                  "min": 1,
                  "max": 6,
                  "mean": 2.3,
                  "median": 2,
                  "mode": 1,
                  "distribution": [
                    {"value": 1, "count": 4500, "percentage": 45},
                    {"value": 2, "count": 3000, "percentage": 30},
                    {"value": 3, "count": 1500, "percentage": 15},
                    {"value": 4, "count": 700, "percentage": 7},
                    {"value": 5, "count": 200, "percentage": 2},
                    {"value": 6, "count": 100, "percentage": 1}
                  ]
                },
                "tip_amount": {
                  "type": "half_float",
                  "min": 0.0,
                  "max": 50.0,
                  "mean": 8.75,
                  "median": 7.5,
                  "distribution": [
                    {"range": "0.0 - 5.0", "count": 3000, "percentage": 30},
                    {"range": "5.1 - 10.0", "count": 4000, "percentage": 40},
                    {"range": "10.1 - 15.0", "count": 2000, "percentage": 20},
                    {"range": "15.1 - 20.0", "count": 700, "percentage": 7},
                    {"range": "20.1+", "count": 300, "percentage": 3}
                  ]
                },
                "dropoff_location": {
                  "type": "geo_point",
                  "bounding_box": {
                    "top_left": {"lat": 40.9176, "lon": -74.2590},
                    "bottom_right": {"lat": 40.4774, "lon": -73.7004}
                  },
                  "hotspots": [
                    {"location": {"lat": 40.7128, "lon": -74.0060}, "count": 2000, "percentage": 20},
                    {"location": {"lat": 40.7484, "lon": -73.9857}, "count": 1500, "percentage": 15}
                  ]
                },
                "vendor_name": {
                  "type": "text",
                  "unique_values": 10,
                  "top_values": [
                    {"value": "Yellow Cab", "count": 4000, "percentage": 40},
                    {"value": "Green Taxi", "count": 3000, "percentage": 30},
                    {"value": "Uber", "count": 2000, "percentage": 20}
                  ]
                },
                "total_amount": {
                  "type": "scaled_float",
                  "min": 5.00,
                  "max": 150.00,
                  "mean": 35.75,
                  "median": 30.00,
                  "distribution": [
                    {"range": "5.00 - 20.00", "count": 2000, "percentage": 20},
                    {"range": "20.01 - 40.00", "count": 4000, "percentage": 40},
                    {"range": "40.01 - 60.00", "count": 2500, "percentage": 25},
                    {"range": "60.01 - 80.00", "count": 1000, "percentage": 10},
                    {"range": "80.01+", "count": 500, "percentage": 5}
                  ]
                },
                "currency": {
                  "type": "text",
                  "unique_values": 2,
                  "distribution": [
                    {"value": "USD", "count": 9500, "percentage": 95},
                    {"value": "KRW", "count": 500, "percentage": 5}
                  ]
                }
            }
        }
    ]
}

This shows how the API would sample 10,000 documents from an index called nyc_taxis and output the following stats on it. SDG can use these stats to gauge how it should generate random data to be more like the production data stats. Users can also alter the number of documents that this API samples. Outside of SDG, users and stakeholders can still find value in an API like this to get insight into their cluster's workload.

How Can You Help?

Any general comments about the overall direction are welcome.
Provide early feedback by testing the new workload features as they become available.
Help out on the implementation! Check out the issues page for work that is ready to be picked up.

Next Steps

We will incorporate feedback and add more details on design, implementation and prototypes as they become available.

The text was updated successfully, but these errors were encountered:

IanHoang added RFC Request for comment on major changes untriaged labels Feb 25, 2025

opensearch-infra bot added this to OpenSearch Roadmap Feb 25, 2025

github-project-automation bot moved this to New in OpenSearch Roadmap Feb 25, 2025

IanHoang self-assigned this Feb 25, 2025

IanHoang removed the untriaged label Feb 25, 2025

IanHoang added this to Engineering Effectiveness Board Feb 25, 2025

IanHoang moved this to 🏗 In progress in Engineering Effectiveness Board Feb 25, 2025

IanHoang moved this from New to In Progress in OpenSearch Roadmap Feb 25, 2025

IanHoang added this to OpenSearch Benchmark Roadmap Feb 25, 2025

IanHoang moved this to This Quarter in OpenSearch Benchmark Roadmap Feb 25, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC/PROPOSAL]: Expanding Create-Workload with Synthetic Data Generation #759

[RFC/PROPOSAL]: Expanding Create-Workload with Synthetic Data Generation #759

IanHoang commented Feb 25, 2025 •

edited

Loading

[RFC/PROPOSAL]: Expanding Create-Workload with Synthetic Data Generation #759

[RFC/PROPOSAL]: Expanding Create-Workload with Synthetic Data Generation #759

Comments

IanHoang commented Feb 25, 2025 • edited Loading

Summary

Proposal

Synthetic Data Generation Module High-Level Design

Appendix

Walkthrough for Synthetic Data Generation Process

Example Synthetic Data Generation User Inputs

How Can You Help?

Next Steps

IanHoang commented Feb 25, 2025 •

edited

Loading