Create-workload, a feature in OpenSearch Benchmark (OSB), currently extracts data from existing OpenSearch clusters and generates simplistic custom workloads. This proposal enhances create-workload to give users flexibility and control over how data corpora are generated in custom workloads. The team plans to add more options for data extraction and to provide a mechanism for users to synthetically generate data based on index mappings. This will enable users to build larger, more representative custom workloads to benchmark their use-cases.
To keep this RFC focused, it covers only the new extraction methods for create-workload; a separate RFC covers adding synthetic data generation to OSB.
The work proposed here can be done after the synthetic data generator has been implemented.
Problem Statement
OpenSearch users and developers are experiencing interconnected pain points when creating custom workloads:
Representation: OpenSearch users and developers face challenges in building large workloads because the create-workload feature requires pre-existing large volumes of data. In addition, users often want smaller workloads that are still representative of their production data.
Scalability: Even when users have large volumes of pre-existing data to build a custom workload from, extracting that data is arduous because create-workload has only a single, failure-prone extraction method. Users are also limited in workload size and are unable to create workloads on the order of terabytes, because no tooling exists to generate such large workloads synthetically.
Privacy: Many OpenSearch users are hesitant to use actual production data in custom workloads because it contains sensitive or proprietary information.
All of these pain points contribute to a central issue: users are unable to build realistic custom workloads that accurately model production metric patterns because they lack the tools and guidance to do so.
Current Design of Create-Workload
The current iteration of create-workload is straightforward. Users specify one or more indices, and the feature extracts all documents present in those indices and generates an OSB workload that users can run. Users can optionally attach their own queries; otherwise, OSB automatically provides a few common operations found in all pre-packaged workloads.
Figure 1: Current create-workload workflow
OSB users have also experienced the create-workload gaps laid out in this RFC.
Current State of Extracting Data
Currently, OSB offers only one way of extracting data: a match-all scan and scroll query. This method creates a snapshot of the index state and allows OSB to paginate through large result sets. It sacrifices the order of the documents in favor of retrieving them more efficiently.
This extraction strategy can become strained, especially when indices are larger than 200 GB. When the scan and scroll is slow, users may experience read timeouts, and the extraction process fails, resulting in an incomplete workload. An incomplete workload forces the user to build the rest by hand, which is a cumbersome and frustrating process. Even when the extraction was close to finishing, a failure leaves users deciding between using the incomplete data corpus or starting the extraction over.
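For illustration, the current approach amounts to something like the following, shown here as plain request bodies rather than opensearch-py client calls (the index name, page size, and scroll timeout are placeholders):

```python
# Hypothetical sketch of the requests behind the current extraction:
# a match-all search that opens a scroll context, followed by repeated
# scroll calls until the result set is exhausted.
INITIAL_REQUEST = {
    "path": "/my-index/_search?scroll=10m",  # placeholder index name
    "body": {"size": 1000, "query": {"match_all": {}}},
}

def next_scroll_request(scroll_id):
    """Each follow-up call carries only the scroll id. There is no
    position or sort key to resume from, which is why a failure
    mid-stream forces the extraction to restart."""
    return {
        "path": "/_search/scroll",
        "body": {"scroll": "10m", "scroll_id": scroll_id},
    }
```

This also makes the failure mode concrete: once a scroll context times out, the `scroll_id` is the only state the client holds, and it cannot be used to recover progress.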
Proposal
To resolve the problems above, the OSB team proposes addressing the data aspect of creating custom workloads by incorporating a synthetic data generator into OSB. Additionally, the team plans to add new methods of extracting existing data from OpenSearch clusters in create-workload.
These new features would give users more control over how data corpora are obtained when building custom workloads.
A synthetic data generator in OSB will allow users to generate production-like data without compromising sensitive information. It will also enable users to create large-scale workloads without relying on pre-existing cluster data.
More extraction methods in create-workload would make the process of extracting data more efficient and less cumbersome.
The synthetic data generator will be added as a new module in OSB; its workflow can be invoked independently, work with create-workload, and integrate with future OSB initiatives (such as anonymization and streaming with real-time data generation). Synthetic data generation is explored in a separate RFC.
The additional extraction methods will be added after synthetic data generation is implemented and will extend create-workload's existing architecture.
With these new features, users will have the tools and guidance necessary to build scalable custom workloads that model production environments, which are needed to improve benchmarking capabilities and gain insight into cluster reliability.
User Stories
As an OpenSearch user and developer, I want precise control over the documents extracted from my production data so that I can build a custom workload.
As an OpenSearch user, I would like to create custom workloads tailored to my use-case so I can answer performance questions related to my cluster’s configurations and catch regressions before they happen.
Assumptions
Create-workload should be able to swap out its simplistic extraction method (a scan and scroll query) for other extraction techniques. The extraction method can be chosen by the user based on their workload type and size.
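As a sketch of what such a pluggable design could look like (class, method, and strategy names here are hypothetical, not OSB's actual API):

```python
# Hypothetical sketch of a pluggable extraction interface for
# create-workload. Names are invented for illustration.
from abc import ABC, abstractmethod

class ExtractionStrategy(ABC):
    """Swap-in point for scan/scroll, timestamp-sorted, composite
    aggregation, or future streaming-based extraction."""

    name = "base"

    @abstractmethod
    def extract(self, client, index):
        """Yield documents from `index` using the given client."""

class ScanScrollStrategy(ExtractionStrategy):
    """The current default behavior, wrapped behind the interface."""

    name = "scan-scroll"

    def extract(self, client, index):
        # Delegates to the client's scroll API (omitted in this sketch).
        raise NotImplementedError

# A registry lets users list strategies and select one by name.
STRATEGIES = {cls.name: cls for cls in (ScanScrollStrategy,)}
```

A registry keyed by name maps naturally onto a CLI option, and new strategies can be added without touching the existing scan-and-scroll path.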
Add More Data Extraction Strategies
Similar to how OpenSearch has various ways to query data, create-workload should not rely on a single way to extract data. Create-workload can leverage different types of data extraction methods based on the type of dataset it is extracting. Some examples to illustrate this:
For documents with timestamps, create-workload can sort by the timestamp field and keep track of the order, making it easier to resume if extraction fails. This approach is best suited for smaller corpora.
For larger datasets, users can leverage composite aggregations, which can be more efficient and put less strain on OpenSearch clusters.
Create-workload can provide a mechanism for users to specify custom routines that modify documents on the fly as they are extracted (e.g. anonymization, altering specific fields), helping users further customize their corpora.
To move hundreds of gigabytes of data easily, create-workload can leverage Data Prepper underneath to create a connection from one cluster to another and filter or anonymize the data in transit.
Create-workload can potentially leverage the streaming read API that @rishabhmaurya is developing, which would be a lightweight method compared to scan and scroll pagination. This feature is not yet available in the opensearch-py client; once it is ready, it can become another method for users to easily extract data from their clusters.
Users can list all extraction strategies and choose one that’s optimal for their use-case.
Add More Document Selection Strategies
Several users do not want the entire index, especially if it is very large, yet they still want to capture its essence by sampling documents from throughout the index. To address this, create-workload can incorporate a variety of document selection strategies. These can include:
Fetch every nth document (e.g. grab every other document or every 5th document)
Fetch a number of documents (e.g. grab only 1M documents)
Fetch a total size of documents (e.g. grab 100GB worth of documents)
Filter documents precisely, enabling users to extract a customized workload for the scenarios they are specifically interested in.
Many users have asked for this capability, but it is not currently available to them.
Similar to data extraction strategies, users can list all document selection strategies and choose one that’s optimal for their use-case.
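The selection strategies above can be illustrated as simple filters over a document stream (a hypothetical sketch, not OSB's API; function names are invented):

```python
# Three document selection strategies as lazy filters over a stream.
import json
from itertools import islice

def every_nth(docs, n):
    """Keep every nth document, e.g. n=5 keeps docs 0, 5, 10, ..."""
    return (d for i, d in enumerate(docs) if i % n == 0)

def first_n(docs, n):
    """Stop after n documents (e.g. only 1M documents)."""
    return islice(docs, n)

def size_budget(docs, max_bytes):
    """Stop once the serialized corpus would exceed max_bytes
    (e.g. 100 GB worth of documents)."""
    total = 0
    for d in docs:
        total += len(json.dumps(d).encode())
        if total > max_bytes:
            return
        yield d
```

Because each filter is lazy, these compose with any extraction strategy: documents are selected as they stream out of the cluster rather than after the full corpus is downloaded.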
Implement Checkpointing Strategies
When document extraction fails, users should be able to safely resume it through checkpointing mechanisms. Resuming is not possible with the default extraction strategy, scan and scroll, since it sacrifices the order of returned documents. However, it can potentially be done with the other extraction strategies, especially those that can detect which documents have already been ingested (whether through sorted documents, timestamps, or a record of document IDs).
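A minimal sketch of such a checkpoint, assuming a sorted extraction strategy whose last sort key can be persisted (the file name and format are invented for illustration):

```python
# Persist the last sort key after each page so a failed extraction
# can resume where it left off instead of restarting from scratch.
import json
import os

CHECKPOINT = "extract.checkpoint"  # hypothetical path

def save_checkpoint(last_sort):
    """Write the most recent sort key (the resume position) to disk."""
    with open(CHECKPOINT, "w") as f:
        json.dump({"search_after": last_sort}, f)

def load_checkpoint():
    """Return the saved resume position, or None for a fresh run."""
    if not os.path.exists(CHECKPOINT):
        return None
    with open(CHECKPOINT) as f:
        return json.load(f)["search_after"]
```

On restart, the extraction would call `load_checkpoint()` and feed the result into its first sorted page request; a missing file simply means starting from the beginning.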
How Can You Help?
Any general comments about the overall direction are welcome.
Provide early feedback by testing the new workload features as they become available.
Help out on the implementation! Check out the issues page for work that is ready to be picked up.
Next Steps
We will incorporate feedback and add more details on design, implementation and prototypes as they become available.