
Commit 766e6ef

Reenable and improve preprocess dataset (#472)
## Summary

This PR re-enables, tests, and documents the `preprocess dataset` command. It also changes the format in which prompt and output sizes are specified, and makes the code aware of prefixes.

## Details

- Uses the post-refactor code to re-enable the command.
- Switches the data config over to the same format used by `benchmark run`'s synthetic data, to enable more features and make the command more cohesive with the rest of GuideLLM.
- Adds options for prefixes, including an option to count prefixes toward the token totals, since prefixes are included in input and output tokens and affect performance.

## Test Plan

- Run with a known dataset, or create one as a simple CSV.
- New tests are added that should cover everything except Hugging Face uploads. They are all at least in part generated by AI, but each one was reviewed iteratively to ensure it does what it needs to do.

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [x] Includes AI-assisted code completion
- [x] Includes code generated by an AI application
- [x] Includes AI-generated tests (NOTE: AI-written tests should have a docstring that includes `## WRITTEN BY AI ##`)
2 parents ce0ae7b + 88c2753 commit 766e6ef

File tree

17 files changed (+3250, -907 lines)


docs/guides/datasets.md

Lines changed: 234 additions & 0 deletions
@@ -212,3 +212,237 @@ benchmark_generative_text(data=data, ...)
- For lists of dictionaries, all items must have the same keys.
- For lists of items, all elements must be of the same type.
- A processor/tokenizer is only required if `GUIDELLM__PREFERRED_PROMPT_TOKENS_SOURCE="local"` or `GUIDELLM__PREFERRED_OUTPUT_TOKENS_SOURCE="local"` is set in the environment. In this case, the processor/tokenizer must be specified using the `--processor` argument. If not set, the processor/tokenizer will be set to the model passed in or retrieved from the server.

## Preprocessing Datasets

GuideLLM provides a preprocessing command that processes datasets to specific prompt and output token sizes. This is particularly useful when you need to standardize a dataset for benchmarking, or when its prompts don't match your target token requirements.

The preprocessing command can:

- Resize prompts to target token lengths
- Handle prompts that are shorter or longer than the target length using various strategies
- Map columns from your dataset to GuideLLM's expected column names
- Generate output token counts based on your configuration
- Save the processed dataset in various formats

### Basic Usage

```bash
guidellm preprocess dataset \
  <DATA> \
  <OUTPUT_PATH> \
  --processor <PROCESSOR> \
  --config <CONFIG>
```

### Required Arguments

| Argument | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `DATA` | Path to the input dataset or Hugging Face dataset ID. Supports all dataset formats documented in [Dataset Configurations](../datasets.md). |
| `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). |
| `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or a local path. |
| `--config` | **Required.** Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or a file path (.json, .yaml, .yml, .config). |

### Example

```bash
guidellm preprocess dataset \
  "path/to/input_dataset.jsonl" \
  "path/to/processed_dataset.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100"
```

### Configuration and Processor Options

The `--config` parameter accepts a `PreprocessDatasetConfig` as a JSON string, key=value pairs, or a configuration file path (.json, .yaml, .yml, .config). This configuration is similar to the synthetic data configuration but includes additional fields specific to preprocessing.

**PreprocessDatasetConfig Options:**

- `prompt_tokens`: Average number of tokens in prompts. If nothing else is specified, all prompts are resized to this number of tokens.
- `prompt_tokens_stdev`: Standard deviation for prompt tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
- `prompt_tokens_min`: Minimum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the minimum is 1.
- `prompt_tokens_max`: Maximum number of tokens in prompts. If unset and `prompt_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `output_tokens`: Average number of tokens in outputs. If nothing else is specified, all outputs will have this number of tokens.
- `output_tokens_stdev`: Standard deviation for output tokens. If not supplied and min/max are not specified, no deviation is applied. If not supplied and min/max are specified, a uniform distribution is used.
- `output_tokens_min`: Minimum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the minimum is 1.
- `output_tokens_max`: Maximum number of tokens in outputs. If unset and `output_tokens_stdev` is set, the maximum is 5 times the standard deviation.
- `prefix_tokens_max`: Maximum number of prefix tokens to keep. If set, prefixes are trimmed to this maximum length. If not set, prefixes are kept as-is (unless `--include-prefix-in-token-count` is used, which disables prefix trimming).
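The interplay of the mean, stdev, min, and max fields above can be sketched as a small sampling helper. This is an illustrative approximation of the rules as documented, not GuideLLM's actual implementation; the function name is hypothetical.

```python
import random

def sample_token_count(mean, stdev=None, minimum=None, maximum=None, rng=None):
    """Illustrative sketch of the sampling rules above (not GuideLLM's code)."""
    rng = rng or random.Random()
    if stdev is None:
        if minimum is None and maximum is None:
            return mean  # no deviation: every sample equals the mean
        # min/max without stdev: uniform distribution over the range
        return rng.randint(minimum, maximum)
    # stdev given: normal distribution, clamped to the (defaulted) bounds
    minimum = 1 if minimum is None else minimum
    maximum = 5 * stdev if maximum is None else maximum
    value = round(rng.gauss(mean, stdev))
    return max(minimum, min(maximum, value))
```

With only `mean` given, every sample is exactly `mean`; with min/max but no stdev, samples are uniform over the range.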
**Example configurations:**

```bash
# Using key=value pairs
--config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100"

# Using JSON string
--config '{"prompt_tokens": 512, "output_tokens": 256, "prefix_tokens_max": 100}'

# Using a configuration file
--config "path/to/config.json"
```
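The JSON-string and key=value forms can be interpreted roughly as follows. This is a sketch of the accepted syntax, not GuideLLM's parser, which also handles file paths and non-integer fields:

```python
import json

def parse_config(text):
    """Sketch: turn a --config value into a dict of config fields."""
    text = text.strip()
    if text.startswith("{"):
        return json.loads(text)  # JSON-string form
    # key=value form: comma-separated pairs, all token-count fields here
    return {
        key.strip(): int(value)
        for key, value in (pair.split("=", 1) for pair in text.split(","))
    }
```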
The `--processor` argument specifies the tokenizer used to calculate token counts. It is required because the preprocessing command must tokenize prompts to ensure they match the target token sizes. For information about using processors, including Hugging Face model IDs, local paths, and processor arguments, see the [Data Arguments Overview](../datasets.md#data-arguments-overview) section.

### Column Mapping

When your dataset uses non-standard column names, you can use `--data-column-mapper` to map your columns to GuideLLM's expected column names. This is particularly useful when:

1. **Your dataset uses different column names** (e.g., `question` instead of `prompt`, `instruction` instead of `text_column`)
2. **You have multiple datasets** and need to specify which dataset's columns to use
3. **Your dataset has system prompts or prefixes** in a separate column

**Column mapping format:** The `--data-column-mapper` option accepts a JSON string mapping column types to column names:

```json
{
  "text_column": "question",
  "prefix_column": "system_prompt",
  "prompt_tokens_count_column": "input_tokens",
  "output_tokens_count_column": "completion_tokens"
}
```

**Supported column types:**

- `text_column`: The main prompt text (defaults: `prompt`, `instruction`, `question`, `input`, `context`, `content`, `text`)
- `prefix_column`: System prompt or prefix (defaults: `system_prompt`, `system`, `prefix`)
- `prompt_tokens_count_column`: Column containing prompt token counts (defaults: `prompt_tokens_count`, `input_tokens_count`)
- `output_tokens_count_column`: Column containing output token counts (defaults: `output_tokens_count`, `completion_tokens_count`)
- `image_column`: Image data column
- `video_column`: Video data column
- `audio_column`: Audio data column

**Example: Mapping custom column names**

If your dataset is a CSV file with columns `user_query` and `system_message`:

```csv
user_query,system_message
"What is AI?","You are a helpful assistant."
"How does ML work?","You are a technical expert."
```

You would use:

```bash
guidellm preprocess dataset \
  "dataset.csv" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --data-column-mapper '{"text_column": "user_query", "prefix_column": "system_message"}'
```

**Example: Multiple datasets**

If you're working with multiple datasets and need to specify which dataset's columns to use, use the format `<dataset_index>.<column_name>` or `<dataset_name>.<column_name>`:

```bash
--data-column-mapper '{"text_column": "0.prompt", "prefix_column": "1.system"}'
```
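The resolution order described above (an explicit mapping wins, otherwise common default names are tried) can be sketched for a single row. The function name and the reduced default table are illustrative, not GuideLLM's actual code:

```python
# Default candidate names, taken from the "Supported column types" list above.
DEFAULT_COLUMNS = {
    "text_column": ["prompt", "instruction", "question", "input", "context", "content", "text"],
    "prefix_column": ["system_prompt", "system", "prefix"],
}

def resolve_columns(row, mapping=None):
    """Sketch: normalize one dataset row to GuideLLM's expected column types."""
    mapping = mapping or {}
    resolved = {}
    for column_type, candidates in DEFAULT_COLUMNS.items():
        # an explicit --data-column-mapper entry wins; otherwise try defaults
        names = [mapping[column_type]] if column_type in mapping else candidates
        for name in names:
            if name in row:
                resolved[column_type] = row[name]
                break
    return resolved
```

For the CSV example above, mapping `user_query` to `text_column` and `system_message` to `prefix_column` pulls both values out of the row.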
### Handling Short Prompts

When prompts are shorter than the target token length, you can specify how to handle them using `--short-prompt-strategy`:

| Strategy | Description |
| ------------- | ------------------------------------------------------------------------------ |
| `ignore` | Skip prompts that are shorter than the target length (default) |
| `concatenate` | Concatenate multiple short prompts together until the target length is reached |
| `pad` | Pad short prompts with a specified character to reach the target length |
| `error` | Raise an error if a prompt is shorter than the target length |

**Example: Concatenating short prompts**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --short-prompt-strategy "concatenate" \
  --concat-delimiter "\n\n"
```

**Example: Padding short prompts**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --short-prompt-strategy "pad" \
  --pad-char " "
```
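The four strategies in the table above can be sketched in a few lines. Whitespace splitting stands in for the real `--processor` tokenizer here, so the counts are only approximate, and the function is illustrative rather than GuideLLM's implementation:

```python
def handle_short_prompt(prompt, target_tokens, strategy,
                        extra_prompts=(), concat_delimiter=" ", pad_char=" "):
    """Sketch of the --short-prompt-strategy options (illustrative only)."""
    if len(prompt.split()) >= target_tokens:
        return prompt  # already long enough, nothing to do
    if strategy == "ignore":
        return None  # skip this prompt (the default)
    if strategy == "error":
        raise ValueError(f"prompt shorter than {target_tokens} tokens")
    if strategy == "pad":
        # append pad characters toward the target length
        # (trivialized here as one pad character per missing token)
        return prompt + pad_char * (target_tokens - len(prompt.split()))
    # strategy == "concatenate": append further prompts until long enough
    for extra in extra_prompts:
        prompt = prompt + concat_delimiter + extra
        if len(prompt.split()) >= target_tokens:
            break
    return prompt
```

Note that `concatenate` can run out of material, which is why the Notes below warn that some prompts may still be skipped.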
### Additional Options

| Option | Description |
| --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `--data-args <JSON>` | JSON string of arguments to pass to dataset loading. See [Data Arguments Overview](../datasets.md#data-arguments-overview) for details. |
| `--include-prefix-in-token-count` | Include prefix tokens in the prompt token count calculation (flag). When enabled, prefix trimming is disabled and the prefix is kept as-is. |
| `--random-seed <NUMBER>` | Random seed for reproducible token sampling (default: 42). |
| `--push-to-hub` | Push the processed dataset to Hugging Face Hub (flag). |
| `--hub-dataset-id <ID>` | Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). |
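The interaction between `prefix_tokens_max` and `--include-prefix-in-token-count` can be sketched as follows. Whitespace splitting again stands in for the real tokenizer, and the function is a hypothetical illustration of the documented rules:

```python
def process_prefix(prefix, prefix_tokens_max=None, include_prefix_in_token_count=False):
    """Sketch of the prefix rules described in this guide (illustrative only)."""
    if include_prefix_in_token_count or prefix_tokens_max is None:
        # the flag disables trimming, so the prefix is kept as-is
        return prefix
    # trim to at most prefix_tokens_max tokens; prefixes are never expanded
    return " ".join(prefix.split()[:prefix_tokens_max])
```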
### Complete Examples

**Example 1: Basic preprocessing with custom column names**

```bash
guidellm preprocess dataset \
  "my_dataset.csv" \
  "processed_dataset.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --data-column-mapper '{"text_column": "user_question", "prefix_column": "system_instruction"}'
```

**Example 2: Preprocessing with distribution and short prompt handling**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,prompt_tokens_stdev=50,output_tokens=256,output_tokens_stdev=25" \
  --short-prompt-strategy "concatenate" \
  --concat-delimiter "\n\n" \
  --random-seed 123
```

**Example 3: Preprocessing with processor arguments and prefix token limits**

```bash
guidellm preprocess dataset \
  "dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --processor-args '{"use_fast": false}' \
  --config "prompt_tokens=512,output_tokens=256,prefix_tokens_max=100" \
  --include-prefix-in-token-count
```

**Example 4: Preprocessing and uploading to Hugging Face Hub**

```bash
guidellm preprocess dataset \
  "my_dataset.jsonl" \
  "processed.jsonl" \
  --processor "gpt2" \
  --config "prompt_tokens=512,output_tokens=256" \
  --push-to-hub \
  --hub-dataset-id "username/processed-dataset"
```

### Notes

- The `--config` parameter accepts a `PreprocessDatasetConfig`, which includes all token count fields (`prompt_tokens`, `output_tokens`, etc.) plus `prefix_tokens_max` for controlling prefix length. See [Configuration and Processor Options](#configuration-and-processor-options) above for all available parameters.
- The processor/tokenizer is required because the preprocessing command must tokenize prompts to ensure they match target token sizes. See the [Data Arguments Overview](../datasets.md#data-arguments-overview) for processor usage details.
- Column mappings are only needed when your dataset uses non-standard column names. GuideLLM automatically tries common column names if no mapping is provided.
- When using `--short-prompt-strategy concatenate`, ensure your dataset has enough samples to concatenate, or some prompts may be skipped.
- The output format is determined by the file extension of `OUTPUT_PATH` (e.g., `.jsonl`, `.csv`, `.parquet`).
- Prefix handling only trims prefixes; it does not expand them. Use `prefix_tokens_max` in the config to set a maximum prefix length, which trims prefixes that exceed this limit.
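Since the saved format follows the `OUTPUT_PATH` suffix, a suffix-based dispatch can be sketched as below. The set of supported suffixes here is illustrative, based only on the extensions mentioned in this guide:

```python
from pathlib import Path

# Illustrative set, based on the extensions mentioned in the Notes above.
SUPPORTED_SUFFIXES = {".jsonl", ".json", ".csv", ".parquet"}

def output_format(output_path):
    """Sketch: pick an output format from the OUTPUT_PATH file suffix."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported output format: {suffix or '(none)'}")
    return suffix.lstrip(".")
```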

src/guidellm/__main__.py

Lines changed: 138 additions & 0 deletions
@@ -30,6 +30,8 @@
```python
import click
from pydantic import ValidationError

from guidellm.data import ShortPromptStrategy, process_dataset

try:
    import uvloop
except ImportError:
```

@@ -486,6 +488,142 @@

```python
def preprocess():
    """Dataset preprocessing utilities."""


@preprocess.command(
    "dataset",
    help=(
        "Process a dataset to have specific prompt and output token sizes. "
        "Supports multiple strategies for handling prompts and optional "
        "Hugging Face Hub upload.\n\n"
        "DATA: Path to the input dataset or dataset ID.\n\n"
        "OUTPUT_PATH: Path to save the processed dataset, including file suffix."
    ),
    context_settings={"auto_envvar_prefix": "GUIDELLM"},
)
@click.argument(
    "data",
    type=str,
    required=True,
)
@click.argument(
    "output_path",
    type=click.Path(file_okay=True, dir_okay=False, writable=True, resolve_path=True),
    required=True,
)
@click.option(
    "--processor",
    type=str,
    required=True,
    help="Processor or tokenizer name for calculating token counts.",
)
@click.option(
    "--config",
    type=str,
    required=True,
    help=(
        "PreprocessDatasetConfig as JSON string, key=value pairs, "
        "or file path (.json, .yaml, .yml, .config). "
        "Example: 'prompt_tokens=100,output_tokens=50,prefix_tokens_max=10'"
        " or '{\"prompt_tokens\": 100, \"output_tokens\": 50, "
        "\"prefix_tokens_max\": 10}'"
    ),
)
@click.option(
    "--processor-args",
    default=None,
    callback=cli_tools.parse_json,
    help="JSON string of arguments to pass to the processor constructor.",
)
@click.option(
    "--data-args",
    callback=cli_tools.parse_json,
    help="JSON string of arguments to pass to dataset creation.",
)
@click.option(
    "--data-column-mapper",
    default=None,
    callback=cli_tools.parse_json,
    help="JSON string of column mappings to apply to the dataset.",
)
@click.option(
    "--short-prompt-strategy",
    type=click.Choice([s.value for s in ShortPromptStrategy]),
    default=ShortPromptStrategy.IGNORE.value,
    show_default=True,
    help="Strategy for handling prompts shorter than target length.",
)
@click.option(
    "--pad-char",
    type=str,
    default="",
    callback=decode_escaped_str,
    help="Character to pad short prompts with when using 'pad' strategy.",
)
@click.option(
    "--concat-delimiter",
    type=str,
    default="",
    help=(
        "Delimiter for concatenating short prompts (used with 'concatenate' strategy)."
    ),
)
@click.option(
    "--include-prefix-in-token-count",
    is_flag=True,
    default=False,
    help="Include prefix tokens in prompt token count calculation.",
)
@click.option(
    "--push-to-hub",
    is_flag=True,
    help="Push the processed dataset to Hugging Face Hub.",
)
@click.option(
    "--hub-dataset-id",
    type=str,
    default=None,
    help=("Hugging Face Hub dataset ID for upload (required if --push-to-hub is set)."),
)
@click.option(
    "--random-seed",
    type=int,
    default=42,
    show_default=True,
    help="Random seed for reproducible token sampling.",
)
def dataset(
    data,
    output_path,
    processor,
    config,
    processor_args,
    data_args,
    data_column_mapper,
    short_prompt_strategy,
    pad_char,
    concat_delimiter,
    include_prefix_in_token_count,
    push_to_hub,
    hub_dataset_id,
    random_seed,
):
    process_dataset(
        data=data,
        output_path=output_path,
        processor=processor,
        config=config,
        processor_args=processor_args,
        data_args=data_args,
        data_column_mapper=data_column_mapper,
        short_prompt_strategy=short_prompt_strategy,
        pad_char=pad_char,
        concat_delimiter=concat_delimiter,
        include_prefix_in_token_count=include_prefix_in_token_count,
        push_to_hub=push_to_hub,
        hub_dataset_id=hub_dataset_id,
        random_seed=random_seed,
    )


@cli.command(
    "mock-server",
    help=(
```