162 changes: 162 additions & 0 deletions examples/sldbench/README.md
@@ -0,0 +1,162 @@
# SLDBench — Scaling Law Discovery Benchmark

## Introduction

**SLDBench** is a benchmark for discovering scaling laws, originally introduced in the paper [*Can Language Models Discover Scaling Laws?*](https://arxiv.org/abs/2507.21184) by Lin et al. It aggregates over **5,000 LLM training experiments** from the recent scaling-law literature into a unified dataset, hosted on the Hugging Face Hub at [`pkuHaowei/sldbench`](https://huggingface.co/datasets/pkuHaowei/sldbench).

Also see this [blog post](https://algorithmicsuperintelligence.ai/blog/openevolve-sldagent/) for a quick introduction to OpenEvolve x SLDBench.

## Overview

SLDBench focuses on **discovery** rather than simple curve fitting. The agent must identify:

- A **symbolic law** $f_\theta(x)$ (the functional form).
- A **parameter fitting routine** that generalizes across multiple training scenarios.

**Key Features:**

- **Data Source:** All task data is pulled dynamically from the Hugging Face dataset.
- **Extrapolation Evaluation:** Models are trained on smaller-scale runs and strictly evaluated on held-out, larger-scale configurations to test predictive capability.
- **Evolutionary Loop:** OpenEvolve iteratively mutates and evaluates candidate implementations of `scaling_law_func(...)` (the symbolic law) and `fit_scaling_law(...)` (the optimizer).

------

## SLDBench Tasks

There are currently 7 core scaling-law discovery tasks, each derived from real-world LLM experiments. Configuration files for these tasks are located in `examples/sldbench/configs/`.

| **Task Name (Config)** | **Scenario** | **Inputs (X)** | **Target (y)** |
| -------------------------------- | -------------------------------------------- | ---------------------------------------------------- | ------------------------------ |
| **parallel_scaling_law** | Parallel / Best-of-N inference scaling. | Model size $N$, Parallelism $P$ | Loss $L(N, P)$ |
| **vocab_scaling_law** | Vocabulary size vs. model/data scaling. | Non-vocab size $N$, Vocab size $V$, Dataset size $D$ | Unigram-normalized loss $L$ |
| **sft_scaling_law** | Supervised Fine-Tuning (SFT). | SFT dataset size $D$ | Fine-tuning loss $L(D)$ |
| **domain_mixture_scaling_law** | Multi-domain pre-training mixtures. | Domain mixture proportions $r$ | Per-domain losses $\{L_i(r)\}$ |
| **moe_scaling_law** | Mixture-of-Experts (MoE) scaling. | Network size $N$, Experts $E$ | Pre-training loss $L(N, E)$ |
| **data_constrained_scaling_law** | Data-constrained pre-training regimes. | Model size $N$, Dataset size $D$, Unique tokens $U$ | Loss $L(N, D, U)$ |
| **lr_bsz_scaling_law** | Joint Learning Rate / Batch Size (Step Law). | LR $l$, Batch size $b$, Dataset $D$, Model $N$ | Loss $L(l, b, D, N)$ & Optima |

> **Note:** A task named `easy_question_scaling_law` is also included for U-shape scaling studies, though it is not covered in the referenced paper.

------

## File Structure

- `configs/` — YAML configuration files defining data splits, features, targets, and evaluation settings for each task.
- `data_loader.py` — Unified data loader for the `pkuHaowei/sldbench` Hugging Face dataset (see the sketch after this list).
- `evaluator.py` — The evaluation framework; handles data splitting (train/extrapolate) and metric computation.
- `init_program.py` — The seed implementation (a power-law–style baseline) to jumpstart the evolutionary search.
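
The loader pulls each task's data from the Hugging Face Hub. A minimal sketch of fetching it directly with the `datasets` library (the per-task configuration name and split layout are assumptions; `data_loader.py`'s actual interface may differ):

```python
from datasets import load_dataset

# "sft_scaling_law" as a configuration name is an assumption;
# check data_loader.py for the real task-to-subset mapping
ds = load_dataset("pkuHaowei/sldbench", name="sft_scaling_law")
print(ds)
```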

------

## Usage

### Configuration Prerequisites

Before running any tasks, ensure your API key environment variable is set for your API provider (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).

### Running Individual Tasks

To run the evolutionary process for a specific task:

```bash
python openevolve-run.py \
examples/sldbench/init_program.py \
examples/sldbench/evaluator.py \
--config examples/sldbench/configs/sft_scaling_law.yaml \
--api-base "https://api.openai.com/v1" \
--iterations 50
```

To switch tasks, simply point the `--config` argument to a different YAML file found in `examples/sldbench/configs/`.

### Automated Benchmark Script

A complete benchmark script `run.sh` is provided for running multiple tasks across different models with parallelism:

```bash
cd examples/sldbench
chmod +x run.sh
./run.sh 4 # Run with parallelism degree of 4 per model
```

**Note**: Configure the `API_BASE` variable in the script (defaults to `https://api.openai.com/v1`) and ensure your API key environment variable is set.

The script handles evolution and evaluation automatically, storing results in `./results/`.

### Important: Test Set Evaluation

**Note:** The `openevolve-run` command only evaluates programs on the **training set** during evolution. To compute final metrics on the **test set**, you must explicitly run:

```bash
python evaluator.py "path/to/generated_program.py"
```

The `evaluator.py` script, when run in `__main__` mode, computes metrics on the held-out extrapolation test set, which is the proper way to evaluate the discovered scaling laws' predictive capability.
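
For example, to score the best program from an evolution run (the path below assumes OpenEvolve's default `openevolve_output/best/best_program.py` layout; adjust it to your run's output directory):

```bash
python evaluator.py examples/sldbench/openevolve_output/best/best_program.py
```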

------

## Data Format & Evaluation

Each task is formulated as a scaling-law discovery problem containing:

1. **Features ($X$):** Input variables (e.g., $N, D, \text{LR}, \text{Batch Size}$).
2. **Targets ($y$):** Performance metrics (typically training or validation loss).
3. **Groups:** Control indices representing distinct experimental settings (e.g., different model architectures) that share the law *form* but require distinct fitted *parameters*.
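
A minimal sketch of the grouped-fitting idea (the helper and its signature are illustrative, not the evaluator's actual API):

```python
import numpy as np

def fit_per_group(X, y, groups, fit_scaling_law):
    # Shared functional form, but a separate fitted parameter
    # vector for each experimental group
    return {g: fit_scaling_law(X[groups == g], y[groups == g])
            for g in np.unique(groups)}
```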

### The Evaluation Process

1. **Splitting:** The evaluator partitions data into **training** and **extrapolation test** sets. The largest models or datasets are explicitly held out to mirror real-world forecasting needs.
2. **Fitting:** The `fit_scaling_law` function optimizes parameters on the training portion for each group.
3. **Scoring:** The fitted law is applied to the test set to compute the following metrics:

- **NMSE:** Normalized Mean Squared Error
- **NMAE:** Normalized Mean Absolute Error
- **$R^2$:** Coefficient of Determination
- **Combined Score:** A single scalar summary (currently equivalent to $R^2$).

*Higher combined scores indicate superior extrapolation quality.*
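
A sketch of how such normalized metrics can be computed (the exact normalizations used in `evaluator.py` are not shown here; the choices below are assumptions):

```python
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)
    nmse = mse / np.var(y_true)                                  # MSE normalized by variance
    nmae = np.mean(np.abs(residual)) / np.mean(np.abs(y_true))   # MAE normalized by mean |y|
    r2 = 1.0 - mse / np.var(y_true)                              # coefficient of determination
    return {"nmse": nmse, "nmae": nmae, "r2": r2, "combined_score": r2}
```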

------

## Evolution Markers

OpenEvolve modifies code explicitly wrapped in evolution blocks. The agent evolves the symbolic form and the optimizer simultaneously:

```python
# EVOLVE-BLOCK-START
def scaling_law_func(data_points, params):
# Returns predicted values given inputs and parameters
pass

def fit_scaling_law(data_points, loss_values):
# Optimizes parameters to fit the scaling law
pass
# EVOLVE-BLOCK-END
```

The system mutates these blocks, evaluates them via `evaluator.py`, and maintains a database of the highest-performing implementations.
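
For reference, a naive power-law seed in the spirit of `init_program.py` might look like the following (a sketch under assumed data shapes, not the shipped baseline):

```python
import numpy as np
from scipy.optimize import minimize

# EVOLVE-BLOCK-START
def scaling_law_func(data_points, params):
    # Saturating power law L(x) = c + a * x^(-b) on the first feature column
    x = np.asarray(data_points, dtype=float).reshape(len(data_points), -1)[:, 0]
    a, b, c = params
    return c + a * np.power(x, -b)

def fit_scaling_law(data_points, loss_values):
    # Least-squares fit with BFGS from a fixed initialization
    def objective(p):
        pred = scaling_law_func(data_points, p)
        return float(np.mean((pred - loss_values) ** 2))
    return minimize(objective, x0=np.array([1.0, 0.5, 1.0]), method="BFGS").x
# EVOLVE-BLOCK-END
```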

## Requirements

```bash
pip install datasets numpy scipy
# Ensure the latest version of openevolve is installed
```

## Citation

If you use SLDBench, this example, or derived results in your work, please cite the original paper:

```bibtex
@article{lin2025sldbench,
title = {Can Language Models Discover Scaling Laws?},
author = {Lin, Haowei and Ye, Haotian and Feng, Wenzheng and Huang, Quzhe and
Li, Yujun and Lim, Hubert and Li, Zhengrui and Wang, Xiangyu and
Ma, Jianzhu and Liang, Yitao and Zou, James},
journal = {arXiv preprint arXiv:2507.21184},
year = {2025}
}
```
88 changes: 88 additions & 0 deletions examples/sldbench/configs/data_constrained_scaling_law.yaml
@@ -0,0 +1,88 @@
# Configuration for data constrained scaling law discovery with OpenEvolve
max_iterations: 50
checkpoint_interval: 1
log_level: "INFO"
random_seed: 42

# LLM configuration
llm:
primary_model: null
primary_model_weight: 1.0
secondary_model: null
secondary_model_weight: 0.0
api_base: ""
max_tokens: 16384
timeout: 240
retries: 10
retry_delay: 10

# Prompt configuration
prompt:
system_message: |
You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between training data characteristics and model loss under data-constrained conditions.

**IMPORTANT: The scaling law function must use no more than 7 parameters.**

Focus on mathematical accuracy across different data scales, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.

**DATA CHARACTERISTICS (182 total data points):**
- Features: [unique_tokens, params, tokens] - 3D input
- Labels: loss - scalar output
- Dataset size: 161
    - Parameter range (P): 1.1e8 to 1.1e9 (110M to 1.1B parameters)
- Token count range (D): 1e9 to 1e12 tokens
- Unique tokens range: 1e7 to 5e8 unique tokens
- Loss range: 1.8 to 7.2 (cross-entropy loss)
- Model architectures: Transformer variants with different parameterizations
- Data explores scaling under token/unique-token constraints

The function signatures must remain:

```python
    def scaling_law_func(data_points, params):
        # data_points: (N,3) array with columns [unique_tokens, params, tokens]
        #   unique_tokens: number of unique training tokens
        #   params: model parameter count
        #   tokens: total training token count
        # params (second argument): array of up to 7 fitted parameters
        # Returns: predicted loss values

def fit_scaling_law(data_points, loss_values):
# data_points: (N,3) array with columns [unique_tokens, params, tokens]
# loss_values: Array of corresponding loss values
# Returns: Optimized parameters (up to 7 parameters)
```

Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.

    You are not allowed to use input-dependent features in `scaling_law_func` (e.g., median / min / max of the inputs).

num_top_programs: 3
num_diverse_programs: 2
use_template_stochasticity: true

# Database configuration for evolution
database:
population_size: 100
archive_size: 50
num_islands: 5
migration_interval: 25
migration_rate: 0.1
elite_selection_ratio: 0.1
exploration_ratio: 0.2
exploitation_ratio: 0.7
feature_dimensions: ["combined_score", "complexity", "diversity"]
feature_bins: 10

# Evaluator configuration
evaluator:
timeout: 600
max_retries: 3
cascade_evaluation: false
cascade_thresholds: [0.3, 0.6]
parallel_evaluations: 4
use_llm_feedback: false

# Evolution settings
diff_based_evolution: false
max_code_length: 100000
85 changes: 85 additions & 0 deletions examples/sldbench/configs/domain_mixture_scaling_law.yaml
@@ -0,0 +1,85 @@
# Configuration for domain mixture scaling law discovery with OpenEvolve
max_iterations: 50
checkpoint_interval: 1
log_level: "INFO"
random_seed: 42

# LLM configuration
llm:
primary_model: null
primary_model_weight: 1.0
secondary_model: null
secondary_model_weight: 0.0
api_base: ""
max_tokens: 16384
timeout: 240
retries: 10
retry_delay: 10

# Prompt configuration
prompt:
system_message: |
You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between domain mixture proportions and multi-domain loss values across different model sizes.

**IMPORTANT: The scaling law function must use no more than 35 parameters.**

Focus on mathematical accuracy across different model sizes, cross-domain generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.

**DATA CHARACTERISTICS**
- Features: Domain proportions (5 domains) - array of shape (n_mixtures, 5)
- Labels: Multi-domain losses (5 domains) - array of shape (n_mixtures, 5)
- Dataset size: 80 training (20 per model size)
- Model parameter sizes: 70M, 160M, 410M, 1B parameters (4 separate groups)
- Domain proportions: Each row sums to 1.0 (mixture weights)
    - Loss ranges: per-domain cross-entropy losses span 1.8 to 4.2
- Mixture configurations: Systematic exploration of different domain weight combinations
- This is a multi-output regression problem with correlated domain performances

The function signatures must remain:

```python
    def scaling_law_func(data_points, params):
        # data_points: (N,5) array of domain mixture proportions (rows sum to 1.0)
        # params: array of up to 35 fitted parameters
        # Returns: predicted multi-domain loss values, shape (N,5)

def fit_scaling_law(data_points, loss_values):
# data_points: (N,5) array with domain proportions for 5 domains
# loss_values: Array of corresponding multi-domain losses (N,5)
# Returns: Optimized parameters (up to 35 parameters)
```

Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.

    You are not allowed to use input-dependent features in `scaling_law_func` (e.g., median / min / max of the inputs).

num_top_programs: 3
num_diverse_programs: 2
use_template_stochasticity: true

# Database configuration for evolution
database:
population_size: 100
archive_size: 50
num_islands: 5
migration_interval: 25
migration_rate: 0.1
elite_selection_ratio: 0.1
exploration_ratio: 0.2
exploitation_ratio: 0.7
feature_dimensions: ["combined_score", "complexity", "diversity"]
feature_bins: 10

# Evaluator configuration
evaluator:
timeout: 600
max_retries: 3
cascade_evaluation: false
cascade_thresholds: [0.3, 0.6]
parallel_evaluations: 4
use_llm_feedback: false

# Evolution settings
diff_based_evolution: false
max_code_length: 100000