162 changes: 162 additions & 0 deletions examples/sldbench/README.md
@@ -0,0 +1,162 @@
# SLDBench — Scaling Law Discovery Benchmark

## Introduction

**SLDBench** is a benchmark for discovering scaling laws, originally introduced in the paper [*Can Language Models Discover Scaling Laws?*](https://arxiv.org/abs/2507.21184) by Lin et al. It aggregates over **5,000 LLM training experiments** from the recent scaling-law literature into a unified dataset, hosted on the Hugging Face Hub at [`pkuHaowei/sldbench`](https://huggingface.co/datasets/pkuHaowei/sldbench).

Also see this [blog post](https://algorithmicsuperintelligence.ai/blog/openevolve-sldagent/) for a quick introduction to OpenEvolve x SLDBench.

## Overview

SLDBench focuses on **discovery** rather than simple curve fitting. The agent must identify:

- A **symbolic law** $f_\theta(x)$ (the functional form).
- A **parameter fitting routine** that generalizes across multiple training scenarios.

**Key Features:**

- **Data Source:** All task data is pulled dynamically from the Hugging Face dataset.
- **Extrapolation Evaluation:** Models are trained on smaller-scale runs and strictly evaluated on held-out, larger-scale configurations to test predictive capability.
- **Evolutionary Loop:** OpenEvolve iteratively mutates and evaluates candidate implementations of `scaling_law_func(...)` (the symbolic law) and `fit_scaling_law(...)` (the optimizer).

------

## SLDBench Tasks

There are currently 7 core scaling-law discovery tasks, each derived from real-world LLM experiments. Configuration files for these tasks are located in `examples/sldbench/configs/`.

| **Task Name (Config)** | **Scenario** | **Inputs (X)** | **Target (y)** |
| -------------------------------- | -------------------------------------------- | ---------------------------------------------------- | ------------------------------ |
| **parallel_scaling_law** | Parallel / Best-of-N inference scaling. | Model size $N$, Parallelism $P$ | Loss $L(N, P)$ |
| **vocab_scaling_law** | Vocabulary size vs. model/data scaling. | Non-vocab size $N$, Vocab size $V$, Dataset size $D$ | Unigram-normalized loss $L$ |
| **sft_scaling_law** | Supervised Fine-Tuning (SFT). | SFT dataset size $D$ | Fine-tuning loss $L(D)$ |
| **domain_mixture_scaling_law** | Multi-domain pre-training mixtures. | Domain mixture proportions $r$ | Per-domain losses $\{L_i(r)\}$ |
| **moe_scaling_law** | Mixture-of-Experts (MoE) scaling. | Network size $N$, Experts $E$ | Pre-training loss $L(N, E)$ |
| **data_constrained_scaling_law** | Data-constrained pre-training regimes. | Model size $N$, Dataset size $D$, Unique tokens $U$ | Loss $L(N, D, U)$ |
| **lr_bsz_scaling_law** | Joint Learning Rate / Batch Size (Step Law). | LR $l$, Batch size $b$, Dataset $D$, Model $N$ | Loss $L(l, b, D, N)$ & Optima |

> **Note:** A task named `easy_question_scaling_law` is also included for U-shape scaling studies, though it is not covered in the referenced paper.

------

## File Structure

- `configs/` — YAML configuration files defining data splits, features, targets, and evaluation settings for each task.
- `data_loader.py` — Unified data loader for the `pkuHaowei/sldbench` Hugging Face dataset (see the sketch after this list).
- `evaluator.py` — The evaluation framework; handles data splitting (train/extrapolate) and metric computation.
- `init_program.py` — The seed implementation (a power-law–style baseline) to jumpstart the evolutionary search.
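
The loader pulls each task's data from the Hugging Face Hub. A minimal sketch of fetching it directly with the `datasets` library (the per-task configuration name and split layout are assumptions; `data_loader.py`'s actual interface may differ):

```python
from datasets import load_dataset

# "sft_scaling_law" as a configuration name is an assumption;
# check data_loader.py for the real task-to-subset mapping
ds = load_dataset("pkuHaowei/sldbench", name="sft_scaling_law")
print(ds)
```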

------

## Usage

### Configuration Prerequisites

Before running any tasks, ensure your API key environment variable is set for your API provider (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).

### Running Individual Tasks

To run the evolutionary process for a specific task:

```bash
python openevolve-run.py \
examples/sldbench/init_program.py \
examples/sldbench/evaluator.py \
--config examples/sldbench/configs/sft_scaling_law.yaml \
--api-base "https://api.openai.com/v1" \
--iterations 50
```

To switch tasks, simply point the `--config` argument to a different YAML file found in `examples/sldbench/configs/`.

### Automated Benchmark Script

A complete benchmark script `run.sh` is provided for running multiple tasks across different models with parallelism:

```bash
cd examples/sldbench
chmod +x run.sh
./run.sh 4 # Run with parallelism degree of 4 per model
```

**Note**: Configure the `API_BASE` variable in the script (defaults to `https://api.openai.com/v1`) and ensure your API key environment variable is set.

The script handles evolution and evaluation automatically, storing results in `./results/`.

### Important: Test Set Evaluation

**Note:** The `openevolve-run` command only evaluates programs on the **training set** during evolution. To compute final metrics on the **test set**, you must explicitly run:

```bash
python evaluator.py "path/to/generated_program.py"
```

The `evaluator.py` script, when run in `__main__` mode, computes metrics on the held-out extrapolation test set, which is the proper way to evaluate the discovered scaling laws' predictive capability.
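
For example, to score the best program from an evolution run (the path below assumes OpenEvolve's default `openevolve_output/best/best_program.py` layout; adjust it to your run's output directory):

```bash
python evaluator.py examples/sldbench/openevolve_output/best/best_program.py
```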

------

## Data Format & Evaluation

Each task is formulated as a scaling-law discovery problem containing:

1. **Features ($X$):** Input variables (e.g., $N, D, \text{LR}, \text{Batch Size}$).
2. **Targets ($y$):** Performance metrics (typically training or validation loss).
3. **Groups:** Control indices representing distinct experimental settings (e.g., different model architectures) that share the law *form* but require distinct fitted *parameters*.
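
A minimal sketch of the grouped-fitting idea (the helper and its signature are illustrative, not the evaluator's actual API):

```python
import numpy as np

def fit_per_group(X, y, groups, fit_scaling_law):
    # Shared functional form, but a separate fitted parameter
    # vector for each experimental group
    return {g: fit_scaling_law(X[groups == g], y[groups == g])
            for g in np.unique(groups)}
```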

### The Evaluation Process

1. **Splitting:** The evaluator partitions data into **training** and **extrapolation test** sets. The largest models or datasets are explicitly held out to mirror real-world forecasting needs.
2. **Fitting:** The `fit_scaling_law` function optimizes parameters on the training portion for each group.
3. **Scoring:** The fitted law is applied to the test set to compute the following metrics:

- **NMSE:** Normalized Mean Squared Error
- **NMAE:** Normalized Mean Absolute Error
- **$R^2$:** Coefficient of Determination
- **Combined Score:** A single scalar summary (currently equivalent to $R^2$).

*Higher combined scores indicate superior extrapolation quality.*
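
A sketch of how such normalized metrics can be computed (the exact normalizations used in `evaluator.py` are not shown here; the choices below are assumptions):

```python
import numpy as np

def extrapolation_metrics(y_true, y_pred):
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)
    nmse = mse / np.var(y_true)                                  # MSE normalized by variance
    nmae = np.mean(np.abs(residual)) / np.mean(np.abs(y_true))   # MAE normalized by mean |y|
    r2 = 1.0 - mse / np.var(y_true)                              # coefficient of determination
    return {"nmse": nmse, "nmae": nmae, "r2": r2, "combined_score": r2}
```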

------

## Evolution Markers

OpenEvolve modifies code explicitly wrapped in evolution blocks. The agent evolves the symbolic form and the optimizer simultaneously:

```python
# EVOLVE-BLOCK-START
def scaling_law_func(data_points, params):
# Returns predicted values given inputs and parameters
pass

def fit_scaling_law(data_points, loss_values):
# Optimizes parameters to fit the scaling law
pass
# EVOLVE-BLOCK-END
```

The system mutates these blocks, evaluates them via `evaluator.py`, and maintains a database of the highest-performing implementations.
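
For reference, a naive power-law seed in the spirit of `init_program.py` might look like the following (a sketch under assumed data shapes, not the shipped baseline):

```python
import numpy as np
from scipy.optimize import minimize

# EVOLVE-BLOCK-START
def scaling_law_func(data_points, params):
    # Saturating power law L(x) = c + a * x^(-b) on the first feature column
    x = np.asarray(data_points, dtype=float).reshape(len(data_points), -1)[:, 0]
    a, b, c = params
    return c + a * np.power(x, -b)

def fit_scaling_law(data_points, loss_values):
    # Least-squares fit with BFGS from a fixed initialization
    def objective(p):
        pred = scaling_law_func(data_points, p)
        return float(np.mean((pred - loss_values) ** 2))
    return minimize(objective, x0=np.array([1.0, 0.5, 1.0]), method="BFGS").x
# EVOLVE-BLOCK-END
```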

## Requirements

```bash
pip install datasets numpy scipy
# Ensure the latest version of openevolve is installed
```

## Citation

If you use SLDBench, this example, or derived results in your work, please cite the original paper:

```bibtex
@article{lin2025sldbench,
title = {Can Language Models Discover Scaling Laws?},
author = {Lin, Haowei and Ye, Haotian and Feng, Wenzheng and Huang, Quzhe and
Li, Yujun and Lim, Hubert and Li, Zhengrui and Wang, Xiangyu and
Ma, Jianzhu and Liang, Yitao and Zou, James},
journal = {arXiv preprint arXiv:2507.21184},
year = {2025}
}
```
88 changes: 88 additions & 0 deletions examples/sldbench/configs/data_constrained_scaling_law.yaml
@@ -0,0 +1,88 @@
# Configuration for data constrained scaling law discovery with OpenEvolve
max_iterations: 50
checkpoint_interval: 1
log_level: "INFO"
random_seed: 42

# LLM configuration
llm:
primary_model: null
primary_model_weight: 1.0
secondary_model: null
secondary_model_weight: 0.0
api_base: ""
max_tokens: 16384
timeout: 240
retries: 10
retry_delay: 10

# Prompt configuration
prompt:
system_message: |
You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between training data characteristics and model loss under data-constrained conditions.

**IMPORTANT: The scaling law function must use no more than 7 parameters.**

Focus on mathematical accuracy across different data scales, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.

**DATA CHARACTERISTICS (182 total data points):**
- Features: [unique_tokens, params, tokens] - 3D input
- Labels: loss - scalar output
- Dataset size: 161
    - Parameter range (P): 1.1e8 to 1.1e9 (110M to 1.1B parameters)
- Token count range (D): 1e9 to 1e12 tokens
- Unique tokens range: 1e7 to 5e8 unique tokens
- Loss range: 1.8 to 7.2 (cross-entropy loss)
- Model architectures: Transformer variants with different parameterizations
- Data explores scaling under token/unique-token constraints

The function signatures must remain:

```python
    def scaling_law_func(data_points, params):
        # data_points: (N,3) array with columns [unique_tokens, params, tokens]
        #   unique_tokens: number of unique training tokens
        #   params: model parameter count
        #   tokens: total training token count
        # params (second argument): array of up to 7 fitted parameters
        # Returns: predicted loss values

def fit_scaling_law(data_points, loss_values):
# data_points: (N,3) array with columns [unique_tokens, params, tokens]
# loss_values: Array of corresponding loss values
# Returns: Optimized parameters (up to 7 parameters)
```

Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.

    You are not allowed to use input-dependent features in `scaling_law_func` (e.g., median / min / max of the inputs).

num_top_programs: 3
num_diverse_programs: 2
use_template_stochasticity: true

# Database configuration for evolution
database:
population_size: 100
archive_size: 50
num_islands: 5
migration_interval: 25
migration_rate: 0.1
elite_selection_ratio: 0.1
exploration_ratio: 0.2
exploitation_ratio: 0.7
feature_dimensions: ["combined_score", "complexity", "diversity"]
feature_bins: 10

# Evaluator configuration
evaluator:
timeout: 600
max_retries: 3
cascade_evaluation: false
cascade_thresholds: [0.3, 0.6]
parallel_evaluations: 4
use_llm_feedback: false

# Evolution settings
diff_based_evolution: false
max_code_length: 100000
85 changes: 85 additions & 0 deletions examples/sldbench/configs/domain_mixture_scaling_law.yaml
@@ -0,0 +1,85 @@
# Configuration for domain mixture scaling law discovery with OpenEvolve
max_iterations: 50
checkpoint_interval: 1
log_level: "INFO"
random_seed: 42

# LLM configuration
llm:
primary_model: null
primary_model_weight: 1.0
secondary_model: null
secondary_model_weight: 0.0
api_base: ""
max_tokens: 16384
timeout: 240
retries: 10
retry_delay: 10

# Prompt configuration
prompt:
system_message: |
You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between domain mixture proportions and multi-domain loss values across different model sizes.

**IMPORTANT: The scaling law function must use no more than 35 parameters.**

Focus on mathematical accuracy across different model sizes, cross-domain generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.

**DATA CHARACTERISTICS**
- Features: Domain proportions (5 domains) - array of shape (n_mixtures, 5)
- Labels: Multi-domain losses (5 domains) - array of shape (n_mixtures, 5)
- Dataset size: 80 training (20 per model size)
- Model parameter sizes: 70M, 160M, 410M, 1B parameters (4 separate groups)
- Domain proportions: Each row sums to 1.0 (mixture weights)
    - Loss ranges: per-domain cross-entropy losses span 1.8 to 4.2
- Mixture configurations: Systematic exploration of different domain weight combinations
- This is a multi-output regression problem with correlated domain performances

The function signatures must remain:

```python
    def scaling_law_func(data_points, params):
        # data_points: (N,5) array of domain mixture proportions (rows sum to 1.0)
        # params: array of up to 35 fitted parameters
        # Returns: predicted multi-domain loss values, shape (N,5)

def fit_scaling_law(data_points, loss_values):
# data_points: (N,5) array with domain proportions for 5 domains
# loss_values: Array of corresponding multi-domain losses (N,5)
# Returns: Optimized parameters (up to 35 parameters)
```

Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.

    You are not allowed to use input-dependent features in `scaling_law_func` (e.g., median / min / max of the inputs).

num_top_programs: 3
num_diverse_programs: 2
use_template_stochasticity: true

# Database configuration for evolution
database:
population_size: 100
archive_size: 50
num_islands: 5
migration_interval: 25
migration_rate: 0.1
elite_selection_ratio: 0.1
exploration_ratio: 0.2
exploitation_ratio: 0.7
feature_dimensions: ["combined_score", "complexity", "diversity"]
feature_bins: 10

# Evaluator configuration
evaluator:
timeout: 600
max_retries: 3
cascade_evaluation: false
cascade_thresholds: [0.3, 0.6]
parallel_evaluations: 4
use_llm_feedback: false

# Evolution settings
diff_based_evolution: false
max_code_length: 100000