diff --git a/examples/sldbench/README.md b/examples/sldbench/README.md
new file mode 100755
index 000000000..3bb56cbc9
--- /dev/null
+++ b/examples/sldbench/README.md
@@ -0,0 +1,162 @@
+# SLDBench — Scaling Law Discovery Benchmark
+
+## Introduction
+
+**SLDBench** is a benchmark for discovering scaling laws, originally introduced in the paper [*Can Language Models Discover Scaling Laws?*](https://arxiv.org/abs/2507.21184) by Lin et al. It aggregates over **5,000 LLM training experiments** from recent scaling-law literature into a unified dataset, hosted on the Hugging Face Hub at [`pkuHaowei/sldbench`](https://huggingface.co/datasets/pkuHaowei/sldbench).
+
+See also this [blog post](https://algorithmicsuperintelligence.ai/blog/openevolve-sldagent/) for a quick overview of how OpenEvolve and SLDBench fit together.
+
+## Overview
+
+SLDBench focuses on **discovery** rather than simple curve fitting. The agent must identify:
+
+- A **symbolic law** $f_\theta(x)$ (the functional form).
+- A **parameter fitting routine** that generalizes across multiple training scenarios.
+
+**Key Features:**
+
+- **Data Source:** All task data is pulled dynamically from the Hugging Face dataset.
+- **Extrapolation Evaluation:** Models are trained on smaller-scale runs and strictly evaluated on held-out, larger-scale configurations to test predictive capability.
+- **Evolutionary Loop:** OpenEvolve iteratively mutates and evaluates candidate implementations of `scaling_law_func(...)` (the symbolic law) and `fit_scaling_law(...)` (the optimizer).
+
+------
+
+## SLDBench Tasks
+
+There are currently 7 core scaling-law discovery tasks, each derived from real-world LLM experiments. Configuration files for these tasks are located in `examples/sldbench/configs/`.
+
+| **Task Name (Config)**           | **Scenario**                                 | **Inputs (X)**                                       | **Target (y)**                 |
+| -------------------------------- | -------------------------------------------- | ---------------------------------------------------- | ------------------------------ |
+| **parallel_scaling_law**         | Parallel / Best-of-N inference scaling.      | Model size $N$, Parallelism $P$                      | Loss $L(N, P)$                 |
+| **vocab_scaling_law**            | Vocabulary size vs. model/data scaling.      | Non-vocab size $N$, Vocab size $V$, Dataset size $D$ | Unigram-normalized loss $L$    |
+| **sft_scaling_law**              | Supervised Fine-Tuning (SFT).                | SFT dataset size $D$                                 | Fine-tuning loss $L(D)$        |
+| **domain_mixture_scaling_law**   | Multi-domain pre-training mixtures.          | Domain mixture proportions $r$                       | Per-domain losses $\{L_i(r)\}$ |
+| **moe_scaling_law**              | Mixture-of-Experts (MoE) scaling.            | Network size $N$, Experts $E$                        | Pre-training loss $L(N, E)$    |
+| **data_constrained_scaling_law** | Data-constrained pre-training regimes.       | Model size $N$, Dataset size $D$, Unique tokens $U$  | Loss $L(N, D, U)$              |
+| **lr_bsz_scaling_law**           | Joint Learning Rate / Batch Size (Step Law). | LR $l$, Batch size $b$, Dataset $D$, Model $N$       | Loss $L(l, b, D, N)$ & Optima  |
+
+> **Note:** A task named `easy_question_scaling_law` is also included for U-shape scaling studies, though it is not part of the current paper reference.
+
+------
+
+## File Structure
+
+- `configs/` — YAML configuration files defining data splits, features, targets, and evaluation settings for each task.
+- `data_loader.py` — Unified data loader for the `pkuHaowei/sldbench` Hugging Face dataset.
+- `evaluator.py` — The evaluation framework; handles data splitting (train/extrapolate) and metric computation.
+- `init_program.py` — The seed implementation (a power-law–style baseline) to jumpstart the evolutionary search.
+
+------
+
+## Usage
+
+### Configuration Prerequisites
+
+Before running any tasks, ensure your API key environment variable is set for your API provider (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).
+
+### Running Individual Tasks
+
+To run the evolutionary process for a specific task:
+
+```bash
+python openevolve-run.py \
+    examples/sldbench/init_program.py \
+    examples/sldbench/evaluator.py \
+    --config examples/sldbench/configs/sft_scaling_law.yaml \
+    --api-base "https://api.openai.com/v1" \
+    --iterations 50
+```
+
+To switch tasks, simply point the `--config` argument to a different YAML file found in `examples/sldbench/configs/`.
+
+### Automated Benchmark Script
+
+A complete benchmark script `run.sh` is provided for running multiple tasks across different models with parallelism:
+
+```bash
+cd examples/sldbench
+chmod +x run.sh
+./run.sh 4  # Run with parallelism degree of 4 per model
+```
+
+**Note**: Configure the `API_BASE` variable in the script (defaults to `https://api.openai.com/v1`) and ensure your API key environment variable is set.
+
+The script handles evolution and evaluation automatically, storing results in `./results/`.
+
+### Important: Test Set Evaluation
+
+**Note:** The `openevolve-run` command only evaluates programs on the **training set** during evolution. To compute final metrics on the **test set**, you must explicitly run:
+
+```bash
+python evaluator.py "path/to/generated_program.py"
+```
+
+The `evaluator.py` script, when run in `__main__` mode, computes metrics on the held-out extrapolation test set, which is the proper way to evaluate a discovered scaling law's predictive capability. It infers the task from the `EVAL_TASK_NAME` environment variable or, failing that, from a supported task name appearing in the program's directory or file name.
+
+------
+
+## Data Format & Evaluation
+
+Each task is formulated as a scaling-law discovery problem containing:
+
+1. **Features ($X$):** Input variables (e.g., $N, D, \text{LR}, \text{Batch Size}$).
+2. **Targets ($y$):** Performance metrics (typically training or validation loss).
+3. **Groups:** Control indices representing distinct experimental settings (e.g., different model architectures) that share the law *form* but require distinct fitted *parameters*.
+
+### The Evaluation Process
+
+1. **Splitting:** The evaluator partitions data into **training** and **extrapolation test** sets. The largest models or datasets are explicitly held out to mirror real-world forecasting needs.
+2. **Fitting:** The `fit_scaling_law` function optimizes parameters on the training portion for each group.
+3. **Scoring:** The fitted law is applied to the test set to compute the following metrics:
+
+- **NMSE:** Normalized Mean Squared Error
+- **NMAE:** Normalized Mean Absolute Error
+- **$R^2$:** Coefficient of Determination
+- **Combined Score:** A single scalar summary, computed as $1 / (1 + \text{NMSE})$.
+
+*Higher combined scores indicate superior extrapolation quality.*
+
+------
+
+## Evolution Markers
+
+OpenEvolve modifies code explicitly wrapped in evolution blocks. The agent evolves the symbolic form and the optimizer simultaneously:
+
+```python
+# EVOLVE-BLOCK-START
+def scaling_law_func(data_points, params):
+    # Returns predicted values given inputs and parameters
+    pass
+
+def fit_scaling_law(data_points, loss_values):
+    # Optimizes parameters to fit the scaling law
+    pass
+# EVOLVE-BLOCK-END
+```
+
+The system mutates these blocks, evaluates them via `evaluator.py`, and maintains a database of the highest-performing implementations.
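+
+For concreteness, here is a minimal sketch of what an evolved block might look like for the one-dimensional `sft_scaling_law` task. The saturating power law $L(D) = E + A \cdot D^{-\alpha}$ and the multi-start optimizer below are illustrative choices only, not the benchmark's reference solution:
+
+```python
+# EVOLVE-BLOCK-START
+import numpy as np
+from scipy.optimize import minimize
+
+def scaling_law_func(data_points, params):
+    # data_points: (N, 1) array with columns [data_size]
+    # params: [E, A, alpha] (irreducible loss, scale, decay exponent)
+    D = np.asarray(data_points, dtype=float).reshape(-1)
+    E, A, alpha = params
+    # Saturating power law: loss decays toward the floor E as data grows
+    return E + A * D ** (-alpha)
+
+def fit_scaling_law(data_points, loss_values):
+    X = np.asarray(data_points, dtype=float)
+    y = np.asarray(loss_values, dtype=float)
+
+    def objective(theta):
+        return np.mean((scaling_law_func(X, theta) - y) ** 2)
+
+    # Multi-start L-BFGS-B with bounds: the exponent makes the loss surface
+    # non-convex, so a single starting point can stall in a poor local minimum.
+    best = None
+    for alpha0 in (0.1, 0.5, 1.0):
+        res = minimize(objective, x0=[float(y.min()), 1.0, alpha0],
+                       method="L-BFGS-B",
+                       bounds=[(0.0, None), (1e-8, None), (1e-3, 2.0)])
+        if best is None or res.fun < best.fun:
+            best = res
+    return best.x
+# EVOLVE-BLOCK-END
+```
+
+This form uses only three parameters (under the task's limit of four), and the multiple starting points for `alpha` hedge against fitting failures on small per-group training sets.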
+
+## Requirements
+
+```bash
+pip install datasets numpy scipy
+# Ensure the latest version of openevolve is installed
+```
+
+## Citation
+
+If you use SLDBench, this example, or derived results in your work, please cite the original paper:
+
+```bibtex
+@article{lin2025sldbench,
+  title   = {Can Language Models Discover Scaling Laws?},
+  author  = {Lin, Haowei and Ye, Haotian and Feng, Wenzheng and Huang, Quzhe and
+             Li, Yujun and Lim, Hubert and Li, Zhengrui and Wang, Xiangyu and
+             Ma, Jianzhu and Liang, Yitao and Zou, James},
+  journal = {arXiv preprint arXiv:2507.21184},
+  year    = {2025}
+}
+```
diff --git a/examples/sldbench/configs/data_constrained_scaling_law.yaml b/examples/sldbench/configs/data_constrained_scaling_law.yaml
new file mode 100644
index 000000000..422968eaf
--- /dev/null
+++ b/examples/sldbench/configs/data_constrained_scaling_law.yaml
@@ -0,0 +1,88 @@
+# Configuration for data constrained scaling law discovery with OpenEvolve
+max_iterations: 50
+checkpoint_interval: 1
+log_level: "INFO"
+random_seed: 42
+
+# LLM configuration
+llm:
+  primary_model: null
+  primary_model_weight: 1.0
+  secondary_model: null
+  secondary_model_weight: 0.0
+  api_base: ""
+  max_tokens: 16384
+  timeout: 240
+  retries: 10
+  retry_delay: 10
+
+# Prompt configuration
+prompt:
+  system_message: |
+    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between training data characteristics and model loss under data-constrained conditions.
+
+    **IMPORTANT: The scaling law function must use no more than 7 parameters.**
+
+    Focus on mathematical accuracy across different data scales, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.
+
+    **DATA CHARACTERISTICS (182 total data points):**
+    - Features: [unique_tokens, params, tokens] - 3D input
+    - Labels: loss - scalar output
+    - Dataset size: 182 total (161 train)
+    - Parameter range (P): 1.1e8 to 1.1e9 (100M to 1.1B parameters)
+    - Token count range (D): 1e9 to 1e12 tokens
+    - Unique tokens range: 1e7 to 5e8 unique tokens
+    - Loss range: 1.8 to 7.2 (cross-entropy loss)
+    - Model architectures: Transformer variants with different parameterizations
+    - Data explores scaling under token/unique-token constraints
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,3) array with columns [unique_tokens, params, tokens]
+        #   unique_tokens: Array of unique token counts
+        #   params: Array of model parameter counts
+        #   tokens: Array of token counts
+        # params: Array of up to 7 fitted parameters
+        # Returns: Predicted loss values
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,3) array with columns [unique_tokens, params, tokens]
+        # loss_values: Array of corresponding loss values
+        # Returns: Optimized parameters (up to 7 parameters)
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+
+  num_top_programs: 3
+  num_diverse_programs: 2
+  use_template_stochasticity: true
+
+# Database configuration for evolution
+database:
+  population_size: 100
+  archive_size: 50
+  num_islands: 5
+  migration_interval: 25
+  migration_rate: 0.1
+  elite_selection_ratio: 0.1
+  exploration_ratio: 0.2
+  exploitation_ratio: 0.7
+  feature_dimensions: ["combined_score", "complexity", "diversity"]
+  feature_bins: 10
+
+# Evaluator configuration
+evaluator:
+  timeout: 600
+  max_retries: 3
+  cascade_evaluation: false
+  cascade_thresholds: [0.3, 0.6]
+  parallel_evaluations: 4
+  use_llm_feedback: false
+
+# Evolution settings
+diff_based_evolution: false
+max_code_length: 100000
\ No newline at end of file
diff --git a/examples/sldbench/configs/domain_mixture_scaling_law.yaml b/examples/sldbench/configs/domain_mixture_scaling_law.yaml
new file mode 100644
index 000000000..95624b167
--- /dev/null
+++ b/examples/sldbench/configs/domain_mixture_scaling_law.yaml
@@ -0,0 +1,85 @@
+# Configuration for domain mixture scaling law discovery with OpenEvolve
+max_iterations: 50
+checkpoint_interval: 1
+log_level: "INFO"
+random_seed: 42
+
+# LLM configuration
+llm:
+  primary_model: null
+  primary_model_weight: 1.0
+  secondary_model: null
+  secondary_model_weight: 0.0
+  api_base: ""
+  max_tokens: 16384
+  timeout: 240
+  retries: 10
+  retry_delay: 10
+
+# Prompt configuration
+prompt:
+  system_message: |
+    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between domain mixture proportions and multi-domain loss values across different model sizes.
+
+    **IMPORTANT: The scaling law function must use no more than 35 parameters.**
+
+    Focus on mathematical accuracy across different model sizes, cross-domain generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.
+
+    **DATA CHARACTERISTICS**
+    - Features: Domain proportions (5 domains) - array of shape (n_mixtures, 5)
+    - Labels: Multi-domain losses (5 domains) - array of shape (n_mixtures, 5)
+    - Dataset size: 80 training (20 per model size)
+    - Model parameter sizes: 70M, 160M, 410M, 1B parameters (4 separate groups)
+    - Domain proportions: Each row sums to 1.0 (mixture weights)
+    - Loss ranges: Domain losses span 1.8-4.2 cross-entropy loss
+    - Mixture configurations: Systematic exploration of different domain weight combinations
+    - This is a multi-output regression problem with correlated domain performances
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,5) array with domain proportions for 5 domains
+        # proportions: Array of domain mixture proportions
+        # params: Array of up to 35 parameters
+        # Returns: Predicted multi-domain loss values (N,5)
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,5) array with domain proportions for 5 domains
+        # loss_values: Array of corresponding multi-domain losses (N,5)
+        # Returns: Optimized parameters (up to 35 parameters)
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+
+  num_top_programs: 3
+  num_diverse_programs: 2
+  use_template_stochasticity: true
+
+# Database configuration for evolution
+database:
+  population_size: 100
+  archive_size: 50
+  num_islands: 5
+  migration_interval: 25
+  migration_rate: 0.1
+  elite_selection_ratio: 0.1
+  exploration_ratio: 0.2
+  exploitation_ratio: 0.7
+  feature_dimensions: ["combined_score", "complexity", "diversity"]
+  feature_bins: 10
+
+# Evaluator configuration
+evaluator:
+  timeout: 600
+  max_retries: 3
+  cascade_evaluation: false
+  cascade_thresholds: [0.3, 0.6]
+  parallel_evaluations: 4
+  use_llm_feedback: false
+
+# Evolution settings
+diff_based_evolution: false
+max_code_length: 100000
\ No newline at end of file
diff --git a/examples/sldbench/configs/easy_question_scaling_law.yaml b/examples/sldbench/configs/easy_question_scaling_law.yaml
new file mode 100755
index 000000000..f7a075819
--- /dev/null
+++ b/examples/sldbench/configs/easy_question_scaling_law.yaml
@@ -0,0 +1,78 @@
+# Configuration for easy question scaling law discovery with OpenEvolve
+max_iterations: 50
+checkpoint_interval: 1
+log_level: "INFO"
+random_seed: 42
+
+# LLM configuration
+llm:
+  api_base: ""
+  max_tokens: 16384
+  timeout: 240
+  retries: 10
+  retry_delay: 10
+
+# Prompt configuration
+prompt:
+  system_message: |
+    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between compute (FLOPs) and LLM performance on easy questions, which exhibits a characteristic U-shaped or double descent pattern.
+
+    **IMPORTANT: The scaling law function must use no more than 6 parameters to capture the U-shaped scaling pattern.**
+
+    Focus on mathematical accuracy across different benchmark tasks, cross-task generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability. The U-shaped pattern is critical: performance initially worsens with scale before improving again.
+
+    **DATA CHARACTERISTICS:**
+    - Features: [log_flops] - 1D input in log10 scale
+    - Labels: brier_score - scalar output (negative values, more negative = better)
+    - Dataset size: 516 total (389 train + 127 test)
+    - Log FLOPs range: [-0.9, 2.9] approximately (log10 of FLOPs in 1E21 units)
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,1) array with columns [log_flops]
+        # log_flops: Array of log10(FLOPs in 1E21 units)
+        # params: Array of up to 6 parameters to capture U-shaped pattern
+        # Returns: Predicted brier_score values (negative)
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,1) array with columns [log_flops]
+        # log_flops: Array of log10(FLOPs in 1E21 units)
+        # loss_values: Array of corresponding brier_score values
+        # Returns: Optimized parameters (up to 6 parameters)
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+ + num_top_programs: 3 + num_diverse_programs: 2 + use_template_stochasticity: true + +# Database configuration for evolution +database: + population_size: 100 + archive_size: 50 + num_islands: 5 + migration_interval: 25 + migration_rate: 0.1 + elite_selection_ratio: 0.1 + exploration_ratio: 0.2 + exploitation_ratio: 0.7 + feature_dimensions: ["combined_score", "complexity", "diversity"] + feature_bins: 10 + +# Evaluator configuration +evaluator: + timeout: 600 + max_retries: 3 + cascade_evaluation: false + cascade_thresholds: [0.3, 0.6] + parallel_evaluations: 4 + use_llm_feedback: false + +# Evolution settings +diff_based_evolution: false +max_code_length: 100000 diff --git a/examples/sldbench/configs/lr_bsz_scaling_law.yaml b/examples/sldbench/configs/lr_bsz_scaling_law.yaml new file mode 100644 index 000000000..9d8878605 --- /dev/null +++ b/examples/sldbench/configs/lr_bsz_scaling_law.yaml @@ -0,0 +1,92 @@ +# Configuration for learning rate scaling law discovery with OpenEvolve +max_iterations: 50 +checkpoint_interval: 1 +log_level: "INFO" +random_seed: 42 + +# LLM configuration +llm: + primary_model: null + primary_model_weight: 1.0 + secondary_model: null + secondary_model_weight: 0.0 + api_base: "" + max_tokens: 16384 + timeout: 240 + retries: 10 + retry_delay: 10 + +# Prompt configuration +prompt: + system_message: | + You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between learning rate, batch size, data size, model parameters and training loss. + + You are allowed to decide the number of parameters in the scaling law function. + + Focus on mathematical accuracy across different hyperparameter scales, cross-configuration generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability. 
+
+    **DATA CHARACTERISTICS (2702 total data points):**
+    - Features: [lr, bsz, data_size, non_embedding_param_size] - 4D input
+    - Labels: lm_loss - scalar output
+    - Dataset size: 2702 total
+    - Learning rate range: 2.44e-4 to 2.21e-2 (logarithmically spaced)
+    - Batch size range: 16 to 2048 (powers of 2)
+    - Data size range: 2.0e9 to 1.0e11 tokens (2B to 100B tokens)
+    - Parameter range: 6.00e7 to 1.07e9 (60M to 1.07B non-embedding parameters)
+    - Loss range: 2.1 to 3.7 cross-entropy loss
+    - Comprehensive hyperparameter sweep covering learning rate and batch size effects
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,4) array with columns [lr, bsz, data_size, non_embedding_param_size]
+        # lr: Array of learning rates
+        # bsz: Array of batch sizes
+        # data_size: Array of data sizes
+        # non_embedding_param_size: Array of non-embedding parameter sizes
+        # params: Array of fitted parameters
+        # Returns: Predicted lm loss values
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,4) array with columns [lr, bsz, data_size, non_embedding_param_size]
+        # loss_values: Array of corresponding lm loss values
+        # Returns: Optimized parameters
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+
+  num_top_programs: 3
+  num_diverse_programs: 2
+  use_template_stochasticity: true
+
+# Database configuration for evolution
+database:
+  population_size: 100
+  archive_size: 50
+  num_islands: 5
+  migration_interval: 25
+  migration_rate: 0.1
+  elite_selection_ratio: 0.1
+  exploration_ratio: 0.2
+  exploitation_ratio: 0.7
+  feature_dimensions: ["combined_score", "complexity", "diversity"]
+  feature_bins: 10
+
+# Evaluator configuration
+evaluator:
+  timeout: 600
+  max_retries: 3
+  cascade_evaluation: false
+  cascade_thresholds: [0.3, 0.6]
+  parallel_evaluations: 4
+  use_llm_feedback: false
+
+# Evolution settings
+diff_based_evolution: false
+max_code_length: 100000
\ No newline at end of file
diff --git a/examples/sldbench/configs/moe_scaling_law.yaml b/examples/sldbench/configs/moe_scaling_law.yaml
new file mode 100644
index 000000000..dd2617967
--- /dev/null
+++ b/examples/sldbench/configs/moe_scaling_law.yaml
@@ -0,0 +1,87 @@
+# Configuration for MoE scaling law discovery with OpenEvolve
+max_iterations: 50
+checkpoint_interval: 1
+log_level: "INFO"
+random_seed: 42
+
+# LLM configuration
+llm:
+  primary_model: null
+  primary_model_weight: 1.0
+  secondary_model: null
+  secondary_model_weight: 0.0
+  api_base: ""
+  max_tokens: 16384
+  timeout: 240
+  retries: 10
+  retry_delay: 10
+
+# Prompt configuration
+prompt:
+  system_message: |
+    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between MoE architecture parameters and validation loss.
+
+    **IMPORTANT: The scaling law function must use no more than 6 parameters.**
+
+    Focus on mathematical accuracy across different MoE configurations, generalization across expert counts and model sizes, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.
+
+    **DATA CHARACTERISTICS**
+    - Features: [num_experts, dense_parameter_count] - 2D input
+    - Labels: loss_validation - scalar output
+    - Dataset size: 193
+    - Number of experts: 1 to 64 experts (various configurations)
+    - Dense parameter count: 1e8 to 8e8 parameters (100M to 800M parameters)
+    - Validation loss range: 1.8 to 3.8 cross-entropy loss
+    - All data collected at training step 249000 for consistent comparison
+    - Includes both dense and MoE transformer architectures
+    - Explores trade-off between expert count and parameter efficiency
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,2) array with columns [num_experts, dense_parameter_count]
+        # num_experts: Array of expert counts
+        # dense_parameter_count: Array of dense parameter counts
+        # params: Array of up to 6 parameters
+        # Returns: Predicted validation loss values
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,2) array with columns [num_experts, dense_parameter_count]
+        # loss_values: Array of corresponding validation loss values
+        # Returns: Optimized parameters (up to 6 parameters)
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+
+  num_top_programs: 3
+  num_diverse_programs: 2
+  use_template_stochasticity: true
+
+# Database configuration for evolution
+database:
+  population_size: 100
+  archive_size: 50
+  num_islands: 5
+  migration_interval: 25
+  migration_rate: 0.1
+  elite_selection_ratio: 0.1
+  exploration_ratio: 0.2
+  exploitation_ratio: 0.7
+  feature_dimensions: ["combined_score", "complexity", "diversity"]
+  feature_bins: 10
+
+# Evaluator configuration
+evaluator:
+  timeout: 600
+  max_retries: 3
+  cascade_evaluation: false
+  cascade_thresholds: [0.3, 0.6]
+  parallel_evaluations: 4
+  use_llm_feedback: false
+
+# Evolution settings
+diff_based_evolution: false
+max_code_length: 100000
\ No newline at end of file
diff --git a/examples/sldbench/configs/parallel_scaling_law.yaml b/examples/sldbench/configs/parallel_scaling_law.yaml
new file mode 100755
index 000000000..dc8374d7b
--- /dev/null
+++ b/examples/sldbench/configs/parallel_scaling_law.yaml
@@ -0,0 +1,93 @@
+# Configuration for parallel scaling law discovery with OpenEvolve
+max_iterations: 50
+checkpoint_interval: 1
+log_level: "INFO"
+random_seed: 42
+
+# LLM configuration
+llm:
+  primary_model: null
+  primary_model_weight: 1.0
+  secondary_model: null
+  secondary_model_weight: 0.0
+  api_base: ""
+  max_tokens: 16384
+  timeout: 240
+  retries: 10
+  retry_delay: 10
+
+# Prompt configuration
+prompt:
+  system_message: |
+    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between model parameter count, parallel size, and language modeling loss. Here we apply `parallel_size` transformations to the input, execute forward passes of the model in parallel, and aggregate the `parallel_size` outputs. We call this method parallel scaling.
+
+    **IMPORTANT: The scaling law function must use no more than 4 parameters.**
+
+    Focus on mathematical accuracy across different parallel configurations, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.
+
+    **DATA CHARACTERISTICS**
+    - Features: [num_params, parallel_size] - 2D input
+    - Labels: loss - scalar output
+    - Groups: 'pile' and 'stack' datasets (18 samples each)
+    - Parameter range: 5.36e8 to 4.38e9 parameters (536M to 4.38B)
+    - Parallel sizes: [1, 2, 4] copies
+    - Loss range by group:
+      - 'pile': 1.7938 to 2.1113 (higher loss values)
+      - 'stack': 0.9906 to 1.1722 (lower loss values)
+    - Key observation: Increasing parallel_size decreases loss
+      - parallel_size=1: avg loss 1.9780 (pile), 1.0972 (stack)
+      - parallel_size=2: avg loss 1.9480 (pile), 1.0767 (stack)
+      - parallel_size=4: avg loss 1.9259 (pile), 1.0635 (stack)
+    - Experimental setup: Augment input with parallel_size copies, pass through LLM, aggregate responses
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,2) array with columns [num_params, parallel_size]
+        # num_params: Array of model parameter counts
+        # parallel_size: Array of parallel copies for input augmentation
+        # params: Array of up to 4 parameters
+        # Returns: Predicted loss values
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,2) array with columns [num_params, parallel_size]
+        # num_params: Array of model parameter counts
+        # parallel_size: Array of parallel copies for input augmentation
+        # loss_values: Array of corresponding loss values
+        # Returns: Optimized parameters (up to 4 parameters)
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+
+  num_top_programs: 3
+  num_diverse_programs: 2
+  use_template_stochasticity: true
+
+# Database configuration for evolution
+database:
+  population_size: 100
+  archive_size: 50
+  num_islands: 5
+  migration_interval: 25
+  migration_rate: 0.1
+  elite_selection_ratio: 0.1
+  exploration_ratio: 0.2
+  exploitation_ratio: 0.7
+  feature_dimensions: ["combined_score", "complexity", "diversity"]
+  feature_bins: 10
+
+# Evaluator configuration
+evaluator:
+  timeout: 600
+  max_retries: 3
+  cascade_evaluation: false
+  cascade_thresholds: [0.3, 0.6]
+  parallel_evaluations: 4
+  use_llm_feedback: false
+
+# Evolution settings
+diff_based_evolution: false
+max_code_length: 100000
\ No newline at end of file
diff --git a/examples/sldbench/configs/sft_scaling_law.yaml b/examples/sldbench/configs/sft_scaling_law.yaml
new file mode 100644
index 000000000..753d10108
--- /dev/null
+++ b/examples/sldbench/configs/sft_scaling_law.yaml
@@ -0,0 +1,87 @@
+# Configuration for rectified scaling law discovery with OpenEvolve
+max_iterations: 50
+checkpoint_interval: 1
+log_level: "INFO"
+random_seed: 42
+
+# LLM configuration
+llm:
+  primary_model: null
+  primary_model_weight: 1.0
+  secondary_model: null
+  secondary_model_weight: 0.0
+  api_base: ""
+  max_tokens: 16384
+  timeout: 240
+  retries: 10
+  retry_delay: 10
+
+# Prompt configuration
+prompt:
+  system_message: |
+    You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between data size and loss values in supervised fine-tuning across different model-dataset combinations.
+
+    **IMPORTANT: The scaling law function must use no more than 4 parameters.**
+
+    Focus on mathematical accuracy across different model architectures, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability.
+
+    **DATA CHARACTERISTICS:**
+    - Features: [sft_data_size] - 1D input
+    - Labels: sft_loss - scalar output
+    - Dataset size: 504 (12 per group)
+    - Data size range: 200 to 819,200 examples (14 exponentially-spaced sizes)
+    - Model parameter range: 1.24e8 to 1.3e9 parameters (124M to 1.3B parameters)
+    - Loss range: 1.7 to 4.9 cross-entropy loss
+    - Datasets: Flan, Gigaword, and Wikiword instruction-tuning datasets
+    - Model architectures: Various transformer-based language models
+    - 42 distinct (model, dataset) configuration groups for cross-generalization
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,1) array with columns [data_size]
+        # data_size: Array of data sizes (200 to 819200)
+        # params: Array of up to 4 parameters
+        # Returns: Predicted loss values
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,1) array with columns [data_size]
+        # data_size: Array of data sizes
+        # loss_values: Array of corresponding loss values
+        # Returns: Optimized parameters (up to 4 parameters)
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+ + num_top_programs: 3 + num_diverse_programs: 2 + use_template_stochasticity: true + +# Database configuration for evolution +database: + population_size: 100 + archive_size: 50 + num_islands: 5 + migration_interval: 25 + migration_rate: 0.1 + elite_selection_ratio: 0.1 + exploration_ratio: 0.2 + exploitation_ratio: 0.7 + feature_dimensions: ["combined_score", "complexity", "diversity"] + feature_bins: 10 + +# Evaluator configuration +evaluator: + timeout: 600 + max_retries: 3 + cascade_evaluation: false + cascade_thresholds: [0.3, 0.6] + parallel_evaluations: 4 + use_llm_feedback: false + +# Evolution settings +diff_based_evolution: false +max_code_length: 100000 \ No newline at end of file diff --git a/examples/sldbench/configs/vocab_scaling_law.yaml b/examples/sldbench/configs/vocab_scaling_law.yaml new file mode 100644 index 000000000..cb1a89e20 --- /dev/null +++ b/examples/sldbench/configs/vocab_scaling_law.yaml @@ -0,0 +1,93 @@ +# Configuration for vocab scaling law discovery with OpenEvolve +max_iterations: 50 +checkpoint_interval: 1 +log_level: "INFO" +random_seed: 42 + +# LLM configuration +llm: + primary_model: null + primary_model_weight: 1.0 + secondary_model: null + secondary_model_weight: 0.0 + api_base: "" + max_tokens: 16384 + timeout: 240 + retries: 10 + retry_delay: 10 + +# Prompt configuration +prompt: + system_message: | + You are an expert in scaling laws and machine learning who specializes in discovering and improving scaling law functions for different LLM training scenarios. Your task is to evolve both the `scaling_law_func` function (currently a naive power law) and the `fit_scaling_law` optimization algorithm (currently a naive BFGS) to better model the relationship between vocabulary size, non-vocabulary parameters, number of characters and Lossu (unigram-normalized language model loss). + + **IMPORTANT: The scaling law function must use no more than 7 parameters.** + + Focus on mathematical accuracy across different vocabulary configurations, cross-dataset generalization, parameter efficiency (simple forms that can be fitted with limited data), and numerical/theoretical stability. 
+
+    **DATA CHARACTERISTICS**
+    - Features: [non_vocab_parameters, vocab_size, num_characters] - 3D input
+    - Labels: unigram_normalized_loss - scalar output
+    - Dataset size: 1080
+    - Vocabulary size: 4096 to 96256 tokens (8 distinct sizes)
+    - Embedding dimension: 512 to 2048 dimensions (4 values)
+    - Character count: 1e8 to 5e12 characters (100M to 5T characters)
+    - Non-vocab parameters: 3.3e7 to 1.1e9 (33M to 1.1B parameters)
+    - FLOPs range: 1.3e16 to 4.4e20 operations
+    - Lossu range: -5.34 to -0.51 (Lossu measures improvement over a context-free unigram model; more negative = better)
+    - Explores vocabulary scaling trade-offs across parameter, data, and architecture dimensions
+
+    The function signatures must remain:
+
+    ```python
+    def scaling_law_func(data_points, params):
+        # data_points: (N,3) array with columns [non_vocab_parameters, vocab_size, num_characters]
+        # non_vocab_parameters: Array of non-vocabulary parameter counts
+        # vocab_size: Array of vocabulary sizes
+        # num_characters: Array of number of characters processed
+        # params: Array of up to 7 parameters
+        # Returns: Predicted Lossu values
+
+    def fit_scaling_law(data_points, loss_values):
+        # data_points: (N,3) array with columns [non_vocab_parameters, vocab_size, num_characters]
+        # non_vocab_parameters: Array of non-vocabulary parameter counts
+        # vocab_size: Array of vocabulary sizes
+        # num_characters: Array of number of characters processed
+        # loss_values: Array of corresponding Lossu values
+        # Returns: Optimized parameters (up to 7 parameters)
+    ```
+
+    Write all improvements between # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers.
+
+    You are not allowed to use input-dependent features in scaling_law_func (e.g., median / min / max of the inputs).
+
+  num_top_programs: 3
+  num_diverse_programs: 2
+  use_template_stochasticity: true
+
+# Database configuration for evolution
+database:
+  population_size: 100
+  archive_size: 50
+  num_islands: 5
+  migration_interval: 25
+  migration_rate: 0.1
+  elite_selection_ratio: 0.1
+  exploration_ratio: 0.2
+  exploitation_ratio: 0.7
+  feature_dimensions: ["combined_score", "complexity", "diversity"]
+  feature_bins: 10
+
+# Evaluator configuration
+evaluator:
+  timeout: 600
+  max_retries: 3
+  cascade_evaluation: false
+  cascade_thresholds: [0.3, 0.6]
+  parallel_evaluations: 4
+  use_llm_feedback: false
+
+# Evolution settings
+diff_based_evolution: false
+max_code_length: 100000
\ No newline at end of file
diff --git a/examples/sldbench/data_loader.py b/examples/sldbench/data_loader.py
new file mode 100644
index 000000000..d3589a104
--- /dev/null
+++ b/examples/sldbench/data_loader.py
@@ -0,0 +1,140 @@
+"""
+Unified data loading interface for scaling law discovery.
+
+Dynamically loads data from the Hugging Face Hub repository 'pkuHaowei/sldbench'.
+This approach centralizes data access and ensures consistency.
+"""
+import numpy as np
+import datasets
+from typing import Dict, Any, Tuple
+
+# --- Configuration ---
+
+HUB_REPO_ID = "pkuHaowei/sldbench"
+
+# Defines the schema for each task, mapping each task name to its feature and
+# target column names in the Hub dataset.
+TASK_SCHEMA_MAP = { + "data_constrained_scaling_law": { + "feature_names": ["unique_tokens", "params", "tokens"], + "target_name": "loss", + }, + "domain_mixture_scaling_law": { + "feature_names": [f"proportion_domain_{i+1}" for i in range(5)], + "target_name": [f"loss_domain_{i+1}" for i in range(5)], + }, + "lr_bsz_scaling_law": { + "feature_names": ["lr", "bsz", "data_size", "non_embedding_param_size"], + "target_name": "lm_loss", + }, + "moe_scaling_law": { + "feature_names": ["num_experts", "dense_parameter_count"], + "target_name": "loss_validation", + }, + "sft_scaling_law": { + "feature_names": ["sft_data_size"], + "target_name": "sft_loss", + }, + "vocab_scaling_law": { + "feature_names": ["non_vocab_parameters", "vocab_size", "num_characters"], + "target_name": "unigram_normalized_loss", + }, + "parallel_scaling_law": { + "feature_names": ["num_params", "parallel_size"], + "target_name": "loss" + }, + "easy_question_scaling_law": { + "feature_names": ["log_flops"], + "target_name": "brier_score", + } +} + +def load_data( + app_name: str, + train: bool = True, +) -> Dict[Any, Tuple[np.ndarray, np.ndarray]]: + """ + Unified data loading interface. Loads and processes data from Hugging Face Hub. + + Each task's dataset is grouped by a 'group' key. The function returns a + dictionary mapping each group key to a tuple of (features, labels). + - features (X): A numpy array of shape (n_samples, n_features). + - labels (y): A numpy array of shape (n_samples,) or (n_samples, n_targets). + + Args: + app_name: The name of the task (e.g., 'sft_scaling_law'). + train: If True, load training data; otherwise, load test data. + + Returns: + A dictionary containing the prepared data, structured by group. + """ + if app_name not in TASK_SCHEMA_MAP: + raise ValueError(f"Task '{app_name}' not found in TASK_SCHEMA_MAP. Available tasks: {list(TASK_SCHEMA_MAP.keys())}") + + split = 'train' if train else 'test' + schema = TASK_SCHEMA_MAP[app_name] + + try: + # Load the specific task dataset from the Hugging Face Hub + dataset = datasets.load_dataset(HUB_REPO_ID, name=app_name, split=split) + except Exception as e: + raise IOError(f"Failed to load dataset '{app_name}' with split '{split}' from '{HUB_REPO_ID}'. 
Reason: {e}") + + # Ensure target_name is a list for consistent processing + feature_names = schema["feature_names"] + target_names = schema["target_name"] + if not isinstance(target_names, list): + target_names = [target_names] + + processed_data = {} + + # The datasets are partitioned by a 'group' column + unique_groups = sorted(list(set(dataset['group']))) + + for group_key in unique_groups: + # Filter the dataset for the current group + group_data = dataset.filter(lambda example: example['group'] == group_key) + + # Extract features (X) and stack them into a single numpy array + X_list = [np.array(group_data[fname]) for fname in feature_names] + X = np.stack(X_list, axis=1) + + # Extract targets (y) + y_list = [np.array(group_data[tname]) for tname in target_names] + y_stacked = np.stack(y_list, axis=1) + + # Squeeze the last dimension if there is only one target + y = y_stacked.squeeze(axis=1) if y_stacked.shape[1] == 1 else y_stacked + + processed_data[group_key] = (X, y) + + return processed_data + +if __name__ == '__main__': + # Example of how to use the new loader + # The list of tasks is now derived directly from the schema map + ALL_TASKS = list(TASK_SCHEMA_MAP.keys()) + + for task in ALL_TASKS: + print(f"\n--- Testing '{task}' ---") + try: + # Load training data + train_data = load_data(task, train=True) + print(f"Successfully loaded training data from Hugging Face repo '{HUB_REPO_ID}'.") + + # Inspect the first group's shape + first_group_key = next(iter(train_data)) + X_train, y_train = train_data[first_group_key] + print(f"Train groups: {len(train_data)}. First group '{first_group_key}' shape: X={X_train.shape}, y={y_train.shape}") + + # Load test data + test_data = load_data(task, train=False) + if test_data: + first_test_key = next(iter(test_data)) + X_test, y_test = test_data[first_test_key] + print(f"Test groups: {len(test_data)}. First group '{first_test_key}' shape: X={X_test.shape}, y={y_test.shape}") + else: + print("Test data is empty.") + + except (ValueError, IOError, KeyError) as e: + print(f"Error loading data for task '{task}': {e}") \ No newline at end of file diff --git a/examples/sldbench/evaluator.py b/examples/sldbench/evaluator.py new file mode 100644 index 000000000..f4431cb42 --- /dev/null +++ b/examples/sldbench/evaluator.py @@ -0,0 +1,320 @@ +# -*- coding: utf-8 -*- +""" +Unified Evaluator for Scaling Law Discovery. +""" +import argparse +import concurrent.futures +import importlib.util +import os +import sys +import traceback +from pathlib import Path +from typing import Any, Dict, Union + +import numpy as np + +# Import data_loader from the same directory +from data_loader import load_data + +# --- Task Configuration --- +# A set of supported task names. The evaluator will infer which one to use. 
+SUPPORTED_TASKS = { + "sft_scaling_law", + "data_constrained_scaling_law", + "moe_scaling_law", + "vocab_scaling_law", + "domain_mixture_scaling_law", + "lr_bsz_scaling_law", + "parallel_scaling_law", + "easy_question_scaling_law" +} + +# --- Core Functions --- + +def get_failure_result(error_msg: str = "Evaluation failed or timed out.") -> Dict[str, Any]: + """Returns a standardized dictionary for failure cases.""" + return { + "nmse": 100000.0, + "nmae": 100000.0, + "r2": -1.0, + "combined_score": 0.0, + "error": error_msg, + } + +def run_with_timeout(func, args=(), kwargs={}, timeout_seconds: int = 600): + """Runs a function with a specified timeout, raising an exception on timeout.""" + with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor: + future = executor.submit(func, *args, **kwargs) + try: + return future.result(timeout=timeout_seconds) + except Exception as e: + print(f"Function {func.__name__} timed out or failed: {e}", file=sys.stderr) + raise + +def calculate_final_metrics( + predictions: np.ndarray, + true_values: np.ndarray, +) -> Dict[str, Any]: + """ + Calculates evaluation metrics, correctly handling multi-dimensional outputs. + + For multi-dimensional targets, metrics (NMSE, NMAE) are calculated for each + dimension separately and then averaged. The normalization factors (variance + and mean absolute deviation) are computed using only the test data. + + Args: + predictions: The model's predictions as a NumPy array. + true_values: The ground truth values from the test set as a NumPy array. + + Returns: + A dictionary containing aggregate and per-dimension metrics. + """ + # 1. Initial validation and type conversion + try: + pred = np.asarray(predictions, dtype=float) + true = np.asarray(true_values, dtype=float) + except (ValueError, TypeError): + return get_failure_result("Could not convert predictions or true values to float arrays.") + + # 2. Check for invalid values in predictions + if np.isnan(pred).any() or np.isinf(pred).any(): + return get_failure_result("Predictions contain NaN or Inf values.") + + # 3. Reshape 1D arrays to 2D column vectors for consistent processing + if true.ndim == 1: + true = true.reshape(-1, 1) + if pred.ndim == 1: + pred = pred.reshape(-1, 1) + + # 4. Final shape validation + if true.shape != pred.shape: + return get_failure_result(f"Shape mismatch: true values {true.shape} vs. predictions {pred.shape}.") + if true.size == 0: + return get_failure_result("Cannot evaluate on empty data.") + + # 5. Calculate per-dimension errors on the test set + test_mse_per_dim = np.mean((true - pred) ** 2, axis=0) + test_mae_per_dim = np.mean(np.abs(true - pred), axis=0) + + # 6. Calculate normalizers using the test set only + variance_per_dim = np.var(true, axis=0) + mean_abs_dev_per_dim = np.mean(np.abs(true - np.mean(true, axis=0)), axis=0) + + # 7. Calculate normalized metrics, avoiding division by zero + nmse_per_dim = np.divide(test_mse_per_dim, variance_per_dim, + out=np.full_like(test_mse_per_dim, np.inf), # Use np.inf where variance is zero + where=variance_per_dim > 1e-9) + nmae_per_dim = np.divide(test_mae_per_dim, mean_abs_dev_per_dim, + out=np.full_like(test_mae_per_dim, np.inf), # Use np.inf where MAD is zero + where=mean_abs_dev_per_dim > 1e-9) + + # 8. Calculate R^2 for each dimension + r2_per_dim = 1.0 - nmse_per_dim + + # 9. 
Average per-dimension metrics for final aggregate scores + nmse = np.mean(nmse_per_dim) + nmae = np.mean(nmae_per_dim) + # The standard definition of R^2 relates to the total variance, so it's 1 - (total MSE / total variance) + # which is equivalent to 1 - mean(nmse_per_dim) if variances are similar, but this is more direct. + r2 = 1.0 - nmse + + # 10. Compile the results dictionary + results = { + "nmse": float(nmse), + "nmae": float(nmae), + "r2": float(r2), + "combined_score": 1.0 / (1.0 + nmse), + } + + # 11. Add per-dimension metrics for multi-dimensional targets + if true.shape[1] > 1: + results["nmse_per_dim"] = nmse_per_dim.tolist() + results["nmae_per_dim"] = nmae_per_dim.tolist() + results["r2_per_dim"] = r2_per_dim.tolist() + + return results + + +def _import_program(program_path: str): + """Imports a Python module from a given file path.""" + spec = importlib.util.spec_from_file_location("scaling_law_module", program_path) + if spec is None or spec.loader is None: + raise ImportError(f"Could not create module spec from path: {program_path}") + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + return module + +def resolve_task_name(program_path: str) -> str: + """Infers the task name from environment variables or the file path.""" + env_task = os.getenv("EVAL_TASK_NAME") or os.getenv("SCALING_TASK_NAME") + if env_task and env_task in SUPPORTED_TASKS: + return env_task + + p = Path(program_path) + parts_to_check = [p.parent.name, p.stem] + for part in parts_to_check: + for task in SUPPORTED_TASKS: + if task in part: + return task + + raise ValueError( + "Could not resolve task_name. Set env var EVAL_TASK_NAME or " + f"ensure a supported task name (e.g., '{next(iter(SUPPORTED_TASKS))}') " + "is in the script's parent folder or file name." + ) + +# --- Evaluation Pipelines --- + +def evaluate_core( + program_path: str, + task_name: str, + use_test_data: bool = False, + fitted_params_map: Dict[Any, Any] = None, +) -> Dict[str, Union[float, Dict]]: + """ + Core evaluation logic: fits a model or evaluates it on test data. + """ + try: + program = _import_program(program_path) + fit_scaling_law = program.fit_scaling_law + scaling_law_func = program.scaling_law_func + + if not use_test_data: + # --- FIT on training data --- + train_data = load_data(task_name, train=True) + if not train_data: + return get_failure_result("No training data found.") + + new_fitted_params_map = {} + for key, (X_train, y_train) in train_data.items(): + params = run_with_timeout(fit_scaling_law, args=(X_train, y_train)) + new_fitted_params_map[key] = params + return {"fitted_params": new_fitted_params_map} + + else: + # --- EVALUATE on test data --- + if fitted_params_map is None: + return get_failure_result("fitted_params_map is required for evaluation.") + + test_data = load_data(task_name, train=False) + if not test_data: + return get_failure_result("No test data found.") + + all_predictions, all_true_values = [], [] + for key, (X_test, y_test) in test_data.items(): + if key not in fitted_params_map: + print(f"Warning: No params for test group '{key}'. 
Skipping.", file=sys.stderr) + continue + + params = fitted_params_map[key] + predictions = run_with_timeout(scaling_law_func, args=(X_test, params)) + all_predictions.append(np.asarray(predictions)) + all_true_values.append(np.asarray(y_test)) + + if not all_predictions: + return get_failure_result("No predictions were generated for the test set.") + + final_predictions = np.concatenate(all_predictions) + final_true_values = np.concatenate(all_true_values) + + return calculate_final_metrics( + final_predictions, + final_true_values, + ) + + except Exception as e: + traceback.print_exc(file=sys.stderr) + return get_failure_result(str(e)) + +def evaluate(program_path: str, verbose: bool = False) -> Dict[str, Any]: + """ + High-level, single-call evaluation function. + + This orchestrates the entire process: + 1. Infers the task name. + 2. Fits the model on training data. + 3. Evaluates the fitted model on test data. + 4. Returns a dictionary with final metrics and (optionally) fitted parameters. + + Args: + program_path: Path to the user's Python script with scaling law functions. + verbose: If True, include fitted parameters and task name in the result. + + Returns: + A dictionary containing the evaluation results. + """ + try: + task_name = resolve_task_name(program_path) + except ValueError as e: + return get_failure_result(str(e)) + + # 1. Fit on training data to get parameters + fit_result = evaluate_core(program_path, task_name, use_test_data=False) + if "fitted_params" not in fit_result: + error = fit_result.get("error", "Unknown fitting error.") + return get_failure_result(f"Fitting failed: {error}") + + fitted_params_map = fit_result["fitted_params"] + + # 2. Evaluate on test data using the fitted parameters + test_result = evaluate_core( + program_path, + task_name, + use_test_data=True, + fitted_params_map=fitted_params_map, + ) + + # 3. 
Combine results into a comprehensive output + if verbose: + test_result["fitted_params"] = fitted_params_map + test_result["task_name"] = task_name + return test_result + +# --- Script Entrypoint --- + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Unified Evaluator for Scaling Law Discovery.") + parser.add_argument("program_path", type=str, help="Path to the Python script with scaling law functions.") + args = parser.parse_args() + + if not os.path.exists(args.program_path): + print(f"Error: Path '{args.program_path}' does not exist.", file=sys.stderr) + sys.exit(1) + + print(f"--- Running Evaluation for Program: {args.program_path} ---") + final_results = evaluate(args.program_path, verbose=True) + + task_name = final_results.get('task_name', 'N/A') + print(f"Inferred Task: {task_name}") + + if "error" in final_results and final_results["error"]: + print("\n--- ⛔ EVALUATION FAILED ⛔ ---") + print(f"Error: {final_results['error']}") + sys.exit(1) + + print("\n--- ✅ Final Test Results (Aggregate) ---") + print(f" Normalized MSE (NMSE): {final_results.get('nmse', 'N/A'):.6f}") + print(f" Normalized MAE (NMAE): {final_results.get('nmae', 'N/A'):.6f}") + print(f" R-squared (R²): {final_results.get('r2', 'N/A'):.6f}") + print(f" Combined Score: {final_results.get('combined_score', 'N/A'):.6f}") + + # Print per-dimension metrics if they exist + if "nmse_per_dim" in final_results: + print("\n --- Per-Dimension Metrics ---") + nmse_vals = final_results["nmse_per_dim"] + nmae_vals = final_results["nmae_per_dim"] + r2_vals = final_results["r2_per_dim"] + for i, (nmse_d, nmae_d, r2_d) in enumerate(zip(nmse_vals, nmae_vals, r2_vals)): + print(f" Dim {i+1}: NMSE={nmse_d:.4f}, NMAE={nmae_d:.4f}, R²={r2_d:.4f}") + + params = final_results.get('fitted_params', {}) + if params: + print(f"\nFitted parameters for {len(params)} group(s):") + for key, val in params.items(): + param_val = np.asarray(val) + if param_val.size > 1: + param_str = np.array2string(param_val, precision=4, max_line_width=80, suppress_small=True) + else: + param_str = f"{param_val.item():.4f}" # Use .item() for single-element arrays + print(f" - Group '{key}': {param_str}") + print("--------------------------") \ No newline at end of file diff --git a/examples/sldbench/init_program.py b/examples/sldbench/init_program.py new file mode 100644 index 000000000..c2611a37a --- /dev/null +++ b/examples/sldbench/init_program.py @@ -0,0 +1,53 @@ +# EVOLVE-BLOCK-START +""" +Scaling law discovery for LLM finetuning scenarios +Initial program with a simple power law form that can be evolved +""" +import numpy as np +from scipy.optimize import minimize + +def scaling_law_func(data_points, params): + + X = np.atleast_2d(np.asarray(data_points)) # (N, F) + N, F = X.shape + params = np.asarray(params) + + if params.ndim == 1: + params = params[None, :] # (1, P) + T, P = params.shape + + coeffs = params[:, :F] # (T, F) + exponents = params[:, F:2*F] # (T, F) + bias = params[:, -1] # (T,) + + pred = (coeffs[None, :, :] * (X[:, None, :] ** exponents[None, :, :])).sum(axis=2) + bias[None, :] + + return pred[:, 0] if pred.shape[1] == 1 else pred + + +def fit_scaling_law(data_points, loss_values): + + X = np.atleast_2d(np.asarray(data_points)) # (N, F) + y = np.asarray(loss_values) + N, F = X.shape + P = 2 * F + 1 + + if y.ndim == 1: + y2d = y[:, None] + else: + y2d = y + T = y2d.shape[1] + + init = np.ones((T, P)) + + def objective(flat_params): + params = flat_params.reshape(T, P) + pred = scaling_law_func(X, params) # (N, T) 
+        pred = np.reshape(pred, y2d.shape)  # scaling_law_func returns (N,) when T == 1; restore (N, T) so (pred - y2d) does not broadcast to (N, N)
+ mse = np.mean((pred - y2d) ** 2) + return mse + + result = minimize(objective, init.ravel(), method='BFGS') + params_opt = result.x.reshape(T, P) if result.success else init + + return params_opt[0] if T == 1 else params_opt +# EVOLVE-BLOCK-END diff --git a/examples/sldbench/run.sh b/examples/sldbench/run.sh new file mode 100755 index 000000000..fa8b5c249 --- /dev/null +++ b/examples/sldbench/run.sh @@ -0,0 +1,160 @@ +#!/bin/bash + +# Exit immediately if a command exits with a non-zero status. +set -e + +## --- Configuration --- + +# IMPORTANT: Before running this script, ensure your API key environment variable +# is set (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY) + +# API base URL for LLM API calls +API_BASE="${API_BASE:-https://api.openai.com/v1}" + +# Set the maximum number of parallel jobs PER MODEL. +# Usage: ./run.sh 8 +# If no argument is given, it defaults to 4. +PARALLELISM_DEGREE=${1:-4} + +# Array of task configurations to run +tasks=( + # "sft_scaling_law" + # "data_constrained_scaling_law" + # "moe_scaling_law" + # "vocab_scaling_law" + # "domain_mixture_scaling_law" + # "lr_bsz_scaling_law" + # "parallel_scaling_law" + "easy_question_scaling_law" +) + +# Array of models to test +models=( + # "gpt-5" + # "claude-sonnet-4-20250514" + # "gemini-2.5-flash" + "o4-mini" +) + +# Base directory for results +RESULTS_BASE_DIR="./results" + +## --- Graceful Shutdown --- + +# This function is called when the script receives a signal to terminate (e.g., Ctrl+C). +cleanup() { + echo -e "\n🚨 Caught Ctrl+C. Terminating all background jobs..." + # Get the process IDs of all background jobs started by this script. + pids_to_kill=$(jobs -p) + if [ -n "$pids_to_kill" ]; then + # Terminate all the jobs. The '2>/dev/null' suppresses errors + # if a job has already finished. + kill $pids_to_kill 2>/dev/null + fi + echo "All background jobs terminated. Exiting now." + exit 1 +} + +# Set the trap. The 'cleanup' function will be called when the script +# receives an INT (interrupt, from Ctrl+C) or TERM signal. +trap cleanup INT TERM + +## --- Core Logic --- + +# This function encapsulates a single unit of work: one evolution and its evaluation. +run_single_job() { + local task_name=$1 + local model=$2 + local run=$3 + local run_id="run_${run}" + local output_dir="${RESULTS_BASE_DIR}/${task_name}/${model}/${run_id}" + local best_program_path="${output_dir}/best/best_program.py" + local best_eval_log_path="${output_dir}/best_eval.log" + + # --- Skip Condition --- + # A job is complete only if its log file exists AND is not empty. + # The '-s' check handles both missing and empty files. + if [ -s "$best_eval_log_path" ]; then + echo "--- Skipping: [Task: $task_name] [Model: $model] [Run: $run] (valid result exists) ---" + return + fi + + echo "--- Processing: [Task: $task_name] [Model: $model] [Run: $run] ---" + + # Ensure the output directory exists for subsequent steps. + mkdir -p "$output_dir" + + # --- Evolution Step --- + # Run the full evolution process only if the best program does not exist. + if [ ! -f "$best_program_path" ]; then + echo " -> No best program found. Evolving for ${task_name}/${model}/${run_id}..." + EVAL_TASK_NAME="$task_name" uv run openevolve-run \ + --config "configs/${task_name}.yaml" \ + init_program.py evaluator.py \ + --api-base "$API_BASE" \ + --primary-model "$model" \ + --output "$output_dir" + else + echo " -> Best program already exists. Skipping evolution." + fi + + # --- Evaluation Step --- + # After a potential evolution, check for the program again. 
If it exists, + # run (or re-run) the evaluation. + if [ -f "$best_program_path" ]; then + echo " -> Evaluating best program for ${task_name}/${model}/${run_id}..." + EVAL_TASK_NAME="$task_name" uv run python evaluator.py \ + "$best_program_path" \ + > "$best_eval_log_path" + else + # This case occurs if the evolution step was required but failed to produce a program. + echo " -> WARNING: Evolution failed. No best program found for ${task_name}/${model}/${run_id}. Evaluation skipped." + fi + + echo "--- Finished: [Task: $task_name] [Model: $model] [Run: $run] ---" +} + +## --- Job Orchestration --- + +echo "Starting benchmark with a parallelism degree of $PARALLELISM_DEGREE PER MODEL." +# Calculate and display the total number of jobs that will be executed +total_runs_per_config=3 +total_jobs=$((${#tasks[@]} * ${#models[@]} * total_runs_per_config)) +echo "Total jobs to run: $total_jobs" + +# Loop through each model and start a dedicated, parallelized process for it. +for model in "${models[@]}"; do + # Run the task processing for each model in a separate subshell `(...)` + # and launch that subshell in the background `&`. + # This ensures each model has its own independent job pool. + ( + echo "--- Launching tasks for model [$model] with inner parallelism of [$PARALLELISM_DEGREE] ---" + + # Loop through all tasks and runs for the current model + for task in "${tasks[@]}"; do + for run in $(seq 1 $total_runs_per_config); do + # Check the number of jobs running *within this subshell*. + # This enforces the per-model parallelism limit. + while [[ $(jobs -p | wc -l) -ge $PARALLELISM_DEGREE ]]; do + # Wait for any job *within this subshell* to complete. + wait -n + done + + # Launch the actual job in the background. + run_single_job "$task" "$model" "$run" & + done + done + + # After launching all jobs for this model, wait for them to finish + # before the subshell exits. + wait + echo "--- ✅ All tasks for model [$model] are complete. ---" + ) & +done + +# Final wait at the main script level to ensure all backgrounded model-processing +# subshells have completed. +echo "All model task groups launched. Waiting for all models to complete..." +wait + +echo "✅ All tasks completed!"