A statistically rigorous benchmarking harness for Dart. Provides median-based comparisons, coefficient of variation, proper warmup phases, and outlier-resistant measurements for reliable performance analysis.
The standard benchmark_harness package uses the mean (average) for measurements, which is sensitive to outliers from GC pauses, OS scheduling, and CPU throttling. This package uses the median as its primary metric, providing stable measurements even with occasional outliers.
Sample data with one GC pause: `[5.0, 5.1, 4.9, 5.0, 50.0]`

- Mean: 14.0 µs (skewed by the outlier)
- Median: 5.0 µs (an accurate representation)
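To see this concretely, the package's exported `mean` and `median` helpers (shown under the statistical utilities below) reproduce the numbers above; a minimal sketch:

```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  // One GC pause inflates the mean but barely moves the median.
  final samples = [5.0, 5.1, 4.9, 5.0, 50.0];
  print(mean(samples));   // 14.0
  print(median(samples)); // 5.0
}
```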
Features:

- Median-based comparisons: Robust against outliers
- Coefficient of variation (CV%): Know how reliable your measurements are
- Proper warmup: JIT compilation and cache warming before measurement
- Randomized ordering: Reduces systematic bias from CPU throttling
- Multiple samples: Statistical confidence, not single-shot measurements
- Detailed reporting: Full statistics with reliability assessment
Add to your pubspec.yaml:

```yaml
dev_dependencies:
  benchmark_harness_plus: ^1.0.0
```

Then define variants and run them:

```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  final benchmark = Benchmark(
    title: 'String Operations',
    variants: [
      BenchmarkVariant(
        name: 'concat',
        run: () => 'a' + 'b' + 'c',
      ),
      BenchmarkVariant(
        name: 'interpolation',
        run: () => '${'a'}${'b'}${'c'}',
      ),
    ],
  );

  final results = benchmark.run(log: print);
  printResults(results);
}
```

Output:
```
[String Operations] Warming up 2 variant(s)...
[String Operations] Collecting 10 sample(s)...
[String Operations] Done.

Variant       | median | mean | fastest | stddev | cv% | vs base
----------------------------------------------------------------
concat        |   0.42 | 0.43 |    0.40 |   0.02 | 4.7 |       -
interpolation |   0.38 | 0.39 |    0.36 |   0.01 | 3.2 |   1.11x

(times in microseconds per operation)
```
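In the `vs base` column, each variant's median is compared against the first variant's (the baseline, here concat): 0.42 / 0.38 ≈ 1.11, so interpolation is about 11% faster than the baseline in this run.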
Use predefined configurations or create custom ones:
```dart
// Quick feedback during development (less accurate)
Benchmark(..., config: BenchmarkConfig.quick);

// Standard benchmarking (default)
Benchmark(..., config: BenchmarkConfig.standard);

// Important performance decisions (more accurate)
Benchmark(..., config: BenchmarkConfig.thorough);

// Custom configuration
Benchmark(..., config: BenchmarkConfig(
  iterations: 5000,        // Iterations per sample
  samples: 15,             // Number of samples to collect
  warmupIterations: 1000,  // Warmup iterations before sampling
  randomizeOrder: true,    // Randomize variant order
));
```

CV% (the coefficient of variation: stddev divided by mean, times 100) normalizes variance across different scales. It tells you how reliable your measurements are:
| CV% | Reliability | Interpretation |
|---|---|---|
| < 10% | Excellent | Highly reliable, trust exact ratios |
| 10-20% | Good | Rankings are reliable |
| 20-50% | Moderate | Directional only, do not trust exact ratios |
| > 50% | Poor | Unreliable, measurement is mostly noise |
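As a worked example against the output above: the concat row has stddev 0.02 and mean 0.43, so CV% = 0.02 / 0.43 × 100 ≈ 4.7, squarely in the excellent band. The exported helpers compute the same thing; a minimal sketch (the sample values are made up for illustration):

```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  // Five illustrative per-operation times in microseconds.
  final samples = [0.42, 0.44, 0.41, 0.43, 0.45];
  // cv() is stddev / mean * 100; a low single-digit percentage here.
  print(cv(samples));
  print(reliabilityFromCV(cv(samples))); // excellent (CV% < 10)
}
```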
Each result also reports its reliability band directly:

```dart
final result = benchmark.run().first;
print('Reliability: ${result.reliability}'); // excellent, good, moderate, or poor
```

When reading results:

- Look at CV% first: If > 20%, treat comparisons as directional only
- Compare medians: This is your primary metric
- Check mean vs median: Large difference indicates outliers
- Look at the ratio: 1.42x means 42% faster than baseline
For deeper analysis and export:

```dart
import 'dart:io';

final results = benchmark.run();

// Detailed report for a single result
print(formatDetailedResult(results[0]));

// Compare two variants
final comparison = BenchmarkComparison(
  baseline: results[0],
  test: results[1],
);
print('Speedup: ${comparison.speedup.toStringAsFixed(2)}x');
print('Improvement: ${comparison.improvementPercent.toStringAsFixed(1)}%');
print('Reliable: ${comparison.isReliable}');

// Export as CSV
final csv = formatResultsAsCsv(results);
File('results.csv').writeAsStringSync(csv);
```

The package exports individual statistical functions for custom analysis:
```dart
import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

final samples = [10.0, 11.0, 9.5, 10.2, 10.1];

print('Mean: ${mean(samples)}');
print('Median: ${median(samples)}');
print('Stddev: ${stdDev(samples)}');
print('CV%: ${cv(samples)}');
print('Range: ${min(samples)} - ${max(samples)}');
print('Reliability: ${reliabilityFromCV(cv(samples))}');
```

Best practices:

- Use enough samples: Minimum 10, prefer 20 for important decisions
- Use enough iterations: Each sample should take at least 10 ms (see the sizing sketch after this list)
- Warm up properly: JIT needs time to optimize hot paths
- Report CV%: Always show measurement stability
- Use median for comparisons: More robust than mean
- Re-run when in doubt: If results seem surprising, verify with another run
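A rough way to size `iterations` from the 10 ms guideline, assuming a per-operation cost similar to the output above (the numbers are illustrative, not prescriptive):

```dart
// If one operation takes ~0.4 µs, a 10 ms sample needs
// 10,000 µs / 0.4 µs = 25,000 iterations.
final config = BenchmarkConfig(
  iterations: 25000,       // each sample now runs for roughly 10 ms
  samples: 20,             // "prefer 20 for important decisions"
  warmupIterations: 2500,  // illustrative; scale alongside iterations
  randomizeOrder: true,
);
```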
Common pitfalls:

- Sub-microsecond measurements: Inherently noisy; expect CV% > 50%
- First run bias: Always warm up before measuring
- Order effects: Randomize variant order across samples (enabled by default)
- Single sample: Never trust a single measurement
BENCHMARKING_GUIDE.md - In-depth explanation of:
- The statistical foundations behind each metric
- Benefits and downsides of mean, median, stddev, and CV%
- How to interpret results correctly
- What to do when measurements are unreliable
- How to choose the right configuration
MIGRATION_GUIDE.md - Migrating from benchmark_harness:
- Side-by-side code comparisons
- Step-by-step migration instructions
- Common migration patterns
- What you gain by switching
MIT License. See LICENSE file for details.