
Commit 4a1d461

committed Jan 1, 2025
Faster work stealing iterator
1 parent 177fc32 commit 4a1d461

File tree

9 files changed (+183 -74 lines)

README.md

+8 -8
@@ -53,7 +53,7 @@ Place input files in `input/yearYYYY/dayDD.txt` including leading zeroes. For ex
 ## Performance
 
 Benchmarks are measured using the built-in `cargo bench` tool run on an [Apple M2 Max][apple-link].
-All 250 solutions from 2024 to 2015 complete sequentially in **585 milliseconds**.
+All 250 solutions from 2024 to 2015 complete sequentially in **584 milliseconds**.
 Interestingly 84% of the total time is spent on just 9 solutions.
 Performance is reasonable even on older hardware, for example a 2011 MacBook Pro with an
 [Intel i7-2720QM][intel-link] processor takes 3.5 seconds to run the same 225 solutions.
@@ -62,7 +62,7 @@ Performance is reasonable even on older hardware, for example a 2011 MacBook Pro
 
 | Year | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| Benchmark (ms) | 24 | 120 | 89 | 35 | 16 | 272 | 9 | 8 | 6 | 6 |
+| Benchmark (ms) | 24 | 120 | 89 | 35 | 16 | 272 | 9 | 8 | 6 | 5 |
 
 ## 2024
 
@@ -75,7 +75,7 @@ Performance is reasonable even on older hardware, for example a 2011 MacBook Pro
 | 3 | [Mull It Over](https://adventofcode.com/2024/day/3) | [Source](src/year2024/day03.rs) | 8 |
 | 4 | [Ceres Search](https://adventofcode.com/2024/day/4) | [Source](src/year2024/day04.rs) | 77 |
 | 5 | [Print Queue](https://adventofcode.com/2024/day/5) | [Source](src/year2024/day05.rs) | 18 |
-| 6 | [Guard Gallivant](https://adventofcode.com/2024/day/6) | [Source](src/year2024/day06.rs) | 386 |
+| 6 | [Guard Gallivant](https://adventofcode.com/2024/day/6) | [Source](src/year2024/day06.rs) | 331 |
 | 7 | [Bridge Repair](https://adventofcode.com/2024/day/7) | [Source](src/year2024/day07.rs) | 136 |
 | 8 | [Resonant Collinearity](https://adventofcode.com/2024/day/8) | [Source](src/year2024/day08.rs) | 8 |
 | 9 | [Disk Fragmenter](https://adventofcode.com/2024/day/9) | [Source](src/year2024/day09.rs) | 106 |
@@ -89,9 +89,9 @@ Performance is reasonable even on older hardware, for example a 2011 MacBook Pro
 | 17 | [Chronospatial Computer](https://adventofcode.com/2024/day/17) | [Source](src/year2024/day17.rs) | 2 |
 | 18 | [RAM Run](https://adventofcode.com/2024/day/18) | [Source](src/year2024/day18.rs) | 42 |
 | 19 | [Linen Layout](https://adventofcode.com/2024/day/19) | [Source](src/year2024/day19.rs) | 118 |
-| 20 | [Race Condition](https://adventofcode.com/2024/day/20) | [Source](src/year2024/day20.rs) | 1354 |
+| 20 | [Race Condition](https://adventofcode.com/2024/day/20) | [Source](src/year2024/day20.rs) | 1038 |
 | 21 | [Keypad Conundrum](https://adventofcode.com/2024/day/21) | [Source](src/year2024/day21.rs) | 19 |
-| 22 | [Monkey Market](https://adventofcode.com/2024/day/22) | [Source](src/year2024/day22.rs) | 1350 |
+| 22 | [Monkey Market](https://adventofcode.com/2024/day/22) | [Source](src/year2024/day22.rs) | 1216 |
 | 23 | [LAN Party](https://adventofcode.com/2024/day/23) | [Source](src/year2024/day23.rs) | 43 |
 | 24 | [Crossed Wires](https://adventofcode.com/2024/day/24) | [Source](src/year2024/day24.rs) | 23 |
 | 25 | [Code Chronicle](https://adventofcode.com/2024/day/25) | [Source](src/year2024/day25.rs) | 8 |
@@ -113,7 +113,7 @@ Performance is reasonable even on older hardware, for example a 2011 MacBook Pro
 | 9 | [Mirage Maintenance](https://adventofcode.com/2023/day/9) | [Source](src/year2023/day09.rs) | 18 |
 | 10 | [Pipe Maze](https://adventofcode.com/2023/day/10) | [Source](src/year2023/day10.rs) | 41 |
 | 11 | [Cosmic Expansion](https://adventofcode.com/2023/day/11) | [Source](src/year2023/day11.rs) | 12 |
-| 12 | [Hot Springs](https://adventofcode.com/2023/day/12) | [Source](src/year2023/day12.rs) | 440 |
+| 12 | [Hot Springs](https://adventofcode.com/2023/day/12) | [Source](src/year2023/day12.rs) | 387 |
 | 13 | [Point of Incidence](https://adventofcode.com/2023/day/13) | [Source](src/year2023/day13.rs) | 66 |
 | 14 | [Parabolic Reflector Dish](https://adventofcode.com/2023/day/14) | [Source](src/year2023/day14.rs) | 632 |
 | 15 | [Lens Library](https://adventofcode.com/2023/day/15) | [Source](src/year2023/day15.rs) | 84 |
@@ -183,7 +183,7 @@ Performance is reasonable even on older hardware, for example a 2011 MacBook Pro
 | 15 | [Chiton](https://adventofcode.com/2021/day/15) | [Source](src/year2021/day15.rs) | 2403 |
 | 16 | [Packet Decoder](https://adventofcode.com/2021/day/16) | [Source](src/year2021/day16.rs) | 6 |
 | 17 | [Trick Shot](https://adventofcode.com/2021/day/17) | [Source](src/year2021/day17.rs) | 7 |
-| 18 | [Snailfish](https://adventofcode.com/2021/day/18) | [Source](src/year2021/day18.rs) | 501 |
+| 18 | [Snailfish](https://adventofcode.com/2021/day/18) | [Source](src/year2021/day18.rs) | 404 |
 | 19 | [Beacon Scanner](https://adventofcode.com/2021/day/19) | [Source](src/year2021/day19.rs) | 615 |
 | 20 | [Trench Map](https://adventofcode.com/2021/day/20) | [Source](src/year2021/day20.rs) | 2066 |
 | 21 | [Dirac Dice](https://adventofcode.com/2021/day/21) | [Source](src/year2021/day21.rs) | 278 |
@@ -272,7 +272,7 @@ Performance is reasonable even on older hardware, for example a 2011 MacBook Pro
 | 8 | [Memory Maneuver](https://adventofcode.com/2018/day/8) | [Source](src/year2018/day08.rs) | 24 |
 | 9 | [Marble Mania](https://adventofcode.com/2018/day/9) | [Source](src/year2018/day09.rs) | 909 |
 | 10 | [The Stars Align](https://adventofcode.com/2018/day/10) | [Source](src/year2018/day10.rs) | 11 |
-| 11 | [Chronal Charge](https://adventofcode.com/2018/day/11) | [Source](src/year2018/day11.rs) | 1404 |
+| 11 | [Chronal Charge](https://adventofcode.com/2018/day/11) | [Source](src/year2018/day11.rs) | 1156 |
 | 12 | [Subterranean Sustainability](https://adventofcode.com/2018/day/12) | [Source](src/year2018/day12.rs) | 77 |
 | 13 | [Mine Cart Madness](https://adventofcode.com/2018/day/13) | [Source](src/year2018/day13.rs) | 382 |
 | 14 | [Chocolate Charts](https://adventofcode.com/2018/day/14) | [Source](src/year2018/day14.rs) | 24000 |

docs/pie-2024.svg

+11 -11

(image diff not rendered)

src/util/thread.rs

+133 -20
@@ -2,13 +2,18 @@
 //! [scoped](https://doc.rust-lang.org/stable/std/thread/fn.scope.html)
 //! threads equals to the number of cores on the machine. Unlike normal threads, scoped threads
 //! can borrow data from their environment.
+use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
 use std::thread::*;
 
+// Usually the number of physical cores.
+fn threads() -> usize {
+    available_parallelism().unwrap().get()
+}
+
 /// Spawn `n` scoped threads, where `n` is the available parallelism.
-pub fn spawn<F, T>(f: F)
+pub fn spawn<F>(f: F)
 where
-    F: FnOnce() -> T + Copy + Send,
-    T: Send,
+    F: Fn() + Copy + Send,
 {
     scope(|scope| {
         for _ in 0..threads() {
@@ -17,31 +22,139 @@ where
     });
 }
 
-/// Splits `items` into batches, one per thread. Items are assigned in a round robin fashion,
-/// to achieve a crude load balacing in case some items are more complex to process than others.
-pub fn spawn_batches<F, T, U>(mut items: Vec<U>, f: F)
+/// Spawns `n` scoped threads that each receive a
+/// [work stealing](https://en.wikipedia.org/wiki/Work_stealing) iterator.
+/// Work stealing is an efficient strategy that keeps each CPU core busy when some items take longer
+/// than others to process, used by popular libraries such as [rayon](https://github.com/rayon-rs/rayon).
+/// Processing at different rates also happens on many modern CPUs with
+/// [heterogeneous performance and efficiency cores](https://en.wikipedia.org/wiki/ARM_big.LITTLE).
+pub fn spawn_parallel_iterator<F, T>(items: &[T], f: F)
 where
-    F: FnOnce(Vec<U>) -> T + Copy + Send,
-    T: Send,
-    U: Send,
+    F: Fn(ParIter<'_, T>) + Copy + Send,
+    T: Sync,
 {
     let threads = threads();
-    let mut batches: Vec<_> = (0..threads).map(|_| Vec::new()).collect();
-    let mut index = 0;
+    let size = items.len().div_ceil(threads);
 
-    // Round robin items over each thread.
-    while let Some(next) = items.pop() {
-        batches[index % threads].push(next);
-        index += 1;
-    }
+    // Initially divide work as evenly as possible amongst each worker thread.
+    let workers: Vec<_> = (0..threads)
+        .map(|id| {
+            let start = (id * size).min(items.len());
+            let end = (start + size).min(items.len());
+            CachePadding::new(pack(start, end))
+        })
+        .collect();
+    let workers = workers.as_slice();
 
     scope(|scope| {
-        for batch in batches {
-            scope.spawn(move || f(batch));
+        for id in 0..threads {
+            scope.spawn(move || f(ParIter { id, items, workers }));
         }
     });
 }
 
-fn threads() -> usize {
-    available_parallelism().unwrap().get()
+pub struct ParIter<'a, T> {
+    id: usize,
+    items: &'a [T],
+    workers: &'a [CachePadding],
+}
+
+impl<'a, T> Iterator for ParIter<'a, T> {
+    type Item = &'a T;
+
+    fn next(&mut self) -> Option<&'a T> {
+        // First try taking from our own queue.
+        let worker = &self.workers[self.id];
+        let current = worker.increment();
+        let (start, end) = unpack(current);
+
+        // There are still items to process.
+        if start < end {
+            return Some(&self.items[start]);
+        }
+
+        // Steal from another worker, [spinlocking](https://en.wikipedia.org/wiki/Spinlock)
+        // until we acquire new items to process or there's nothing left to do.
+        loop {
+            // Find the worker with the most remaining items.
+            let available = self
+                .workers
+                .iter()
+                .filter_map(|other| {
+                    let current = other.load();
+                    let (start, end) = unpack(current);
+                    let size = end.saturating_sub(start);
+
+                    (size > 0).then_some((other, current, size))
+                })
+                .max_by_key(|t| t.2);
+
+            if let Some((other, current, size)) = available {
+                // Split the work items into two roughly equal piles.
+                let (start, end) = unpack(current);
+                let middle = start + size.div_ceil(2);
+
+                let next = pack(middle, end);
+                let stolen = pack(start + 1, middle);
+
+                // We could be preempted by another thread stealing or by the owning worker
+                // thread finishing an item, so check indices are still unmodified.
+                if other.compare_exchange(current, next) {
+                    worker.store(stolen);
+                    break Some(&self.items[start]);
+                }
+            } else {
+                // No work remaining.
+                break None;
+            }
+        }
+    }
+}
+
+/// Intentionally force alignment to 128 bytes to make a best effort attempt to place each atomic
+/// on its own cache line. This reduces contention and improves performance for common
+/// CPU caching protocols such as [MESI](https://en.wikipedia.org/wiki/MESI_protocol).
+#[repr(align(128))]
+pub struct CachePadding {
+    atomic: AtomicUsize,
+}
+
+/// Convenience wrapper methods around atomic operations. Both start and end indices are packed
+/// into a single atomic so that we can use the fastest and easiest to reason about `Relaxed`
+/// ordering.
+impl CachePadding {
+    #[inline]
+    fn new(n: usize) -> Self {
+        CachePadding { atomic: AtomicUsize::new(n) }
+    }
+
+    #[inline]
+    fn increment(&self) -> usize {
+        self.atomic.fetch_add(1, Relaxed)
+    }
+
+    #[inline]
+    fn load(&self) -> usize {
+        self.atomic.load(Relaxed)
+    }
+
+    #[inline]
+    fn store(&self, n: usize) {
+        self.atomic.store(n, Relaxed);
+    }
+
+    #[inline]
+    fn compare_exchange(&self, current: usize, new: usize) -> bool {
+        self.atomic.compare_exchange(current, new, Relaxed, Relaxed).is_ok()
+    }
+}
+
+#[inline]
+fn pack(start: usize, end: usize) -> usize {
+    (end << 32) | start
+}
+
+#[inline]
+fn unpack(both: usize) -> (usize, usize) {
+    (both & 0xffffffff, both >> 32)
 }
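
A quick standalone sanity check of the two tricks above. This is a sketch, not part of the commit: it mirrors the private `pack`/`unpack` helpers and the `CachePadding` layout, and assumes a 64-bit `usize` (the packing reserves 32 bits per index).

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicUsize;

// Local mirrors of the commit's private helpers, assuming 64-bit `usize`.
fn pack(start: usize, end: usize) -> usize {
    (end << 32) | start
}

fn unpack(both: usize) -> (usize, usize) {
    (both & 0xffffffff, both >> 32)
}

#[repr(align(128))]
struct CachePadding {
    #[allow(dead_code)]
    atomic: AtomicUsize,
}

fn main() {
    // Round trip: a queue currently holding items 3..10.
    let packed = pack(3, 10);
    assert_eq!(unpack(packed), (3, 10));

    // Adding 1 to the packed value bumps only the low 32 bits, so
    // `fetch_add(1, Relaxed)` claims an item with a single atomic add.
    assert_eq!(unpack(packed + 1), (4, 10));

    // The forced alignment gives each atomic its own 128-byte slot,
    // keeping workers on separate cache lines.
    assert_eq!(align_of::<CachePadding>(), 128);
    assert_eq!(size_of::<CachePadding>(), 128);
}
```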

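For orientation, here is a minimal usage sketch of the new entry point, matching the shape of the call sites migrated below. The `aoc` crate name and module path are assumptions based on this repository's layout.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

// Assumed import path, based on the `src/util/thread.rs` location.
use aoc::util::thread::spawn_parallel_iterator;

fn main() {
    let items: Vec<u64> = (1..=1000).collect();
    let total = AtomicU64::new(0);

    // Every scoped worker receives a `ParIter` view of the same slice.
    // Each item is handed out exactly once: fast threads steal leftover
    // ranges from slow ones instead of sitting idle.
    spawn_parallel_iterator(&items, |iter| {
        let partial: u64 = iter.map(|&n| n * n).sum();
        total.fetch_add(partial, Relaxed);
    });

    assert_eq!(total.into_inner(), (1..=1000u64).map(|n| n * n).sum::<u64>());
}
```
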
src/year2018/day11.rs

+6 -12
@@ -43,15 +43,10 @@ pub fn parse(input: &str) -> Vec<Result> {
     }
 
     // Use as many cores as possible to parallelize the search.
-    // Smaller sizes take more time so keep batches roughly the same effort so that some
-    // threads are not finishing too soon and waiting idle, while others are still busy.
-    // For example if there are 4 cores, then they will be assigned sizes:
-    // * 1, 5, 9, ..
-    // * 2, 6, 10, ..
-    // * 3, 7, 11, ..
-    // * 4, 8, 12, ..
+    // Smaller sizes take more time so use work stealing to keep all cores busy.
+    let items: Vec<_> = (1..301).collect();
     let shared = Shared { sat, mutex: Mutex::new(Vec::new()) };
-    spawn_batches((1..301).collect(), |batch| worker(&shared, batch));
+    spawn_parallel_iterator(&items, |iter| worker(&shared, iter));
     shared.mutex.into_inner().unwrap()
 }
 
@@ -65,10 +60,9 @@ pub fn part2(input: &[Result]) -> String {
     format!("{x},{y},{size}")
 }
 
-fn worker(shared: &Shared, batch: Vec<usize>) {
-    let result: Vec<_> = batch
-        .into_iter()
-        .map(|size| {
+fn worker(shared: &Shared, iter: ParIter<'_, usize>) {
+    let result: Vec<_> = iter
+        .map(|&size| {
             let (power, x, y) = square(&shared.sat, size);
             Result { x, y, size, power }
         })
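
For intuition on this change: the old scheme interleaved sizes round-robin, while the new iterator starts each worker on one contiguous chunk and relies on stealing to rebalance. A standalone sketch of the initial split, assuming 4 threads and mirroring the arithmetic in `spawn_parallel_iterator`:

```rust
fn main() {
    // The 300 square sizes searched by this solution.
    let items: Vec<usize> = (1..301).collect();
    let threads = 4; // illustrative core count

    // Same initial division as `spawn_parallel_iterator`.
    let size = items.len().div_ceil(threads);
    for id in 0..threads {
        let start = (id * size).min(items.len());
        let end = (start + size).min(items.len());
        // Prints: worker 0: sizes 1..=75, worker 1: sizes 76..=150, ...
        println!("worker {id}: sizes {}..={}", items[start], items[end - 1]);
    }
}
```

Smaller sizes take longer, so the worker starting on sizes 1..=75 holds the most expensive chunk; the others finish their cheap chunks early and steal halves of its remaining range rather than idling.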

src/year2021/day18.rs

+4 -5
@@ -83,18 +83,17 @@ pub fn part2(input: &[Snailfish]) -> i32 {
         }
     }
 
-    // Use as many cores as possible to parallelize the calculation,
-    // breaking the work into roughly equally size batches.
+    // Use as many cores as possible to parallelize the calculation.
     let shared = AtomicI32::new(0);
-    spawn_batches(pairs, |batch| worker(&shared, &batch));
+    spawn_parallel_iterator(&pairs, |iter| worker(&shared, iter));
     shared.load(Ordering::Relaxed)
 }
 
 /// Pair addition is independent so we can parallelize across multiple threads.
-fn worker(shared: &AtomicI32, batch: &[(&Snailfish, &Snailfish)]) {
+fn worker(shared: &AtomicI32, iter: ParIter<'_, (&Snailfish, &Snailfish)>) {
     let mut partial = 0;
 
-    for (a, b) in batch {
+    for (a, b) in iter {
        partial = partial.max(magnitude(&mut add(a, b)));
     }
 
src/year2023/day12.rs

+9 -7
@@ -137,29 +137,31 @@ pub fn parse(input: &str) -> Vec<Spring<'_>> {
 }
 
 pub fn part1(input: &[Spring<'_>]) -> u64 {
-    solve(input, 1)
+    solve(input.iter(), 1)
 }
 
 pub fn part2(input: &[Spring<'_>]) -> u64 {
-    // Use as many cores as possible to parallelize the calculation,
-    // breaking the work into roughly equally size batches.
+    // Use as many cores as possible to parallelize the calculation.
     let shared = AtomicU64::new(0);
-    spawn_batches(input.to_vec(), |batch| {
-        let partial = solve(&batch, 5);
+    spawn_parallel_iterator(input, |iter| {
+        let partial = solve(iter, 5);
         shared.fetch_add(partial, Ordering::Relaxed);
     });
     shared.load(Ordering::Relaxed)
 }
 
-pub fn solve(input: &[Spring<'_>], repeat: usize) -> u64 {
+pub fn solve<'a, I>(iter: I, repeat: usize) -> u64
+where
+    I: Iterator<Item = &'a Spring<'a>>,
+{
     let mut result = 0;
     let mut pattern = Vec::new();
     let mut springs = Vec::new();
     // Exact size is not too important as long as there's enough space.
     let mut broken = vec![0; 200];
     let mut table = vec![0; 200 * 50];
 
-    for (first, second) in input {
+    for (first, second) in iter {
         // Create input sequence reusing the buffers to minimize memory allocations.
         pattern.clear();
         springs.clear();
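
The interesting part of this change is the signature: `solve` is now generic over any `Iterator<Item = &Spring>`, so the sequential `part1` and the work-stealing `part2` share one code path. A minimal sketch of the same pattern, using a stand-in `u64` item type purely for illustration:

```rust
// Generic over any iterator of references, like the new `solve` signature.
fn total<'a, I>(iter: I) -> u64
where
    I: Iterator<Item = &'a u64>,
{
    iter.copied().sum()
}

fn main() {
    let input = [1u64, 2, 3];

    // Sequential call site, like `part1` passing `input.iter()`.
    assert_eq!(total(input.iter()), 6);

    // A `ParIter<'_, u64>` satisfies the same bound, since it also
    // implements `Iterator<Item = &u64>`; that is how `part2` reuses
    // this exact code path across worker threads.
}
```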

src/year2024/day06.rs

+3 -4
@@ -78,14 +78,13 @@ pub fn part2(grid: &Grid<u8>) -> usize {
     let shortcut = Shortcut::from(&grid);
     let total = AtomicUsize::new(0);
 
-    spawn_batches(path, |batch| worker(&shortcut, &total, &batch));
+    spawn_parallel_iterator(&path, |iter| worker(&shortcut, &total, iter));
     total.into_inner()
 }
 
-fn worker(shortcut: &Shortcut, total: &AtomicUsize, batch: &[(Point, Point)]) {
+fn worker(shortcut: &Shortcut, total: &AtomicUsize, iter: ParIter<'_, (Point, Point)>) {
     let mut seen = FastSet::new();
-    let result = batch
-        .iter()
+    let result = iter
         .filter(|(position, direction)| {
             seen.clear();
             is_cycle(shortcut, &mut seen, *position, *direction)

src/year2024/day20.rs

+4 -3
@@ -98,16 +98,17 @@ pub fn part2(time: &Grid<i32>) -> u32 {
         }
     }
 
+    // Use as many cores as possible to parallelize the remaining search.
     let total = AtomicU32::new(0);
-    spawn_batches(items, |batch| worker(time, &total, batch));
+    spawn_parallel_iterator(&items, |iter| worker(time, &total, iter));
     total.into_inner()
 }
 
-fn worker(time: &Grid<i32>, total: &AtomicU32, batch: Vec<Point>) {
+fn worker(time: &Grid<i32>, total: &AtomicU32, iter: ParIter<'_, Point>) {
     let mut cheats = 0;
 
     // (p1, p2) is the reciprocal of (p2, p1) so we only need to check each pair once.
-    for point in batch {
+    for &point in iter {
         for x in 2..21 {
             cheats += check(time, point, Point::new(x, 0));
         }

src/year2024/day22.rs

+5 -4
@@ -29,10 +29,11 @@ struct Exclusive {
 }
 
 pub fn parse(input: &str) -> Input {
-    let numbers = input.iter_unsigned().collect();
+    let numbers: Vec<_> = input.iter_unsigned().collect();
     let mutex = Mutex::new(Exclusive { part_one: 0, part_two: vec![0; 130321] });
 
-    spawn_batches(numbers, |batch| worker(&mutex, &batch));
+    // Use as many cores as possible to parallelize the remaining search.
+    spawn_parallel_iterator(&numbers, |iter| worker(&mutex, iter));
 
     let Exclusive { part_one, part_two } = mutex.into_inner().unwrap();
     (part_one, *part_two.iter().max().unwrap())
@@ -46,12 +47,12 @@ pub fn part2(input: &Input) -> u16 {
     input.1
 }
 
-fn worker(mutex: &Mutex<Exclusive>, batch: &[usize]) {
+fn worker(mutex: &Mutex<Exclusive>, iter: ParIter<'_, usize>) {
     let mut part_one = 0;
     let mut part_two = vec![0; 130321];
     let mut seen = vec![u16::MAX; 130321];
 
-    for (id, number) in batch.iter().enumerate() {
+    for (id, number) in iter.enumerate() {
         let id = id as u16;
 
         let zeroth = *number;
