
if I specify a compute_at() but no store_at(), is the default that store_at() always matches compute_at()? (vs "unspecified")

Correct. store_at defaults to the same as compute_at.
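A minimal sketch of what that means (function and variable names are illustrative, and exact API details such as the realize call may vary between Halide versions):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f"), g("g");
    Var x("x"), y("y");
    f(x, y) = x + y;
    g(x, y) = f(x, y) * 2;

    // Compute f per row of g. Because store_at defaults to the
    // compute_at level, f's storage is also allocated per row of g.
    f.compute_at(g, y);
    // Equivalent to spelling out the storage level explicitly:
    // f.store_at(g, y).compute_at(g, y);

    g.realize(64, 64);
    return 0;
}
```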

if I specify parallel() without any explicit split() on the same function, how are splits chosen for parallelism?

Every instance of the variable is run as its own task; there's no splitting done (effectively, the split factor is 1). In the default parallel runtime (posix_thread_pool) I tried running tasks in small batches, but the per-task overhead was low enough that it was better just to treat each instance of the var as its own job. That way different cores interleave their work more tightly and you get better L2/L3 usage. If you do something silly like parallelize across the innermost dimension (usually x), you get tiny jobs and cores trash each other's caches.
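As a sketch of the granularity tradeoff (names and sizes here are made up for illustration):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f");
    Var x("x"), y("y");
    f(x, y) = x + y;

    // Each value of y becomes its own task for the thread pool;
    // no implicit splitting is introduced.
    f.parallel(y);

    // To hand out coarser chunks of rows instead, split first, e.g.:
    //   Var yo("yo"), yi("yi");
    //   f.split(y, yo, yi, 16).parallel(yo);
    //
    // Parallelizing the innermost x instead would create one tiny
    // task per element, which is the case described above where
    // cores trash each other's caches.

    f.realize(1024, 1024);
    return 0;
}
```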

is split() guaranteed to handle non-even multiples of the split factor? e.g.

f.split(y, y, yi, 7); // f.height is not an even multiple of 7

am I risking a runtime out-of-bounds access error, or will this be handled under the covers? (probably worth documenting)

For a pure function, or a reduction initialization step, f.height must be at least 7, or you get an out of bounds error, but it need not be a multiple of 7. The last group of size 7 gets pushed backwards so that it doesn't compute off the end. I.e., for a split of size 4 computing a region of size 6, we compute the elements 0 1 2 3, and then 2 3 4 5.
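A sketch of the pure-function case (sizes are arbitrary):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f");
    Var x("x"), y("y"), yi("yi");
    f(x, y) = x + y;

    // Split y by 7 even though the realized height (10) is not a
    // multiple of 7. The last group of 7 rows is shifted backwards,
    // so rows near the boundary are computed twice rather than
    // reading or writing out of bounds.
    f.split(y, y, yi, 7);

    f.realize(16, 10);
    return 0;
}
```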

For a reduction update step, recomputing the same element multiple times can change the meaning, so instead the extent is rounded up to a multiple of 7.

For the purpose of bounds inference, a vectorize is just a split, so this behavior is what lets you vectorize x by four even when the width of the input is not a multiple of four.
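Likewise for vectorization (again just a sketch):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f");
    Var x("x"), y("y");
    f(x, y) = x + y;

    // Vectorizing x by 4 is a split of x by 4 whose inner loop
    // becomes a vector. A width of 18 is not a multiple of 4, so the
    // last vector is shifted back from the edge, just as above.
    f.vectorize(x, 4);

    f.realize(18, 10);
    return 0;
}
```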