@batch per=thread for i in Iter; ...; end</code></pre><p>These use at most one thread per physical core, or one thread per hardware (CPU) thread, respectively. One thread per core means fewer threads competing for the cache. On the other hand, if there are (for example) two hardware threads per physical core, using both gives the CPU two independent instruction streams feeding its execution units. When one stream alone isn't enough to make the most of out-of-order execution, this can increase total throughput.</p><p>Which option performs better depends on the workload, so if you're not sure, it may be worth benchmarking both.</p><p>LoopVectorization.jl currently uses at most one thread per physical core. Because there is some overhead to switching the number of threads used, <code>per=core</code> is <code>@batch</code>'s default, so that <code>Polyester.@batch</code> and <code>LoopVectorization.@tturbo</code> work well together by default.</p><p>Threads are not pinned to a given CPU core, and the total number of available threads is still governed by <code>--threads</code> or <code>JULIA_NUM_THREADS</code>.</p><p>You can pass both the <code>per=(core/thread)</code> and <code>minbatch=N</code> options at the same time, e.g.</p><pre><code class="nohighlight hljs">@batch per=thread minbatch=2000 for i in Iter; ...; end
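
# A minimal benchmarking sketch (an assumption, not part of Polyester's API):
# time the same loop with per=core and per=thread to see which suits your
# workload. The function names sq_core! and sq_thread! are hypothetical.
# Start Julia with e.g. `julia --threads=auto` so more than one thread is
# available; with a single thread both versions effectively run serially.
using Polyester

# Square each element of x into y, using at most one thread per physical core.
function sq_core!(y, x)
    @batch per=core for i in eachindex(y, x)
        y[i] = x[i]^2
    end
    return y
end

# Same loop, allowing one thread per hardware thread, combined with minbatch.
function sq_thread!(y, x)
    @batch per=thread minbatch=2000 for i in eachindex(y, x)
        y[i] = x[i]^2
    end
    return y
end

x = rand(10^6)
y1 = similar(x); y2 = similar(x)
sq_core!(y1, x); sq_thread!(y2, x)   # run once first so compilation is not timed
t_core   = @elapsed sq_core!(y1, x)   # a single timing is noisy; repeat for real comparisons
t_thread = @elapsed sq_thread!(y2, x)
println("per=core: ", t_core, " s, per=thread: ", t_thread, " s")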