Efficientcodesnippets draft in progress: useful snippets for the D newcomer on writing efficient code
Thanks to Adam D Ruppe, Bbaz, Ola Fosheim Grostad
Foreach vs for
foreach is just syntax sugar over a for loop. If there are any allocations, it is because your code had some; they aren't inherent to the loop. The language documentation even lists the translation of foreach to for explicitly for the case of ranges.
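As a rough illustration of that translation, the sketch below writes the same loop twice: once with foreach over a range and once as the hand-written for loop that uses the range primitives empty, front and popFront. The function names sumForeach and sumFor are made up for this example, and the @nogc attribute assumes that the range primitives of an integer iota are inferred as @nogc.

```d
import std.range : iota;

// foreach over a range; @nogc makes the compiler reject any GC allocation here.
@nogc int sumForeach()
{
    int total = 0;
    foreach (e; iota(0, 5))
        total += e;
    return total;
}

// The same loop written by hand with the range primitives.
@nogc int sumFor()
{
    int total = 0;
    for (auto r = iota(0, 5); !r.empty; r.popFront())
    {
        auto e = r.front;
        total += e;
    }
    return total;
}

unittest
{
    // Both versions visit 0, 1, 2, 3, 4.
    assert(sumForeach() == sumFor());
}
```

If both functions compile under @nogc, neither loop form allocates on its own.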
The most likely source of an allocation is the delegate passed to a user-defined opApply. You can prevent that by declaring it as opApply(scope your_delegate): the scope keyword prevents any closure allocation.
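A minimal sketch of such an opApply follows. The Countdown struct and the sumCountdown helper are hypothetical names invented for this example; only the scope delegate parameter is the point.

```d
struct Countdown
{
    int start;

    // `scope` promises that dg is not stored beyond this call, so the
    // caller's loop body does not need a heap-allocated closure.
    int opApply(scope int delegate(ref int) dg)
    {
        for (int i = start; i > 0; --i)
        {
            // A non-zero result means the loop body did break or return.
            if (auto result = dg(i))
                return result;
        }
        return 0;
    }
}

// The loop body captures `total`, which is what would otherwise
// require a closure if the delegate could escape.
int sumCountdown(int n)
{
    int total = 0;
    foreach (v; Countdown(n))
        total += v;
    return total;
}

unittest
{
    assert(sumCountdown(4) == 4 + 3 + 2 + 1);
}
```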
There are always significant optimization effects in long-running loops:
- SIMD
- cache locality / prefetching
For the former (SIMD) you need to make sure that good code is generated, either by hand, by using vectorized libraries, or by auto-vectorization.
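One way to hand the compiler and runtime a loop in a vectorization-friendly form is D's built-in array operations. The sketch below is only an illustration: scaleAndAdd is a made-up name, and whether SIMD instructions are actually emitted depends on the compiler, the flags and the target, so check the generated assembly before relying on it.

```d
// Computes dst[i] = a[i] * k + b[i] for every i as a single array operation
// instead of an explicit element-by-element loop.
void scaleAndAdd(float[] dst, const(float)[] a, const(float)[] b, float k)
{
    // Array operations need equal lengths and non-overlapping slices.
    assert(dst.length == a.length && a.length == b.length);
    dst[] = a[] * k + b[];
}

unittest
{
    auto dst = new float[4];
    scaleAndAdd(dst, [1f, 2, 3, 4], [10f, 20, 30, 40], 2);
    assert(dst == [12f, 24, 36, 48]);
}
```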
For the latter (cache) you need to make sure that the prefetcher can predict the access pattern, or is told to prefetch explicitly, and also that the working set is small enough to stay in the faster cache levels.
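To make the point about predictable access concrete, here is a small sketch that sums the same row-major matrix in two traversal orders. Both functions (hypothetical names) return the same value; the first walks memory sequentially, while the second jumps by cols elements on every step, which tends to be harder on the cache once the matrix no longer fits in it. No timings are claimed here; measure on your own hardware.

```d
// Sum of a rows x cols matrix stored row-major in one flat array.
// The inner loop touches consecutive addresses.
double sumRowOrder(const(double)[] m, size_t rows, size_t cols)
{
    double total = 0;
    foreach (r; 0 .. rows)
        foreach (c; 0 .. cols)
            total += m[r * cols + c];
    return total;
}

// Same data, but the inner loop strides by `cols` elements per step.
double sumColumnOrder(const(double)[] m, size_t rows, size_t cols)
{
    double total = 0;
    foreach (c; 0 .. cols)
        foreach (r; 0 .. rows)
            total += m[r * cols + c];
    return total;
}

unittest
{
    double[] m = [1, 2, 3, 4, 5, 6]; // 2 rows, 3 columns
    assert(sumRowOrder(m, 2, 3) == sumColumnOrder(m, 2, 3));
}
```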
If you want good performance you cannot ignore either of these, and you have to design the data structures and algorithms for them. Prefetching has to happen maybe 100 instructions before the actual load from memory, and aligned AVX loads and stores require 32-byte alignment and a data layout that fits the algorithm. On next-gen Xeon Skylake I think the alignment might go up to 64 bytes, and you have 512-bit-wide registers (so you can do eight 64-bit floating-point operations in parallel per core). The difference between issuing 1-4 ops and issuing 8-16 per time unit is noticeable...
And of course, the closer your code gets to the theoretical throughput of the CPU, the more critical it becomes not to wait for memory loads.
This is also a moving target...
opApply vs Range