Thanks to Adam D Ruppe, Bbaz, Ola Fosheim Grostad

Removed at the request of Bbaz. I shall let someone else take this
forward.
Foreach vs for
==============
foreach is just syntax sugar over a for loop. If there are any
allocations, it is because your code had some; they are not inherent
to the loop. The doc definition even lists the translation of
foreach to for in the case of ranges explicitly:

http://dlang.org/statement.html#ForeachStatement
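As a sketch of that translation, here is a foreach over a minimal hand-written input range next to the for loop it lowers to. The range type and variable names are illustrative (the spec uses a hidden variable conventionally written __r); note that neither form allocates anything.

```d
import std.stdio;

// A minimal input range (illustrative), counting i .. n.
struct Counter
{
    int i, n;
    @property bool empty() const { return i >= n; }
    @property int front() const { return i; }
    void popFront() { ++i; }
}

void main()
{
    // foreach over a range...
    foreach (x; Counter(0, 3))
        writeln(x);

    // ...is lowered to roughly this (the spec names the hidden
    // range copy __r; a plain name is used here):
    for (auto r = Counter(0, 3); !r.empty; r.popFront())
    {
        auto x = r.front;
        writeln(x);
    }
}
```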
The most likely allocation would be for a user-defined opApply
delegate, and you can prevent that by declaring it opApply(scope
your_delegate) - the scope word prevents any closure allocation.
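A minimal sketch of that pattern (the aggregate type and names are made up for illustration): marking the delegate parameter scope promises the delegate will not outlive the call, so the compiler does not need to heap-allocate a closure for the loop body.

```d
import std.stdio;

// Illustrative aggregate with a user-defined opApply.
struct Bag
{
    int[] values;

    // `scope` on the delegate parameter is what avoids the closure
    // allocation at the foreach call site.
    int opApply(scope int delegate(ref int) dg)
    {
        foreach (ref v; values)
            if (auto r = dg(v))
                return r;   // propagate break/early-return from the body
        return 0;
    }
}

void main()
{
    auto b = Bag([1, 2, 3]);
    int sum;
    foreach (ref v; b)   // the loop body becomes the scope delegate
        sum += v;
    writeln(sum);
}
```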
There are always significant optimization effects in long running
loops:
- SIMD
- cache locality / prefetching
For the former (SIMD) you need to make sure that good code is
generated, either by hand, by using vectorized libraries, or by
auto-vectorization.
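For the auto-vectorization route, a sketch of the kind of loop an optimizing compiler (e.g. LDC or GDC with optimizations on) can typically vectorize: unit-stride access over contiguous arrays, no branches or indirection in the body. The function name and shape are illustrative, not from the original text.

```d
// Illustrative scaled-add over contiguous slices: simple enough
// that an optimizing backend can emit SIMD code for the loop.
void axpy(float a, const(float)[] x, float[] y)
in (x.length == y.length)
{
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];    // unit stride, no aliasing tricks, no branches
}
```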
For the latter (cache) you need to make sure that the prefetcher is
able to predict, or is being told to prefetch explicitly, and also
that the working set is small enough to stay in the faster cache
levels.
If you want good performance you cannot ignore any of these, and you
have to design the data structures and algorithms for it. Prefetching
has to happen maybe 100 instructions before the actual load from
memory, and AVX requires byte alignment and a layout that fits the
algorithm. On the next-gen Xeon Skylake I think the alignment might
go up to 64 bytes, and you have 512-bit-wide registers (so you can do
8 64-bit floating point operations in parallel per core). The
difference between issuing 1-4 ops and issuing 8-16 per time unit is
noticeable...
And of course, the closer your code is to theoretical throughput in
the CPU, the more critical it becomes to not wait for memory loads.

This is also a moving target...
opApply vs Range
================
http://forum.dlang.org/post/mailman.942.1292183237.21107.digitalmars-d-learn@puremagic.com