
Johannes Lampel - Projects/bifurk

A simple SIMD application example using the nVec class to calculate f(n+1) = f(n) * (1 - f(n)) * u (where u is plotted on the x axis of the following picture and f(n) on the y axis), the so-called logistic growth function.
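For reference, a minimal scalar sketch of the iteration (not taken from bifurk.cpp; the start value and iteration count are arbitrary choices for illustration):

// Scalar sketch: iterate the logistic map f(n+1) = f(n) * (1 - f(n)) * u
// for a single growth rate u.
#include <cstdio>

int main()
{
    const float u = 3.7f;   // growth rate (x axis in the picture)
    float       f = 0.5f;   // start value f(0), chosen arbitrarily here

    for (int n = 0; n < 1000; ++n)
        f = f * (1.0f - f) * u;   // one iteration of the map

    std::printf("u = %f, f after 1000 iterations = %f\n", u, f);
    return 0;
}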
Since the iterations are identical for every u, a vector is initialized with a number of different u values, and then a group of adjacent pixel columns of the following picture is calculated at once, according to the dimensionality of the vector. Because most of these operations can be done inside the caches, so that the memory transfer rate to the RAM matters little, the calculation can be sped up by more than a factor of 3. Don't use too small vector sizes, since that would increase the overhead and reduce the speedup that SSE could otherwise deliver, but do keep the data inside your cache. The SOM Simulator, for example, had such big data sets that the SSE speedup wasn't noticeable, because the bottleneck was the memory transfer between the CPU and the RAM.
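The vectorization idea can be illustrated with plain SSE intrinsics; this is only a sketch, since the original code uses the nVec abstraction and larger vectors rather than raw intrinsics:

// Hedged sketch using SSE intrinsics instead of the nVec class: four adjacent
// u values (i.e. four adjacent pixel columns) are iterated in parallel.
#include <xmmintrin.h>
#include <cstdio>

int main()
{
    // Four neighbouring growth rates u, one per pixel column (values arbitrary).
    __m128 u = _mm_set_ps(3.58f, 3.57f, 3.56f, 3.55f);
    __m128 f = _mm_set1_ps(0.5f);           // same start value in all four lanes
    const __m128 one = _mm_set1_ps(1.0f);

    for (int n = 0; n < 1000; ++n)
    {
        // f = f * (1 - f) * u, computed on 4 floats at once
        f = _mm_mul_ps(_mm_mul_ps(f, _mm_sub_ps(one, f)), u);
    }

    float out[4];
    _mm_storeu_ps(out, f);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}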

The following picture shows the performance for different vector sizes, both for the usual FLPT (floating point) calculations and for those using SSE. The calculations used 32-bit floating point numbers; the system was a Pentium 4 at 2.6 GHz.
With one-dimensional vectors, the performance of the FLPT and the SSE version is the same, as expected. In the graph of the FLPT calculation we can see a significant drop between 2^12 and 2^14, i.e. at a vector size right above 2^12 * 4 byte = 16 kByte, which is the size of the P4's L1 data cache. The maximum of the SSE graph might have its reason in the page size of 4 kB. At a vector size of 2^9 * 4 byte we get a speedup of more than the theoretically possible factor of 4 (SSE performs 4 FLPT operations in parallel). The reason could be that I used _aligned_malloc for the SSE memory allocations, while the standard FLPT routines used the unaligned new; aligned allocation makes better use of cache lines.
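A minimal sketch of that allocation difference (not the original benchmark code; Visual C++ specific, since _aligned_malloc comes from <malloc.h>):

// 16-byte aligned memory allows aligned SSE loads/stores and keeps the
// vectors neatly placed with respect to cache lines.
#include <malloc.h>
#include <xmmintrin.h>

int main()
{
    const size_t size = 1 << 9;   // e.g. 2^9 floats = 2 kByte, fits in L1

    // Aligned allocation as used for the SSE buffers ...
    float *sse  = static_cast<float*>(_aligned_malloc(size * sizeof(float), 16));
    // ... versus a plain (not necessarily 16-byte aligned) new for the FLPT path.
    float *flpt = new float[size];

    for (size_t i = 0; i < size; ++i) { sse[i] = 0.5f; flpt[i] = 0.5f; }

    __m128 v = _mm_load_ps(sse);   // aligned load, requires 16-byte alignment
    _mm_store_ps(sse, v);

    delete[] flpt;
    _aligned_free(sse);
    return 0;
}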

Source: bifurk.cpp