The previous examples made use of double precision arrays. They may be built instead with single precision arrays by changing the command-line option -real-size 64 to -real-size 32 The non-vectorized versions of the loop execute only slightly faster the double precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register operates on four single precision data elements at once instead of two double precision data elements.
In the example with data alignment, you will need to set ROWBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, the directive !dir$ vector aligned will cause the program to fail.
This completes the tutorial for auto-vectorization, where you have seen how the compiler can optimize performance with various vectorization techniques.
Copyright © 2011, Intel Corporation. All rights reserved.