Using Automatic Vectorization

The automatic vectorizer (also called the auto-vectorizer) is a component of the Intel® compiler that automatically uses SIMD instructions from the Intel® Streaming SIMD Extensions (Intel® SSE, SSE2, SSE3, and SSE4 Vectorizing Compiler and Media Accelerators) and Supplemental Streaming SIMD Extensions (SSSE3) instruction sets, and from the Intel® Advanced Vector Extensions (Intel® AVX) instruction set. The vectorizer detects operations in the program that can be done in parallel and converts the sequential operations to parallel ones, so that a single SIMD instruction processes 2, 4, 8, or up to 16 elements at a time, depending on the data type.

So, what is vectorization? Vectorization is the process of converting an algorithm from a scalar implementation, which performs an operation on one pair of operands at a time, to a vector implementation, in which a single instruction operates on a vector (a series of adjacent values). SIMD instructions process multiple data elements in one instruction and make use of the 128-bit SIMD floating-point registers.

Automatic vectorization occurs when the Intel® Compiler generates packed SIMD instructions to unroll a loop. Because the packed instructions operate on more than one data element at a time, the loop executes more efficiently. This process is referred to as auto-vectorization only to emphasize that the compiler identifies and optimizes suitable loops on its own, without requiring any special action by you. However, it is useful to note that in some cases, certain keywords or directives may be applied in the code for auto-vectorization to occur.

Automatic vectorization is supported on IA-32 and Intel® 64 architectures.

Using the -vec (Linux* and Mac OS* X) or the /Qvec (Windows*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain other options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Vectorization Speed-up

Where does the vectorization speedup come from? Consider the following sample code fragment, where a, b, and c are integer arrays:

    for (i=0;i<=MAX;i++)
        c[i]=a[i]+b[i];

If vectorization is not enabled (that is, you compile using the /O1 or /Qvec- option), the compiler processes one element per iteration, leaving a lot of unused space in the SIMD registers, even though each register could hold three additional integers. If vectorization is enabled (you compile using /O2 or higher), the compiler may use that additional space to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (-O2) or higher.

Tip

To allow comparisons between vectorized and not-vectorized code, disable vectorization using the /Qvec- (Windows*) or -no-vec (Linux* or Mac OS* X) option; enable vectorization using the /O2 or -O2 option.

To get information about whether a loop was vectorized, enable generation of the vectorization report using the /Qopt-report:1 /Qopt-report-phase hpo (Windows) or -opt-report1 -opt-report-phase hpo (Linux or Mac OS X) option. You will get a one-line message for every loop that is vectorized, as follows:

> icl /Qvec-report1 MultArray.c
MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.

The source line number (92 in the above example) refers to either the beginning or the end of the loop.

So, how significant is the performance enhancement? To evaluate performance enhancement yourself, run example1:

  1. Open an Intel® Compiler command line window. Source an environment script such as iccvars_intel64.sh in the compiler bin/intel64 directory, or iccvars_ia32.sh in the bin/ia32 directory, as appropriate.
  2. Navigate to the \example1 directory. The small application multiplies a vector by a matrix using the following loop:
    for (j = 0;j < size2; j++) {
          b[i] += a[i][j] * x[j];
    }
  3. Build and run the application, first without enabling auto-vectorization. Note the time taken for the application to run. On Linux* and Mac OS* X platforms, enter:
    icc -O2 -no-vec  MultArray.c -o NoVectMult
    ./NoVectMult
  4. Now build and run the application, this time enabling auto-vectorization. Note the time taken for the application to run. On Linux* and Mac OS* X platforms, enter:
    icc -O2 -vec-report1 MultArray.c -o VectMult
    ./VectMult

When you compare the timing of the two runs, you may see that the vectorized version runs faster. The time for the non-vectorized version is only slightly better than the time obtained by compiling with the /O1 or the -O1 option.

Obstacles to Vectorization

The following do not always prevent vectorization, but frequently either prevent it or cause the compiler to decide that vectorization would not be worthwhile.

Helping the Intel® Compiler to Vectorize

Sometimes the compiler has insufficient information to decide to vectorize a loop. There are several ways to provide additional information to the compiler:



Copyright © 1996-2011, Intel Corporation. All rights reserved.