The automatic vectorizer (also called the auto-vectorizer) is a component of the Intel® compiler that automatically uses SIMD instructions in the Intel® Streaming SIMD Extensions (Intel® SSE, SSE2, SSE3 and SSE4 Vectorizing Compiler and Media Accelerators), the Supplemental Streaming SIMD Extensions (SSSE3) instruction sets, and the Intel® Advanced Vector Extensions instruction set. The vectorizer detects operations in the program that can be done in parallel, and then converts the sequential operations into parallel ones: for example, one SIMD instruction that processes 2, 4, 8 or up to 16 elements in parallel, depending on the data type.
So, what is vectorization? Vectorization is the process of converting an algorithm from a scalar implementation, which performs an operation on one pair of operands at a time, to a vector implementation, in which a single instruction can refer to a vector (a series of adjacent values). SIMD instructions operate on multiple data elements in one instruction and make use of the 128-bit SIMD floating-point registers.
Automatic vectorization occurs when the Intel® Compiler generates packed SIMD instructions to unroll a loop. Because the packed instructions operate on more than one data element at a time, the loop executes more efficiently. This process is referred to as auto-vectorization only to emphasize that the compiler identifies and optimizes suitable loops on its own, without requiring any special action by you. Note, however, that in some cases certain keywords or directives may need to be applied in the code for auto-vectorization to occur.
Automatic vectorization is supported on IA-32 and Intel® 64 architectures.
Using the -vec (Linux and Mac OS* X) or the /Qvec (Windows*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).
Where does the vectorization speedup come from? Consider the following sample code fragment, where a, b and c are integer arrays:
for (i=0;i<=MAX;i++)
c[i]=a[i]+b[i];
If vectorization is not enabled (that is, you compile using the /O1 or /Qvec- option), the compiler processes one element per iteration and leaves most of each SIMD register unused, even though each register could hold three additional integers. If vectorization is enabled (you compile using /O2 or higher), the compiler may use that additional register space to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (-O2) or higher.
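To make "four additions in a single instruction" concrete, the sketch below writes by hand roughly what the vectorizer does for the loop above. It assumes an x86 target with SSE2 and uses the intrinsics from emmintrin.h; the function names add_scalar and add_sse2 are illustrative only, and the code the compiler actually generates may differ (alignment handling, unrolling, wider registers).

```c
#include <stddef.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 integer intrinsics */
#define HAVE_SSE2 1
#endif

/* Scalar version: one 32-bit addition per iteration. */
void add_scalar(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hand-vectorized version: four 32-bit additions per SIMD instruction. */
void add_sse2(const int *a, const int *b, int *c, size_t n) {
    size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads: no assumption is made about array alignment. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Remainder loop (or the whole loop when SSE2 is not available). */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions produce the same result; the second one simply needs about a quarter of the add instructions for the bulk of the array.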
To allow comparisons between vectorized and non-vectorized code, disable vectorization using the /Qvec- (Windows*) or -no-vec (Linux* or Mac OS* X) option; enable vectorization using the /O2 or -O2 option.
To find out whether a loop was vectorized, enable generation of the vectorization report using the /Qopt-report:1 /Qopt-report-phase hpo (Windows) or -opt-report1 -opt-report-phase hpo (Linux or Mac OS X) option. You will get a one-line message for every loop that is vectorized, as follows:
> icl /Qvec-report1 MultArray.c
MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.
The source line number (92 in the above example) refers to either the beginning or the end of the loop.
So, how significant is the performance enhancement? To evaluate performance enhancement yourself, run example1:
for (j = 0;j < size2; j++) {
b[i] += a[i][j] * x[j];
}
icc -O2 -no-vec MultArray.c -o NoVectMult
./NoVectMult
icc -O2 -vec-report1 MultArray.c -o VectMult
./VectMult
When you compare the timing of the two runs, you may see that the vectorized version runs faster. The time for the non-vectorized version is only slightly faster than would be obtained by compiling with the /O1 or the -O1 option.
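If MultArray.c is not at hand, a comparable measurement can be set up with a small stand-in. The sketch below is not the original example file: the array sizes and the names matvec, init_inputs and time_matvec are assumptions chosen for illustration, and clock() measures CPU time only coarsely.

```c
#include <time.h>

#define ROWS 512
#define COLS 512

static double a[ROWS][COLS], b[ROWS], x[COLS];

/* Matrix-vector product; the inner loop has the same form as the
   example loop: b[i] += a[i][j] * x[j]. */
void matvec(void) {
    for (int i = 0; i < ROWS; i++) {
        b[i] = 0.0;
        for (int j = 0; j < COLS; j++)
            b[i] += a[i][j] * x[j];
    }
}

/* Fill the inputs with simple values so the result is easy to check. */
void init_inputs(void) {
    for (int j = 0; j < COLS; j++)
        x[j] = 2.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = 1.0;
}

/* Run matvec `repeats` times and return the elapsed CPU time in seconds. */
double time_matvec(int repeats) {
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        matvec();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling init_inputs() and then printing time_matvec(1000) from a program built once with -O2 -no-vec and once with -O2 gives the two timings to compare.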
The following do not always prevent vectorization, but frequently either prevent it or cause the compiler to decide that vectorization would not be worthwhile.
// arrays accessed with stride 2
for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[i];
// inner loop accesses a with stride SIZE
for (int j=0; j<SIZE; j++) {
for (int i=0; i<SIZE; i++) b[i] += a[i][j] * x[j];
}
// indirect addressing of x using index array
for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[index[i]];
The typical message from the vectorization report is "vectorization possible but seems inefficient", although indirect addressing may also result in the report "Existence of vector dependence".
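For the nested loop with stride SIZE, one common remedy is to interchange the two loops so that the inner loop walks along a row of a with unit stride. The sketch below illustrates that idea under the assumption that a is a square array of double; the function names are hypothetical, and whether the compiler performs this interchange on its own depends on the options used.

```c
#define SIZE 64

/* Original order: the inner loop walks down a column of a (stride SIZE). */
void matvec_strided(double a[SIZE][SIZE], const double x[SIZE],
                    double b[SIZE]) {
    for (int j = 0; j < SIZE; j++)
        for (int i = 0; i < SIZE; i++)
            b[i] += a[i][j] * x[j];
}

/* Interchanged order: the inner loop walks along a row of a (unit stride),
   and b[i] becomes a reduction over j. */
void matvec_unit_stride(double a[SIZE][SIZE], const double x[SIZE],
                        double b[SIZE]) {
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            b[i] += a[i][j] * x[j];
}
```

Both versions add the same products to each b[i] in the same order of j, so they compute the same values; only the memory access pattern of the inner loop changes.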
A[0]=0;
for (j=1; j<MAX; j++) A[j]=A[j-1]+1;
// this is equivalent to:
A[1]=A[0]+1; A[2]=A[1]+1; A[3]=A[2]+1; A[4]=A[3]+1;
So the value of j gets propagated to all A[j]. This cannot safely be vectorized: if the first two iterations are executed simultaneously by a SIMD instruction, the value of A[1] is used by the second iteration before it has been calculated by the first iteration.
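The effect can be reproduced without any SIMD hardware by simulating what a vector instruction does: all loads of a group happen before any of its stores. The sketch below is such a simulation with an assumed vector length of four; raw_serial and raw_naive_simd are illustrative names, not compiler output.

```c
#define VL 4  /* elements per simulated SIMD instruction */

/* Serial semantics: each iteration sees the value written by the
   previous iteration. */
void raw_serial(int *A, int n) {
    for (int j = 1; j < n; j++)
        A[j] = A[j - 1] + 1;
}

/* Naive "vectorized" semantics: VL loads, then VL stores.
   Leftover iterations are executed serially. */
void raw_naive_simd(int *A, int n) {
    int j = 1;
    for (; j + VL <= n; j += VL) {
        int tmp[VL];
        for (int k = 0; k < VL; k++)      /* vector load + add */
            tmp[k] = A[j - 1 + k] + 1;
        for (int k = 0; k < VL; k++)      /* vector store */
            A[j + k] = tmp[k];
    }
    for (; j < n; j++)
        A[j] = A[j - 1] + 1;
}
```

Starting from an array of nine zeros, raw_serial yields 0, 1, 2, ..., 8, whereas raw_naive_simd yields 0, 1, 1, 1, 1, 2, 1, 1, 1, because each group reads elements that the same group has not yet written.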
for (j=1; j<MAX; j++) A[j-1]=A[j]+1;
// this is equivalent to:
A[0]=A[1]+1; A[1]=A[2]+1; A[2]=A[3]+1; A[3]=A[4]+1;
This is not safe for general parallel execution, since the iteration with the write may execute before the iteration with the read. However, for vectorization, no iteration with a higher value of j can complete before an iteration with a lower value of j, and so vectorization is safe (i.e., gives the same result as non-vectorized code) in this case. The following example, however, may not be safe, since vectorization might cause some elements of A to be overwritten by the first SIMD instruction before being used for the second SIMD instruction.
for (j=1; j<MAX; j++) {
A[j-1]=A[j]+1;
B[j]=A[j]*2;
}
// this is equivalent to:
A[0]=A[1]+1; B[1]=A[1]*2; A[1]=A[2]+1; B[2]=A[2]*2; A[2]=A[3]+1; B[3]=A[3]*2;
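Why the two-statement loop may be unsafe can again be shown by simulation. The sketch below assumes a naive scheme in which the first statement is executed for a whole group of four iterations before the second statement starts; the function names and the vector length are illustrative assumptions.

```c
#define VL 4  /* elements per simulated SIMD instruction */

/* Serial semantics: B[j] reads A[j] before a later iteration overwrites it. */
void two_stmt_serial(int *A, int *B, int n) {
    for (int j = 1; j < n; j++) {
        A[j - 1] = A[j] + 1;
        B[j] = A[j] * 2;
    }
}

/* Naive statement-by-statement vectorization; leftover iterations
   are executed serially. */
void two_stmt_naive_simd(int *A, int *B, int n) {
    int j = 1;
    for (; j + VL <= n; j += VL) {
        int tmp[VL];
        for (int k = 0; k < VL; k++)      /* first statement: load + add */
            tmp[k] = A[j + k] + 1;
        for (int k = 0; k < VL; k++)      /* first statement: store to A */
            A[j - 1 + k] = tmp[k];
        for (int k = 0; k < VL; k++)      /* second statement reads A late */
            B[j + k] = A[j + k] * 2;
    }
    for (; j < n; j++) {
        A[j - 1] = A[j] + 1;
        B[j] = A[j] * 2;
    }
}
```

With A[j] initialized to 10*j, the serial loop gives B[1] = 20, while the naive scheme gives B[1] = 42, because A[1] has already been overwritten by the first simulated SIMD store. A vectorizer has to reorder or buffer the accesses to avoid this.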
sum=0;
for (j=1; j<MAX; j++) sum = sum + A[j]*B[j];
Although sum is both read and written in every iteration, the compiler recognizes such reduction idioms, and is able to vectorize them safely. The loop in example 1 was another example of a reduction, with a loop-invariant array element in place of a scalar.
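A common way to vectorize such a reduction is to keep one partial sum per vector lane and combine the lanes after the loop. The sketch below shows that scheme in scalar C with an assumed vector length of four; it is an illustration of the idea, not the code the Intel compiler emits, and the function names are hypothetical.

```c
#define VL 4  /* number of simulated vector lanes */

/* Serial reduction, as written in the source loop. */
int dot_serial(const int *A, const int *B, int n) {
    int sum = 0;
    for (int j = 1; j < n; j++)
        sum = sum + A[j] * B[j];
    return sum;
}

/* Reduction with VL independent partial sums, combined after the loop. */
int dot_partial_sums(const int *A, const int *B, int n) {
    int partial[VL] = {0};
    int j = 1;
    for (; j + VL <= n; j += VL)
        for (int k = 0; k < VL; k++)      /* one lane per k */
            partial[k] += A[j + k] * B[j + k];

    int sum = 0;
    for (int k = 0; k < VL; k++)          /* horizontal add of the lanes */
        sum += partial[k];
    for (; j < n; j++)                    /* leftover iterations */
        sum += A[j] * B[j];
    return sum;
}
```

For integers the two functions return the same value. For floating-point data the partial-sum scheme changes the order of the additions, so the result can differ from the serial loop in the last bits.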
These types of dependencies between loop iterations are sometimes known as loop-carried dependencies.
The above examples are of proven dependencies. However, the compiler cannot safely vectorize a loop if there is even a potential dependency. Consider the following example:
for (i = 0; i < size; i++) {
c[i] = a[i] * b[i];
}
In the above example, the compiler needs to determine whether, for some iteration i, c[i] might refer to the same memory location as a[i] or b[i] for a different iteration. (Such memory locations are sometimes said to be "aliased".) For example, if a[i] pointed to the same memory location as c[i-1], there would be a read-after-write dependency as in the earlier example. If the compiler cannot exclude this possibility, it will not vectorize the loop unless you provide the compiler with hints.
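The aliasing scenario described above is perfectly legal C, which is why the compiler has to consider it. The sketch below wraps the loop in a hypothetical function mul and calls it with c pointing one element past a, so that c[i-1] and a[i] are the same memory location; mul_overlapping is an illustrative helper.

```c
/* Element-wise product. Nothing in the signature tells the compiler
   that c is distinct from a or b. */
void mul(const int *a, const int *b, int *c, int size) {
    for (int i = 0; i < size; i++)
        c[i] = a[i] * b[i];
}

/* Calls mul with c overlapping a: c[i] is the same location as a[i + 1].
   buf must hold size + 1 elements. */
void mul_overlapping(int *buf, const int *b, int size) {
    mul(buf, b, buf + 1, size);
}
```

With buf = {1, 0, 0, 0, 0} and every element of b equal to 2, the serial loop turns buf into {1, 2, 4, 8, 16}: each iteration reads the value written by the previous one. Any vectorized version must reproduce exactly this result, or the loop must stay serial for such calls.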
Sometimes the compiler has insufficient information to decide to vectorize a loop. There are several ways to provide additional information to the compiler:
void copy(char *cp_a, char *cp_b, int n) {
for (int i = 0; i < n; i++) {
cp_a[i] = cp_b[i];
}
}
Without more information, a vectorizing compiler must conservatively assume that the memory regions accessed by the pointer variables cp_a and cp_b may (partially) overlap, which gives rise to potential data dependencies that prohibit straightforward conversion of this loop into SIMD instructions. At this point, the compiler may decide to keep the loop serial or, as done by the Intel® C/C++ compiler, generate a run-time test for overlap, where the loop in the true-branch can be converted into SIMD instructions:
if (cp_a + n < cp_b || cp_b + n < cp_a)
/* vector loop */
for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
else
/* serial loop */
for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
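The overlap test above can also be written out as a small helper, which makes it easy to see which calls take which branch. This is a sketch only: regions_disjoint and copy_checked are hypothetical names, n is assumed to be non-negative, and the addresses are compared as integers because relational comparison of pointers into different objects is undefined in ISO C.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if [p, p + n) and [q, q + n) share no byte, 0 otherwise. */
int regions_disjoint(const char *p, const char *q, size_t n) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    return a + n <= b || b + n <= a;
}

/* Copy with an explicit run-time overlap test, mirroring the two branches. */
void copy_checked(char *cp_a, char *cp_b, int n) {
    if (regions_disjoint(cp_a, cp_b, (size_t)n)) {
        /* No overlap: a compiler is free to use SIMD instructions here. */
        for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
    } else {
        /* Possible overlap: the element order must be preserved. */
        for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
    }
}
```

The sketch uses <= because regions that merely touch do not overlap; the strict < comparison in the test shown above is slightly more conservative and equally safe.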
Run-time data-dependency testing provides a generally effective way to exploit implicit parallelism in C or C++ code at the expense of a slight increase in code size and testing overhead. If the function copy is only used in specific ways, however, you can assist the vectorizing compiler as follows:
void copy(char *cp_a, char *cp_b, int n) {
#pragma ivdep
for (int i = 0; i < n; i++) {
cp_a[i] = cp_b[i];
}
}
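A typical situation for #pragma ivdep is a loop whose safety depends on a value that only the programmer knows. The sketch below is a hypothetical example: the compiler cannot know the sign of k, so it must assume that a[i + k] might have been written by an earlier iteration. The pragma is guarded with the __INTEL_COMPILER macro here only so that other compilers do not warn about an unknown pragma.

```c
/* Scales a shifted copy of the array in place.
   Assumption asserted by the programmer: k >= 0, so a[i + k] is never
   an element that an earlier iteration has already written. */
void scale_shifted(int *a, int k, int c, int m) {
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
```

If the function were ever called with a negative k, the pragma would be a false promise and a vectorized loop could produce results that differ from the serial ones.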
You can also use the restrict keyword in the declarations of cp_a and cp_b, as shown below, to inform the compiler that each pointer variable provides exclusive access to a certain memory region. The restrict qualifier in the argument list lets the compiler know that there are no other aliases to the memory to which the pointers point. In other words, the pointer for which it is used provides the only means of accessing the memory in question in the scope in which the pointers live. The code may still be vectorized without the restrict keyword, but the compiler then has to check for aliasing at run-time; with restrict, that check is not needed. You may have to use an extra compiler option, such as /Qrestrict (Windows*) or -restrict (Linux* and Mac OS* X), for the Intel C/C++ compiler.
void copy(char * __restrict cp_a, char * __restrict cp_b,
int n) {
for (int i = 0; i < n; i++) cp_a[i] = cp_b[i];
}
This method is convenient when the exclusive access property holds for pointer variables that are used in a large portion of code with many loops, because it avoids the need to annotate each of the vectorizable loops individually. Note, however, that both the loop-specific #pragma ivdep hint and the pointer variable-specific restrict hint must be used with care, because incorrect usage may change the semantics intended in the original program.
Another example is the following loop that may also not get vectorized because of a potential aliasing problem between pointers a, b and c:
// potential unsupported loop structure
void add(float *a, float *b, float *c) {
for (int i=0; i<SIZE; i++) {
c[i] += a[i] + b[i];
}
}
If the restrict keyword is added to the parameters, the compiler will trust that you will not access the memory in question through any other pointer, and will vectorize the code properly:
// let the compiler know that the pointers are safe, using restrict
void add(float * __restrict a, float * __restrict b, float * __restrict c) {
for (int i=0; i<SIZE; i++) {
c[i] += a[i] + b[i];
}
}
The down-side of using restrict is that not all compilers support this keyword, so your source code may lose portability. If you care about source code portability you may want to consider using the compiler option /Qansi-alias (Windows*) or -ansi-alias (Linux* and Mac OS* X) instead. However, compiler options work globally, so you have to make sure they do not cause harm to other code fragments.
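Another way to limit the portability cost, shown here only as a sketch, is to hide the keyword behind a macro that expands to nothing on compilers that do not support it. The macro name RESTRICT and the function add_n are hypothetical, and the list of compilers tested is an assumption that may need adjusting for a given toolchain.

```c
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define RESTRICT restrict      /* ISO C99 keyword */
#elif defined(__GNUC__) || defined(__INTEL_COMPILER) || defined(_MSC_VER)
#  define RESTRICT __restrict    /* common compiler extension */
#else
#  define RESTRICT               /* unknown compiler: no aliasing hint */
#endif

/* Same loop as add above, with the length passed in and the aliasing
   promise expressed through the portable macro. */
void add_n(const float *RESTRICT a, const float *RESTRICT b,
           float *RESTRICT c, int n) {
    for (int i = 0; i < n; i++)
        c[i] += a[i] + b[i];
}
```

On a compiler without restrict support the code still compiles and runs correctly; it merely loses the hint, so the loop may be vectorized with a run-time check or not at all.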
Copyright © 1996-2011, Intel Corporation. All rights reserved.