Vectorization and Loops

This topic discusses loop parallelization in the context of vectorization.

Using the -parallel (Linux*) or the /Qparallel (Windows*) option enables parallelization for both Intel® microprocessors and non-Intel microprocessors. The resulting executable may achieve additional performance gains on Intel microprocessors compared to non-Intel microprocessors. The parallelization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Using the -vec (Linux*) or the /Qvec (Windows*) option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in additional performance gains on Intel microprocessors compared to non-Intel microprocessors. The vectorization can also be affected by certain options, such as /arch or /Qx (Windows) or -m or -x (Linux and Mac OS X).

Interactions with Loop Parallelization

Combine the -parallel (Linux* and Mac OS* X) or /Qparallel (Windows*) and -x (Linux) or /Qx (Windows) options to instruct the compiler to attempt both automatic loop parallelization and automatic loop vectorization in the same compilation.

In most cases, the compiler will consider outermost loops for parallelization and innermost loops for vectorization. If deemed profitable, however, the compiler may even apply loop parallelization and vectorization to the same loop.
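As an illustrative sketch (the loop, file name, and options shown below are examples, not taken from this topic), a doubly nested loop such as the following gives the compiler both opportunities: the outer loop is a candidate for auto-parallelization and the inner loop is a candidate for vectorization. The exact report messages vary with compiler version and options.

#define N 1024

void scale(float a[N][N], float b[N][N], float s) {
  int i, j;
  for (i = 0; i < N; i++) {      /* outer loop: candidate for auto-parallelization */
    for (j = 0; j < N; j++) {    /* inner loop: candidate for vectorization */
      a[i][j] = s * b[i][j];
    }
  }
}

// Command line (illustrative)
> icc -c -parallel -par-report1 -vec-report1 scale.c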

See Programming with Auto-parallelization and Programming Guidelines for Vectorization.

In some rare cases, a successful loop parallelization (either automatically or by means of OpenMP* directives) may affect the messages reported by the compiler for a non-vectorizable loop in a non-intuitive way; for example, in cases where the -vec-report2 (Linux and Mac OS X) or /Qvec-report2 (Windows) option indicates that loops were not successfully vectorized.

Types of Vectorized Loops

For integer loops, the 128-bit Intel® Streaming SIMD Extensions (Intel® SSE) and the Intel® Advanced Vector Extensions (Intel® AVX) provide SIMD instructions for most arithmetic and logical operators on 32-bit, 16-bit, and 8-bit integer data types, with limited support for the 64-bit integer data type.

Vectorization may proceed if the final precision of integer wrap-around arithmetic is preserved. A 32-bit shift-right operator, for instance, is not vectorized in 16-bit mode if the final stored value is a 16-bit integer. Also, note that because the Intel® SSE and the Intel® AVX instruction sets are not fully orthogonal (shifts on byte operands, for instance, are not supported), not all integer operations can actually be vectorized.
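The contrast can be sketched as follows; this example is illustrative and not part of the original text. The first loop uses plain 16-bit wrap-around addition, which maps directly onto SIMD integer instructions and is typically vectorizable. The second loop shifts 8-bit operands, an operation for which Intel® SSE and Intel® AVX provide no direct instruction, so it may remain unvectorized or be vectorized less efficiently after type promotion.

void add16(short *a, const short *b, const short *c, int n) {
  int i;
  for (i = 0; i < n; i++)
    a[i] = (short)(b[i] + c[i]);   /* 16-bit wrap-around add: typically vectorizable */
}

void shift8(char *a, const char *b, int n) {
  int i;
  for (i = 0; i < n; i++)
    a[i] = (char)(b[i] >> 2);      /* shift on byte operands: no direct SIMD support */
}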

For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers, Intel® SSE provides SIMD instructions for the following arithmetic operators: addition (+), subtraction (-), multiplication (*), and division (/).

Additionally, the Streaming SIMD Extensions provide SIMD instructions for the binary MIN and MAX and unary SQRT operators. SIMD versions of several other mathematical operators (like the trigonometric functions SIN, COS, and TAN) are supported in software through a vector mathematical run-time library that is provided with the Intel® compiler; the compiler takes advantage of this library automatically.

To be vectorizable, loops must satisfy a number of constraints on the statements and operations they contain, as described in the remainder of this topic.

Intrinsic math functions such as sin(), log(), fmax(), and so on, are allowed, because the compiler runtime library contains vectorized versions of these functions. See the table below for a list of these functions; most exist in both float and double versions.

acos ceil fabs round
acosh cos floor sin
asin cosh fmax sinh
asinh erf fmin sqrt
atan erfc log tan
atan2 erfinv log10 tanh
atanh exp log2 trunc
cbrt exp2 pow  

The loop in the following example may be vectorized because sqrtf() is vectorizable and func() gets inlined. Inlining is enabled at default optimization for functions in the same source file. An inlining report may be obtained by setting the /Qopt-report-phase ipo_inl (Windows*) or -opt-report-phase ipo_inl (Linux* and Mac OS* X) option.

float func(float x, float y, float xp, float yp) {
  float denom;
  denom = (x-xp)*(x-xp) + (y-yp)*(y-yp);
  denom = 1./sqrtf(denom);
  return denom;
}
float trap_int(float y, float x0, float xn, int nx, float xp, float yp) {
  float x, h, sumx;
  int i;
  h = (xn-x0) / nx;
  sumx = 0.5*( func(x0,y,xp,yp) + func(xn,y,xp,yp) );
  for (i=1;i<nx;i++) {
    x = x0 + i*h;
    sumx = sumx + func(x,y,xp,yp);
  }
  sumx = sumx * h;
  return sumx;
}
// Command line
> icc -c -vec-report2 trap_int.c

trap_int.c(16) (col. 3): remark: LOOP WAS VECTORIZED.

Statements in the Loop Body

The vectorizable operations are different for floating-point and integer data.

Integer Array Operations

The statements within the loop body may contain char, unsigned char, short, unsigned short, int, and unsigned int data types. Calls to functions such as sqrt and fabs are also supported. Arithmetic operations are limited to addition, subtraction, bitwise AND, OR, and XOR operators, division (via run-time library call), multiplication, min, and max. You can mix data types, but doing so may lower efficiency. Examples of operators that allow mixed data types are multiplication, shift, and unary operators.
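As an illustrative sketch (the function below is hypothetical, not from the original text), the first loop uses only operations from the list above on a single integer type, while the second loop mixes short and int operands, which is permitted but may add conversions that reduce efficiency.

void int_ops(int *a, const int *b, const int *c, const short *s, const short *t, int n) {
  int i;
  for (i = 0; i < n; i++)
    a[i] = (b[i] & c[i]) | (b[i] ^ c[i]);   /* bitwise AND, OR, XOR on 32-bit data */
  for (i = 0; i < n; i++)
    a[i] = s[i] * t[i] + b[i];              /* mixed short/int operands: allowed, but may lower efficiency */
}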

Other Operations

No statements other than the preceding floating-point and integer operations are allowed. In particular, note that the special __m64, __m128, and __m256 data types are not vectorizable. The loop body cannot contain any function calls other than the intrinsic math functions listed earlier. Use of the Streaming SIMD Extensions intrinsics (for example, _mm_add_ps) or the Intel® AVX intrinsics (for example, _mm256_add_ps) is not allowed.

Data Dependency

Data dependency relations represent the required ordering constraints on the operations in serial loops. Because vectorization rearranges the order in which operations are executed, any auto-vectorizer must have at its disposal some form of data dependency analysis.

An example where data dependencies prohibit vectorization is shown below. In this example, the value of each element of an array is dependent on the value of its neighbor that was computed in the previous iteration.

Example 1: Data-dependent Loop

int i;
void dep(float *data){
  for (i=1; i<100; i++){
   data[i] = data[i-1]*0.25 + data[i]*0.5 + data[i+1]*0.25;
  }
}

The loop in the above example is not vectorizable because the WRITE to the current element data[i] is dependent on the use of the preceding element data[i-1], which has already been written to and changed in the previous iteration. To see this, look at the access patterns of the array for the first two iterations as shown below.

Example 1: Data-dependency Vectorization Patterns

i=1: READ data[0]
READ data[1]
READ data[2]
WRITE data[1]
i=2: READ data[1]
READ data[2]
READ data[3]
WRITE data[2]

In the normal sequential version of this loop, the value of data[1] read during the second iteration was written during the first iteration. For vectorization, it must be possible to do the iterations in parallel without changing the semantics of the original loop.

Example 2: Data-independent Loop

for(i=0; i<100; i++)
  a[i]=b[i];

// which has the following access pattern:

read b[0]
write a[0]
read b[1]
write a[1]

Data Dependency Analysis

Data dependency analysis involves finding the conditions under which two memory accesses may overlap. Given two references in a program, the analysis must determine whether the referenced variables may refer to the same (or overlapping) regions of memory and, for array references, how the subscript expressions of the references are related.

The data dependency analyzer for array references is organized as a series of tests, which progressively increase in power as well as in time and space costs.

First, a number of simple tests are performed in a dimension-by-dimension manner, since independence in any dimension will exclude any dependency relationship. Multidimensional array references that may cross their declared dimension boundaries can be converted to their linearized form before the tests are applied.
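As an assumed illustration of linearization (not from the original text), a two-dimensional reference such as x[i][j] into a row-major 100x100 C array touches the same memory as a single-subscript reference with the linear subscript 100*i + j, which is the form the dependency tests can then work on.

#define N 100

void update(float x[N][N]) {
  int i, j;
  float *p = &x[0][0];
  for (i = 0; i < N; i++)
    for (j = 1; j < N; j++)
      p[N*i + j] = p[N*i + (j-1)] + 1.0f;   /* linearized form of x[i][j] = x[i][j-1] + 1.0f */
}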

Some of the simple tests that can be used are the fast greatest common divisor (GCD) test and the extended bounds test. The GCD test proves independence if the GCD of the coefficients of the loop indices cannot evenly divide the constant term. The extended bounds test checks for potential overlap of the extreme values in subscript expressions.
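A worked sketch of the GCD test (this loop is illustrative, not from the original text): the writes access only even elements and the reads access only odd elements. Any dependency would require an integer solution of 2*i1 = 2*i2 + 1, and since GCD(2, 2) = 2 does not evenly divide the constant term 1, no such solution exists, so the two references are independent.

void gcd_example(int *a, int n) {
  int i;
  for (i = 0; i < n; i++)
    a[2*i] = a[2*i + 1] + 1;   /* writes touch even indices, reads touch odd indices */
}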

If all simple tests fail to prove independence, the compiler eventually resorts to a powerful hierarchical dependency solver that uses Fourier-Motzkin elimination to solve the data dependency problem in all dimensions.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



Copyright © 1996-2011, Intel Corporation. All rights reserved.