Auto-parallelization Overview

The auto-parallelization feature of the Intel® compiler automatically translates serial portions of the input program into equivalent multithreaded code. Both OpenMP* and auto-parallelization provide performance gains from shared memory on multiprocessor and dual-core systems.

The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops which can safely and efficiently be executed in parallel.

This behavior enables the potential exploitation of the parallel architecture found in symmetric multiprocessor (SMP) systems.

Automatic parallelization frees developers from having to:
- Find the loops that are good worksharing candidates
- Perform the dataflow analysis to verify correct parallel execution
- Partition the data for threaded code generation as is needed in programming with OpenMP* directives

The parallel run-time support provides the same run-time features as found in OpenMP, such as handling the details of loop iteration modification, thread scheduling, and synchronization.

While OpenMP directives make it possible to transform a serial application into a parallel application quickly, the programmer must explicitly identify the specific portions of the application code that contain parallelism and add the appropriate compiler directives.
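For instance, parallelizing a simple loop by hand with OpenMP might look like the following sketch; the function and variable names are illustrative, not taken from a particular application:

// Manual approach: the programmer marks the loop with an OpenMP directive.
void scale(float *x, float s, int n)
{
  int i;
  #pragma omp parallel for
  for (i=0; i<n; i++)
    x[i] = x[i] * s;
}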

Auto-parallelization, which is triggered by the -parallel (Linux* OS and Mac OS* X) or /Qparallel (Windows* OS) option, automatically identifies those loop structures that contain parallelism. During compilation, the compiler automatically attempts to decompose the code sequences into separate threads for parallel processing. No other effort by the programmer is needed.

Note

IA-64 architecture only: Specifying these options implies -opt-mem-bandwidth1 (Linux) or /Qopt-mem-bandwidth1 (Windows).

Serial code can be divided so that the code can execute concurrently on multiple threads. For example, consider the following serial code.

Example 1: Original Serial Code

void ser(int *a, int *b, int *c)
{
  for (int i=0; i<100; i++)
    a[i] = a[i] + b[i] * c[i];
}

The following example illustrates one way the loop iteration space from the previous example might be divided to execute on two threads.

Example 2: Transformed Parallel Code

void par(int *a, int *b, int *c)
{
  int i;
  // Thread 1
  for (i=0; i<50; i++)
    a[i] = a[i] + b[i] * c[i];
  // Thread 2
  for (i=50; i<100; i++)
    a[i] = a[i] + b[i] * c[i];
}
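The dataflow analysis mentioned earlier also rules loops out. The following sketch (illustrative code, not part of the examples above) shows a loop with a loop-carried dependence: each iteration reads a value written by the previous iteration, so the auto-parallelizer typically cannot safely divide its iterations among threads the way it does above.

void acc(int *a, int *b)
{
  // a[i] depends on a[i-1], which is computed in the previous
  // iteration, so the iterations cannot run concurrently.
  for (int i=1; i<100; i++)
    a[i] = a[i-1] + b[i];
}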

Auto-Vectorization and Parallelization

Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8 or up to 16 elements in one operation, depending on the data type. In some cases auto-parallelization and vectorization can be combined for better performance results.

The following example demonstrates how code can be designed to explicitly benefit from parallelization and vectorization. Assuming you compile the code shown below using -parallel -xSSE3 (Linux*) or /Qparallel /QxSSE3 (Windows*), the compiler will parallelize the outer loop and vectorize the innermost loop.

Example

#include <stdio.h>
#define ARR_SIZE 500 // Define array dimension

int main()
{
  int matrix[ARR_SIZE][ARR_SIZE];
  int arrA[ARR_SIZE] = {10};
  int arrB[ARR_SIZE] = {30};
  int i, j;

  for (i=0; i<ARR_SIZE; i++)
  {
    for (j=0; j<ARR_SIZE; j++)
    {
      matrix[i][j] = arrB[i] * (arrA[i]%2 + 10);
    }
  }
}

When you compile the example code with these options, the compiler should report results similar to the following:

vectorization.c(18) : (col. 6) remark: LOOP WAS VECTORIZED.

vectorization.c(16) : (col. 3) remark: LOOP WAS AUTO-PARALLELIZED.

Auto-vectorization can help improve performance of an application that runs on systems based on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors.

With the right choice of options, you can let the compiler apply auto-parallelization and auto-vectorization together and increase the performance of your application with minimal effort.

Additionally, with the relatively small effort of adding OpenMP directives to existing code, you can transform a sequential program into a parallel program.

The following example demonstrates one method of using the OpenMP pragmas within code.

Example

#include <stdio.h>
#define ARR_SIZE 100 // Define array dimension

void foo(int ma[][ARR_SIZE], int mb[][ARR_SIZE], int *a, int *b, int *c);

int main()
{
  int arr_a[ARR_SIZE];
  int arr_b[ARR_SIZE];
  int arr_c[ARR_SIZE];
  int i, j;
  int matrix_a[ARR_SIZE][ARR_SIZE];
  int matrix_b[ARR_SIZE][ARR_SIZE];

  // Initialize the arrays and matrices.
  // The inner loop index j must be private to each thread.
  #pragma omp parallel for private(j)
  for (i=0; i<ARR_SIZE; i++)
  {
    arr_a[i] = i;
    arr_b[i] = i;
    arr_c[i] = ARR_SIZE - i - 1; // stay within the bounds of ma in foo()
    for (j=0; j<ARR_SIZE; j++)
    {
      matrix_a[i][j] = j;
      matrix_b[i][j] = i;
    }
  }
  foo(matrix_a, matrix_b, arr_a, arr_b, arr_c);
}

void foo(int ma[][ARR_SIZE], int mb[][ARR_SIZE], int *a, int *b, int *c)
{
  int i, num, arr_x[ARR_SIZE];

  // Express the parallelism using the OpenMP pragma: parallel for.
  // The pragma guides the compiler in generating multithreaded code.
  // Arrays arr_x, ma, mb, a, b, and c are shared among threads based on
  // OpenMP data-sharing rules. Scalar num is specified as private
  // for each thread.
  #pragma omp parallel for private(num)
  for (i=0; i<ARR_SIZE; i++)
  {
    num = ma[b[i]][c[i]];
    arr_x[i] = mb[a[i]][num];
    printf("Values: %d\n", arr_x[i]); // prints values 0 to ARR_SIZE-1
  }
}
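As a quick way to confirm that OpenMP threads are actually created at run time, you can query the OpenMP run-time library. The following is a minimal sketch, assuming the code is compiled with OpenMP support enabled; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP routines:

#include <omp.h>
#include <stdio.h>

int main()
{
  #pragma omp parallel
  {
    // Each thread in the team reports its own ID and the team size.
    printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}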