Loop independence is important because loops that are independent can be parallelized. Independent loops can be parallelized in a number of ways, from the coarse-grained parallelism of OpenMP* to the fine-grained Instruction-Level Parallelism (ILP) of vectorization and software pipelining.
A loop is considered independent when the computation of any iteration can be done without using the results of any other iteration. In other words, if iteration 1 of a loop can be computed, and iteration 2 can simultaneously be computed without using any result from iteration 1, then the loop is independent.
Occasionally, you can determine whether a loop is independent by comparing the output of the loop with the output of the same loop written with a decrementing index counter.
For example, the loop shown in Example 1 might be independent if the code in Example 2 generates the same result.
Example 1:

```c
#define MAX 1024
void loop_indep1(int a[MAX], int b[MAX])
{
  for (int j = 0; j < MAX; j++)
    a[j] = b[j];
}
```

Example 2:

```c
#define MAX 1024
void loop_indep2(int a[MAX], int b[MAX])
{
  for (int j = MAX - 1; j >= 0; j--)
    a[j] = b[j];
}
```
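Because no iteration of `loop_indep1` reads a value written by another iteration, the iterations can run in any order, or at the same time. As a minimal sketch (the function name is illustrative, not from the original examples), the coarse-grained approach mentioned above could look like this; compile with the compiler's OpenMP option (for example, -fopenmp or -qopenmp):

```c
#define MAX 1024

/* Each iteration touches only its own a[j] and b[j], so the
   iterations can safely be divided among threads. */
void loop_indep_omp(int a[MAX], int b[MAX])
{
  #pragma omp parallel for
  for (int j = 0; j < MAX; j++)
    a[j] = b[j];
}
```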
When loops are dependent, improving loop performance becomes much more difficult. Loops can be dependent in several general ways.
The following sections illustrate the different loop dependencies.
Cross-iteration flow dependence is created when variables are written then read in different iterations, as shown in the following example:
Example:

```c
void flow_dep(double A[])
{
  for (int j = 1; j < 1024; j++)
    A[j] = A[j-1];
}
```
The above example is equivalent to the following lines for the first few iterations:
Sample Iterations:

```c
A[1] = A[0];
A[2] = A[1];
```
Recurrence relations feed information forward from one iteration to the next.
Example:

```c
#define MAX 1024
void time_stepping_loops(double a[], double b[])
{
  for (int j = 1; j < MAX; j++) {
    a[j] = a[j-1] + b[j];
  }
}
```
Most recurrences cannot be made fully parallel. Instead, look for a loop further out or further in to parallelize, as sketched below. You might also gain additional performance through unrolling.
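For instance, here is a hedged sketch (the function name and two-dimensional layout are assumptions for illustration, not from the examples above): the inner loop still carries the recurrence, but each row is independent of every other row, so the outer loop can be parallelized.

```c
#define MAX 1024

/* The inner j loop carries the recurrence a[i][j] = a[i][j-1] + b[i][j],
   but no row i reads data from any other row, so the outer i loop
   can safely be divided among threads. */
void recurrence_rows(double a[][MAX], double b[][MAX], int rows)
{
  #pragma omp parallel for
  for (int i = 0; i < rows; i++)
    for (int j = 1; j < MAX; j++)
      a[i][j] = a[i][j-1] + b[i][j];
}
```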
Cross-iteration anti-dependence is created when variables are read then written in different iterations, as shown in the following example:
Example:

```c
void anti_dep1(double A[])
{
  for (int j = 1; j < 1024; j++)
    A[j] = A[j+1];
}
```
The above example is equivalent to the following lines for the first few iterations:
Sample Iterations:

```c
A[1] = A[2];
A[2] = A[3];
```
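One standard remedy, shown here as a hedged sketch (the temporary array and function name are assumptions, not from the original), is renaming: read from a snapshot copy so that no iteration's write can conflict with another iteration's read.

```c
/* Break the anti-dependence by renaming: both loops below are
   independent, because each only reads one array and writes another. */
void anti_dep_renamed(double A[])
{
  double T[1024];
  for (int j = 1; j < 1024; j++)   /* snapshot the values before any write */
    T[j] = A[j+1];
  for (int j = 1; j < 1024; j++)   /* each A[j] now depends only on T[j] */
    A[j] = T[j];
}
```

The copy trades extra memory traffic for independence, so it pays off only when the parallel or vector gain outweighs the cost of the copy.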
Cross-iteration output dependence is created when a variable is written in one iteration and rewritten in a different iteration. The following example illustrates this type of dependency:
Example:

```c
void output_dep(double A[], double B[], double C[])
{
  for (int j = 1; j < 1024; j++) {
    A[j]   = B[j];
    A[j+1] = C[j];
  }
}
```
The above example is equivalent to the following lines for the first few iterations:
Sample Iterations:

```c
A[1] = B[1];  A[2] = C[1];
A[2] = B[2];  A[3] = C[2];
```
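As a hedged illustration (not part of the original text): every write A[j+1] = C[j] except the last one is overwritten by A[j] = B[j] on the following iteration, so the loop above is equivalent to one in which each element is written exactly once, and the iterations become independent.

```c
/* Equivalent rewrite: only the final C write survives the original loop. */
void output_dep_removed(double A[], double B[], double C[])
{
  for (int j = 1; j < 1024; j++)   /* independent: each A[j] written once */
    A[j] = B[j];
  A[1024] = C[1023];               /* the one surviving write to A[j+1] */
}
```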
The Intel® compiler can successfully vectorize or software pipeline (SWP) most loops containing reductions on simple math operators like multiplication (*), addition (+), subtraction (-), and division (/). Reductions collapse array data to scalar data by using associative operations:
Example:

```c
#define MAX 1024
void reduction(double * sum, double c[])
{
  for (int j = 0; j < MAX; j++) {
    *sum = *sum + c[j];
  }
}
```
The compiler might occasionally misidentify a reduction and report flow, anti-, or output dependencies, or sometimes loop-carried memory-dependency edges; when that happens, it will not vectorize or SWP the loop. If you recognize that the construct is simply a reduction, you can use pragmas (such as #pragma ivdep or #pragma swp) to direct the compiler to vectorize or SWP the loop.
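For instance, here is a minimal sketch of the reduction above with such a hint; accumulating into a local variable is an added assumption that also removes any aliasing concern through the sum pointer:

```c
#define MAX 1024

/* #pragma ivdep tells the compiler to ignore assumed loop-carried
   dependences, letting it vectorize or SWP the reduction. */
void reduction_hinted(double * sum, double c[])
{
  double s = *sum;                 /* local accumulator */
  #pragma ivdep
  for (int j = 0; j < MAX; j++)
    s = s + c[j];
  *sum = s;
}
```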