The compiler may or may not apply the following optimizations to your loop: Interchange, Unrolling, Cache Blocking, Loop Distribution, Loop Fusion, and LoadPair. These transformations are discussed in the following sections, including how to transform loops manually and how to control them with pragmas or internal options.
Loop Interchange is a nested loop transformation applied by High-level Optimization (HLO) that swaps the order of execution of two nested loops. Typically, the transformation is performed to provide sequential unit-stride access to array elements used inside the loop, improving cache locality. At the -O3 (Linux* and Mac OS* X) or /O3 (Windows*) optimization level, the compiler looks for opportunities to apply loop interchange for you.
The following is an example of loop interchange:
Example:

#define NUM 1024
void loop_interchange( double a[][NUM], double b[][NUM], double c[][NUM] )
{
  int i,j,k;
  // Loop before Loop Interchange
  for (i=0; i<NUM; i++)
    for (j=0; j<NUM; j++)
      for (k=0; k<NUM; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
  // Loop after Loop Interchange of k & j loops
  for (i=0; i<NUM; i++)
    for (k=0; k<NUM; k++)
      for (j=0; j<NUM; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
See discussion under Non-Unit Stride Memory Access for more detail.
Loop unrolling is a loop transformation generally applied by HLO that can take better advantage of Instruction-Level Parallelism (ILP), keeping as many functional units busy doing useful work as possible during a single loop iteration. In loop unrolling, you add more work to the inside of the loop and do fewer loop iterations in exchange.
There are pragmas and internal options to control unroll behavior.
Pragma | Description
---|---
#pragma unroll | Specifying unroll by itself allows the compiler to determine the unroll factor.
#pragma unroll(n) | Instructs the compiler to unroll the loop n times. The corresponding command-line options are -unrolln (Linux* OS and Mac OS X) and /Qunroll:n (Windows*).
#pragma nounroll | Instructs the compiler not to unroll the specified loop.
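For example, a minimal sketch of pragma placement (the function name is illustrative): the pragma goes immediately before the loop it affects.

void unroll_pragma_sketch( double a[], double b[], int n )
{
  int i;
  // Request that the compiler unroll this loop four times.
  #pragma unroll(4)
  for (i=0; i<n; i++)
    a[i] = a[i] + b[i];
}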
Generally, you should unroll loops by factors related to the cache line size; experiment to find the best unroll factor. Consider the following loop:
Example:

#define NUM 1025
void loop_unroll_before( double a[][NUM], double b[][NUM], double c[][NUM])
{
  int i,j;
  int N,M;
  N=NUM;
  M=5;
  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      a[i][j] = b[i][j] + c[i][j];
}
Assume you want to unroll the "i" (outer) loop by a factor of 4, but you notice that 4 does not evenly divide N, which is 1025. Unrolling in this case is awkward; however, you can use a "post-conditioning loop" to take care of the leftover iterations as follows:
Example:

#define NUM 1025
void loop_unroll_after( double a[][NUM], double b[][NUM], double c[][NUM])
{
  int i,j,K;
  int N,M;
  N=NUM;
  M=5;
  K = N % 4;
  // Main part of loop, unrolled by a factor of 4.
  for (i=0; i<N-K; i+=4)
    for (j=0; j<M; j++) {
      a[i][j]   = b[i][j]   + c[i][j];
      a[i+1][j] = b[i+1][j] + c[i+1][j];
      a[i+2][j] = b[i+2][j] + c[i+2][j];
      a[i+3][j] = b[i+3][j] + c[i+3][j];
    }
  // Post-conditioning part of loop handles the remaining K iterations.
  for (i=N-K; i<N; i++)
    for (j=0; j<M; j++)
      a[i][j] = b[i][j] + c[i][j];
}
Post-conditioning is preferred over pre-conditioning because it preserves the alignment of the data accessed by the main unrolled loop, avoiding memory alignment access penalties.
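For contrast, a pre-conditioned version of the same loop (a sketch; this function is not in the original example) peels off the K remainder iterations first, so the main unrolled loop starts at i = K and its accesses may no longer be aligned:

#define NUM 1025
void loop_unroll_precondition( double a[][NUM], double b[][NUM], double c[][NUM])
{
  int i,j,K;
  int N,M;
  N=NUM;
  M=5;
  K = N % 4;
  // Pre-conditioning part runs first to handle the remainder.
  for (i=0; i<K; i++)
    for (j=0; j<M; j++)
      a[i][j] = b[i][j] + c[i][j];
  // Main unrolled loop now begins at row K, which may be
  // misaligned; this is why post-conditioning is preferred.
  for (i=K; i<N; i+=4)
    for (j=0; j<M; j++) {
      a[i][j]   = b[i][j]   + c[i][j];
      a[i+1][j] = b[i+1][j] + c[i+1][j];
      a[i+2][j] = b[i+2][j] + c[i+2][j];
      a[i+3][j] = b[i+3][j] + c[i+3][j];
    }
}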
Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. The application controls the behavior by dividing a large array into smaller blocks of memory so a thread can make repeated accesses to the data while the data is still in cache.
For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed on smaller portions of the total image or video frame. Compilers often use the same technique, by grouping related blocks of instructions close together so they execute from the L2 cache.
The effectiveness of the cache blocking technique depends on data block size, processor cache size, and the number of times the data is reused. Cache sizes vary by processor. An application can detect the data cache size using the CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance. As a general rule, cache block sizes should target approximately one-half to three-quarters of the physical cache size. For systems that are Hyper-Threading Technology (HT Technology) enabled, target one-quarter to one-half of the physical cache size. (See Designing for Hyper-Threading Technology for other design considerations.)
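As one illustration of adjusting the tile size at run time, the following sketch (function name hypothetical) sizes a square tile of doubles against the L2 cache. It uses the glibc sysconf extension _SC_LEVEL2_CACHE_SIZE rather than raw CPUID, and the one-half budget and the fallback size are assumptions, not fixed rules:

#include <unistd.h>
#include <math.h>

// Pick a square tile size (in double elements) targeting roughly
// half of the L2 data cache, shared by two operand arrays.
static int pick_tile_size(void)
{
  long l2_bytes = sysconf(_SC_LEVEL2_CACHE_SIZE);    // glibc extension
  if (l2_bytes <= 0)
    l2_bytes = 256 * 1024;                           // assumed fallback
  long budget = l2_bytes / 2;                        // one-half of cache
  long elems  = budget / (2 * (long)sizeof(double)); // two arrays share it
  return (int)sqrt((double)elems);
}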
Cache blocking is applied in HLO and is used on large arrays that cannot all fit into cache simultaneously. This method is one way of pulling a subset of data into a small region of cache and using that cached data as effectively as possible before it is replaced by new data from memory.
Example:

#define NUM 1024
void cache_blocking_before( double a[][NUM][NUM], double b[][NUM][NUM], double c[][NUM][NUM], int N )
{
  int i,j,k;
  N=1000;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      for (k=0; k<N; k++)
        a[i][j][k] = a[i][j][k] + b[i][j][k];
}

void cache_blocking_after( double a[][NUM][NUM], double b[][NUM][NUM], double c[][NUM][NUM], int N )
{
  int i,j,k,u,v;
  N=1000;
  for (v=0; v<N; v+=20)
    for (u=0; u<N; u+=20)
      for (k=v; k<v+20; k++)
        for (j=u; j<u+20; j++)
          for (i=0; i<N; i++)
            a[i][j][k] = a[i][j][k] + b[i][j][k];
}
The cache block size is set to 20. The goal is to read a block of data into cache, do all the computation we can with the data in that block, and only then load a new block of data into cache. There are 20 elements of A and 20 elements of B in cache at the same time, and you should do as much work with this data as you can before you increment to the next cache block.
Blocking factors will be different for different architectures; determine the blocking factors experimentally. For example, a different blocking factor would be required for single precision than for double precision, since twice as many single-precision values fit in the same amount of cache. Typically, the overall impact on performance can be significant.
Loop distribution is a high-level loop transformation that splits a large loop into two smaller loops. It can be useful in cases where optimizations like software pipelining (SWP) or vectorization cannot take place due to excessive register usage. By splitting a loop into smaller segments, it may be possible to get each smaller loop, or at least one of them, to software-pipeline or vectorize. An example is as follows:
Example:

#define NUM 1024
void loop_distribution_before( double a[NUM], double b[NUM], double c[NUM], double x[NUM], double y[NUM], double z[NUM] )
{
  int i;
  // Before distribution or splitting the loop.
  for (i=0; i<NUM; i++) {
    a[i] = a[i] + i;
    b[i] = b[i] + i;
    c[i] = c[i] + i;
    x[i] = x[i] + i;
    y[i] = y[i] + i;
    z[i] = z[i] + i;
  }
}

void loop_distribution_after( double a[NUM], double b[NUM], double c[NUM], double x[NUM], double y[NUM], double z[NUM] )
{
  int i;
  // After distribution or splitting the loop.
  for (i=0; i<NUM; i++) {
    a[i] = a[i] + i;
    b[i] = b[i] + i;
    c[i] = c[i] + i;
  }
  for (i=0; i<NUM; i++) {
    x[i] = x[i] + i;
    y[i] = y[i] + i;
    z[i] = z[i] + i;
  }
}
You can use the following pragma to suggest loop distribution to the compiler:

#pragma distribute point
When placed outside a loop, the pragma causes the compiler to attempt to distribute the loop based on an internal heuristic. The following is an example of using the pragma outside the loop:
Example:

#define NUM 1024
void loop_distribution_pragma1( double a[NUM], double b[NUM], double c[NUM], double x[NUM], double y[NUM], double z[NUM] )
{
  int i;
  // Before distribution or splitting the loop.
  #pragma distribute point
  for (i=0; i<NUM; i++) {
    a[i] = a[i] + i;
    b[i] = b[i] + i;
    c[i] = c[i] + i;
    x[i] = x[i] + i;
    y[i] = y[i] + i;
    z[i] = z[i] + i;
  }
}
When placed within a loop, the pragma causes the compiler to attempt to distribute the loop at that point; all loop-carried dependencies are ignored. The following example uses the pragma within a loop to indicate precisely where the split should take place:
Example:

#define NUM 1024
void loop_distribution_pragma2( double a[NUM], double b[NUM], double c[NUM], double x[NUM], double y[NUM], double z[NUM] )
{
  int i;
  // After distribution or splitting the loop.
  for (i=0; i<NUM; i++) {
    a[i] = a[i] + i;
    b[i] = b[i] + i;
    c[i] = c[i] + i;
    #pragma distribute point
    x[i] = x[i] + i;
    y[i] = y[i] + i;
    z[i] = z[i] + i;
  }
}
Loop Fusion is the inverse of Loop Distribution. The idea in loop fusion is to join two loops that have the same trip count in order to reduce loop overhead. The -O3 (Linux) or /O3 (Windows) option will attempt loop fusion if the opportunity is present.
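For illustration, here is a minimal before-and-after sketch of fusion (the function names are hypothetical; this pair is not from the original text). The two loops share the trip count NUM, so their bodies can be merged:

#define NUM 1024
void loop_fusion_before( double a[NUM], double b[NUM] )
{
  int i;
  // Two loops with the same trip count.
  for (i=0; i<NUM; i++)
    a[i] = a[i] + i;
  for (i=0; i<NUM; i++)
    b[i] = b[i] + i;
}

void loop_fusion_after( double a[NUM], double b[NUM] )
{
  int i;
  // One fused loop: same work, half the loop overhead.
  for (i=0; i<NUM; i++) {
    a[i] = a[i] + i;
    b[i] = b[i] + i;
  }
}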
Load pairs (ldfp) are instructions that load two contiguous single or double precision values from memory in one move. Load pairs can significantly improve performance.
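The compiler issues load pairs on its own when it can prove that two loads are contiguous. The following sketch (the function name is hypothetical) shows an access pattern that gives it that opportunity on architectures that support the instruction:

// Summing adjacent pairs: a[2*i] and a[2*i+1] are contiguous in
// memory, so the compiler may combine the two loads into one ldfp.
void sum_pairs( double a[], double s[], int n )
{
  int i;
  for (i=0; i<n/2; i++)
    s[i] = a[2*i] + a[2*i+1];
}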
There might be cases where these manual transformations are acceptable or even preferred. As a general rule, though, you should let the compiler transform loops for you. Manually transform loops as a last resort, only in cases where you are attempting to gain performance increases the compiler does not provide on its own.
Manual loop transformations have many disadvantages, which include the following:
- Application code becomes harder to maintain over time.
- New compiler features can cause you to lose any optimization you gained by manually transforming the loop.
- Transformations tuned to one architecture's requirements might unintentionally restrict your code to that architecture.
The HLO report can give you an idea of what loop transformations have been applied by the compiler.
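For example, a request for the HLO phase of the optimization report might look like the following; exact option spellings vary by compiler version, so treat these invocations as assumptions to verify against your compiler documentation:

icc -O3 -opt-report -opt-report-phase hlo sample.c      (Linux)
icl /O3 /Qopt-report /Qopt-report-phase:hlo sample.c    (Windows)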
Experimentation is a critical component in manually transforming loops. You might try to apply a loop transformation that the compiler ignored. Sometimes, it is beneficial to apply a manual loop transformation that the compiler has already applied with -O3 (Linux) or /O3 (Windows).