HLO exploits the properties of source code constructs (for example, loops and arrays) in applications developed in high-level programming languages. Within HLO, loop transformation techniques include:
Loop Permutation or Interchange
Loop Distribution
Loop Fusion
Loop Unrolling
Data Prefetching
Scalar Replacement
Unroll and Jam
Loop Blocking or Tiling
Partial-Sum Optimization
Loadpair Optimization
Predicate Optimization
Loop Versioning with Runtime Data-Dependence Check (IA-64 architecture only)
Loop Versioning with Low Trip-Count Check
Loop Reversal
Profile-Guided Loop Unrolling
Loop Peeling
Data Transformation: Malloc Combining and Memset Combining
Loop Rerolling
Memset and Memcpy Recognition
Statement Sinking for Creating Perfect Loopnests
The default optimization level, -O2 (Linux* OS and Mac OS* X) or /O2 (Windows* OS), performs some high-level optimizations (for example, prefetching and complete unrolling), but specifying -O3 (Linux and Mac OS X) or /O3 (Windows) provides the best chance for the compiler to perform the loop transformations that optimize memory accesses. The scope of optimizations enabled by these options differs for the IA-32, Intel® 64, and IA-64 architectures.
In conjunction with the vectorization options, -ax and -x (Linux and Mac OS X) or /Qax and /Qx (Windows), the -O3 (Linux and Mac OS X) or /O3 (Windows) option causes the compiler to perform more aggressive data dependency analysis than the default -O2 (Linux and Mac OS X) or /O2 (Windows).
Compiler prefetching is disabled in favor of the prefetching support available in the processors.
The -O3 (Linux and Mac OS X) or /O3 (Windows) option enables the -O2 (Linux and Mac OS X) or /O2 (Windows) option and adds more aggressive optimizations (like loop transformations); O3 optimizes for maximum speed, but may not improve performance for some programs.
The -ivdep-parallel (Linux) or /Qivdep-parallel (Windows) option implies there is no loop-carried dependency in the loop where an ivdep pragma is specified. (This strategy is useful for sparse matrix applications.)
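A minimal sketch of where the ivdep pragma applies: an indirect (gather/scatter) update, as in sparse-matrix code, where the compiler cannot prove that the index array holds distinct values. The function name and arguments are illustrative; non-Intel compilers simply ignore the pragma.

```c
/* #pragma ivdep asserts that the scatter through idx[] carries no loop
   dependence (i.e., the programmer knows the idx[] entries are distinct).
   With -ivdep-parallel (Linux) or /Qivdep-parallel (Windows), the
   assertion is that there is no loop-carried dependence at all. */
void scale_sparse(double *x, const int *idx, const double *v, int n)
{
#pragma ivdep
    for (int i = 0; i < n; i++)
        x[idx[i]] += v[i];  /* compiler alone cannot rule out idx[i] == idx[j] */
}
```

If the index array could contain duplicates, the pragma would be an incorrect assertion and the transformed code could produce wrong results, so this promise rests entirely on the programmer.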
Tune applications for IA-64 architecture by following these general steps:
Compile your program with -O3 (Linux) or /O3 (Windows) and -ipo (Linux) or /Qipo (Windows). Use profile-guided optimization whenever possible.
Generate a high-level optimization report.
Check why loops are not software pipelined.
Make the changes indicated by the results of the previous steps.
Repeat these steps until you achieve the desired performance.
In general, you can use the following strategies to tune applications for multiple architectures:
Use #pragma ivdep to indicate there is no dependence. You might need to compile with the -ivdep-parallel (Linux and Mac OS X) or /Qivdep-parallel (Windows) option to assert that there is no loop-carried dependence at all.
Use #pragma swp to enable software pipelining (useful for lopsided control flow and loops with unknown trip counts).
Use #pragma loop count(n) when needed.
Compile with -ansi-alias (Linux and Mac OS X) or /Qansi-alias (Windows) to let the compiler assume the program follows ANSI aliasing rules.
Add the restrict keyword to ensure there is no aliasing. Compile with -restrict (Linux) or /Qrestrict (Windows).
Use -fargument-noalias (Linux and Mac OS X) or /Qalias-args- (Windows) to indicate arguments are not aliased.
Use #pragma distribute point to split large loops (normally this is done automatically).
For C code, do not use unsigned int for loop indexes; HLO may skip optimization because of possible subscript overflow. If an upper bound is a pointer reference, assign it to a local variable whenever possible.
Check that the prefetch distance is correct. Use #pragma prefetch to override the computed distance when needed.
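Several of the strategies above can appear together in one loop, as in the sketch below. The pragma spellings follow this document and are ignored (at most with a warning) by non-Intel compilers; the trip count and the decision to prefetch x are illustrative values, not recommendations.

```c
/* restrict promises that y and x never overlap, so the compiler can
   software-pipeline the loop without runtime disambiguation checks.
   Compile with -restrict (Linux) or /Qrestrict (Windows). */
void daxpy(double *restrict y, const double *restrict x,
           double alpha, const int *np)
{
    int n = *np;             /* copy a pointer-referenced upper bound
                                into a local, as recommended above */
#pragma loop count(1000)     /* typical trip count hint */
#pragma prefetch x           /* request/override prefetching of x */
    for (int i = 0; i < n; i++)  /* signed index: unsigned can block HLO */
        y[i] += alpha * x[i];
}
```

Copying *np into the local n also tells the compiler the bound cannot change during the loop, which would otherwise have to be assumed possible whenever y could alias np.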