High-level optimization (HLO) exploits the properties of source code constructs (for example, loops and arrays) in applications developed in high-level programming languages. Within HLO, loop transformation techniques include:
Loop Permutation or Interchange
Loop Distribution
Loop Fusion
Loop Unrolling
Data Prefetching
Scalar Replacement
Unroll and Jam
Loop Blocking or Tiling
Partial-Sum Optimization
Loadpair Optimization
Predicate Optimization
Loop Versioning with Runtime Data-Dependence Check (IA-64 architecture only)
Loop Versioning with Low Trip-Count Check
Loop Reversal
Profile-Guided Loop Unrolling
Loop Peeling
Data Transformation: Malloc Combining and Memset Combining
Loop Rerolling
Memset and Memcpy Recognition
Statement Sinking for Creating Perfect Loopnests
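As an illustration of one technique from the list, the following is a minimal hand-written sketch of loop interchange. The program name, array names, and the size n = 512 are hypothetical, and the second nest only approximates what the compiler might produce; the actual transformation happens internally and depends on the options and target architecture.

```fortran
program interchange_sketch
  implicit none
  integer, parameter :: n = 512      ! hypothetical problem size
  real :: a(n, n), b(n, n)
  integer :: i, j

  b = 1.0

  ! Original nest: the inner loop varies the second subscript, so
  ! consecutive iterations touch elements n reals apart (Fortran
  ! stores arrays in column-major order).
  do i = 1, n
    do j = 1, n
      a(i, j) = 2.0 * b(i, j)
    end do
  end do

  ! Interchanged nest: the inner loop now varies the first subscript,
  ! so memory is accessed with unit stride. The result is identical
  ! because no iteration depends on another.
  do j = 1, n
    do i = 1, n
      a(i, j) = 2.0 * b(i, j)
    end do
  end do

  print *, 'sum = ', sum(a)
end program interchange_sketch
```

Interchange is legal here only because the iterations are independent; when the compiler cannot prove that, it leaves the loop order unchanged.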
The default optimization level, -O2 (Linux* OS and Mac OS* X) or /O2 (Windows* OS), performs some high-level optimizations (for example, prefetching and complete unrolling). Specifying -O3 (Linux and Mac OS X) or /O3 (Windows) provides the best chance for performing loop transformations to optimize memory accesses. The scope of optimizations enabled by these options differs among the IA-32, Intel® 64, and IA-64 architectures.
In conjunction with the vectorization options, -ax and -x (Linux and Mac OS X) or /Qax and /Qx (Windows), the -O3 (Linux and Mac OS X) or /O3 (Windows) option causes the compiler to perform more aggressive data dependency analysis than the default -O2 (Linux and Mac OS X) or /O2 (Windows).
Compiler prefetching is disabled in favor of the prefetching support available in the processors.
The -O3 (Linux and Mac OS X) or /O3 (Windows) option enables the -O2 (Linux and Mac OS X) or /O2 (Windows) option and adds more aggressive optimizations (such as loop transformations). The -O3 or /O3 option optimizes for maximum speed, but may not improve performance for some programs.
The -ivdep-parallel (Linux) or /Qivdep-parallel (Windows) option implies there is no loop-carried dependency in the loop where an IVDEP directive is specified. (This strategy is useful for sparse matrix applications.)
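The sparse-matrix case can be sketched with an indirectly indexed update. This is an assumption-laden example: the program name, the arrays, and the index vector idx are hypothetical, and idx is deliberately built as a permutation so that the assertion made by the directive is actually true. Compilers that do not recognize the directive treat it as a comment.

```fortran
program ivdep_sketch
  implicit none
  integer, parameter :: n = 1000     ! hypothetical problem size
  real :: a(n), b(n)
  integer :: idx(n), i

  ! Hypothetical data: idx is a permutation (a reversal), so no two
  ! iterations update the same element of a.
  do i = 1, n
    idx(i) = n + 1 - i
    a(i) = 0.0
    b(i) = real(i)
  end do

  ! The compiler cannot prove that idx holds distinct values, so it
  ! must assume a possible dependence between iterations. The
  ! directive asserts that there is none.
!DEC$ IVDEP
  do i = 1, n
    a(idx(i)) = a(idx(i)) + b(i)
  end do

  print *, a(1), a(n)
end program ivdep_sketch
```

If idx could contain repeated values, the assertion would be false and the directive should not be used, because the optimized loop could then produce wrong results.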
Tune applications for IA-64 architecture by following these general steps:
Compile your program with -O3 (Linux) or /O3 (Windows) and -ipo (Linux) or /Qipo (Windows). Use profile-guided optimization whenever possible.
Generate a high-level optimization report.
Check why loops are not software pipelined.
Make the changes indicated by the results of the previous steps.
Repeat these steps until you achieve the desired performance.
In general, you can use the following strategies to tune applications for multiple architectures:
Use !DEC$ ivdep to tell the compiler there is no dependency. You may also need the -ivdep-parallel (Linux and Mac OS X) or /Qivdep-parallel (Windows) option to indicate there is no loop-carried dependency.
Use !DEC$ swp to enable software pipelining (useful for lopsided control and unknown loop count).
Use !DEC$ loop count(n) when needed.
If Cray pointers are used, use -safe-cray-ptr (Linux and Mac OS X) or /Qsafe-cray-ptr (Windows) to indicate there is no aliasing.
Use !DEC$ distribute point to split large loops (normally, this is automatically done).
Check that the prefetch distance is correct. Use !DEC$ prefetch to override the distance when needed.
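Two of the directives above can be sketched together as follows. The program, the subroutine update, its argument names, and the trip count of 8 are all hypothetical choices made for illustration; whether the hints change the generated code depends on the compiler version and target architecture. Compilers that do not recognize the directives treat them as comments.

```fortran
program directive_sketch
  implicit none
  integer, parameter :: n = 8        ! hypothetical, deliberately small
  real :: a(n), b(n), c(n), d(n)

  a = 1.0
  b = 2.0
  c = 3.0
  d = 4.0
  call update(a, b, c, d, n)
  print *, a(1), c(1)

contains

  subroutine update(p, q, r, s, m)
    integer, intent(in) :: m
    real, intent(inout) :: p(m), r(m)
    real, intent(in) :: q(m), s(m)
    integer :: i

    ! Hint that the loop typically runs about 8 iterations, since the
    ! compiler cannot see the value of m inside this routine.
!DEC$ LOOP COUNT (8)
    do i = 1, m
      p(i) = p(i) + q(i)
      ! Suggest splitting the loop here into two separate loops.
!DEC$ DISTRIBUTE POINT
      r(i) = r(i) * s(i)
    end do
  end subroutine update

end program directive_sketch
```

The two statements in the loop body are independent of each other, so splitting the loop at the marked point does not change the results.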