High-level Optimization (HLO) performs specific optimizations based on the usefulness and applicability of each optimization. The HLO report can provide information on all relevant areas plus structure splitting and loop-carried scalar replacement, and it can provide information about interchanges not performed for the following reasons:
Function calls are inside the loop
Imperfect loop nesting
Data dependencies; the dependencies that prevent interchange are also reported.
The original order was proper, but performing the interchange might have been considered inefficient.
For example, the report can provide clues to why the compiler was unable to apply loop interchange to a loop nest that might have been considered a candidate for optimization. If the reported problems (bottlenecks) can be removed by changing the source code, the report suggests the possible loop interchanges.
Depending on the operating system, you must specify the following options to enable HLO and generate the reports:
Linux* and Mac OS* X: -x, -O2 or -O3, -opt-report 3, -opt-report-phase=hlo
Windows*: /Qx, /O2 or /O3, /Qopt-report:3, /Qopt-report-phase:hlo
See High-level Optimization Overview for information about enabling HLO.
The following command examples illustrate the general command needed to create an HLO report with combined options.
Operating System | Example Command |
---|---|
Linux and Mac OS X | icpc -c -xSSE3 -O3 -opt-report 3 -opt-report-phase=hlo sample.cpp |
Windows | icl /c /QxSSE3 /O3 /Qopt-report:3 /Qopt-report-phase:hlo sample.cpp |
You can use -opt-report-file (Linux and Mac OS X) or /Qopt-report-file (Windows) to specify an output file to capture the report results. Specifying a file to capture the results can help to reduce the time you spend analyzing the results and can provide a baseline for future testing.
The report provides information using a specific format. The report format for Windows* is different from the format on Linux* and Mac OS* X. While there are some common elements in the report output, the best way to understand what kinds of advice the report can provide is to show example code and the corresponding report output.
Example 1: This example illustrates the condition where a function call is inside a loop.
Example 1 |
---|
void bar (int *A, int **B); int foo (int *A, int **B, int N) { int i, j; for (j=0; j<N; j++) { for (i=0; i<N; i++) { B[i][j] += A[j]; bar(A,B); } } return 1; } |
Regardless of the operating system, the reports list optimization results on specific functions by presenting a line above the reported action. The line format and description are included below.
The following table summarizes the common report elements and provides a general description to help interpret the results.
Report Element | Description |
---|---|
String listing information about the function being reported on. The string uses the following format. <source name>;<start line>;<end line>;<optimization>; <function name>;<element type> For example, the reports listed below report the following information: Linux and Mac OS X: <sample1.c;-1:-1;hlo;foo;0> Windows: <sample1.c;-1:-1;hlo;_foo;0> |
The compact string contains the following information:
|
Several report elements grouped together. QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2 LINEAR HLO EXPRESSIONS: 17 / 18 |
Windows only: This section of the report lists the following information:
|
The code sample listed above will result in a report output similar to the following.
Operating System | Example 1 Report Output |
---|---|
Linux and Mac OS X | <sample1.c;-1:-1;hlo;foo;0> High Level Optimizer Report (foo) Block, Unroll, Jam Report: (loop line numbers, unroll factors and type of transformation) <sample1.c;7:7;hlo_unroll;foo;0> Loop at line 7 unrolled with remainder by 2 |
Windows | <sample1.c;-1:-1;hlo;_foo;0> High Level Optimizer Report (_foo) QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2 LINEAR HLO EXPRESSIONS: 17 / 18 ------------------------------------------------------------------------------ <sample1.c;6:6;hlo_linear_trans;_foo;0> Loop Interchange not done due to: User Function Inside Loop Nest Advice: Loop Interchange, if possible, might help Loopnest at lines: 6 7 : Suggested Permutation: (1 2 ) --> ( 2 1 ) |
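The report names the call inside the nest as the obstacle. One way to act on that advice, sketched here under a strong assumption, is to move the call into its own loop so the arithmetic nest is call-free and can be interchanged by hand. This is only valid if the called function neither reads nor writes the arrays between iterations. The stand-in `bar`, the counter `bar_calls`, the names `foo_original` and `foo_split`, and the flat row-major layout of `B` are all hypothetical and not part of the sample.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for bar(): it only counts calls and never touches
   A or B. That independence is the assumption that makes the split legal. */
static long bar_calls = 0;
static void bar(const int *A, const int *B) { (void)A; (void)B; bar_calls++; }

/* Original shape, as in Example 1: the call sits inside the loop nest.
   B is a flat row-major n*n array here (the sample uses int **). */
void foo_original(const int *A, int *B, int n) {
    assert(n >= 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            B[(size_t)i * n + j] += A[j];
            bar(A, B);
        }
    }
}

/* Split shape: the arithmetic nest is call-free and interchanged so the
   inner loop walks B with unit stride; the calls run in a separate nest. */
void foo_split(const int *A, int *B, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            B[(size_t)i * n + j] += A[j];
        }
    }
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            bar(A, B);
        }
    }
}
```

Both versions call `bar` the same number of times and leave `B` with the same contents; whether the split is acceptable for a real `bar` has to be checked against what that function actually does.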
Example 2: This example illustrates the condition where the loop nesting prohibits interchange.
Example 2 |
---|
int foo (int *A, int **B, int N) { int i, j; for (j=0; j<N; j++) { A[j] = i + B[i][1]; for (i=0; i<N; i++) { B[i][j] += A[j]; } } return 1; } |
The code sample listed above will result in a report output similar to the following.
Operating System | Example 2 Report Output |
---|---|
Linux and Mac OS X | <sample2.c;-1:-1;hlo;foo;0> High Level Optimizer Report (foo) <sample2.c;7:7;hlo_scalar_replacement;in foo;0> #of Array Refs Scalar Replaced in foo at line 7=2 #of Array Refs Scalar Replaced in foo at line 7=1 Block, Unroll, Jam Report: (loop line numbers, unroll factors and type of transformation) <sample2.c;7:7;hlo_unroll;foo;0> Loop at line 7 unrolled with remainder by 2 |
Windows | <sample2.c;-1:-1;hlo;_foo;0> High Level Optimizer Report (_foo) QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2 LINEAR HLO EXPRESSIONS: 22 / 27 ------------------------------------------------------------------------------ <sample2.c;7:7;hlo_scalar_replacement;in _foo;0> #of Array Refs Scalar Replaced in _foo at line 7=1 <sample2.c;5:5;hlo_linear_trans;_foo;0> Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations) Advice: Loop Interchange, if possible, might help Loopnest at lines: 5 7 : Suggested Permutation: (1 2 ) --> ( 2 1 ) |
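An imperfect nest can sometimes be made perfect by distributing the statement that sits between the two loop headers into a loop of its own. The sketch below shows that idea on a simplified variant of Example 2, not on the sample itself: the statement fills `A[j]` from a separate array `C`, so it does not depend on anything the inner loop writes. The function names, the array `C`, and the flat row-major layout of `B` are assumptions made for illustration, and the arrays are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Imperfect nest: the assignment to A[j] sits between the loop headers. */
void nest_imperfect(int *A, int *B, const int *C, int n) {
    assert(n >= 0);
    for (int j = 0; j < n; j++) {
        A[j] = j + C[j];
        for (int i = 0; i < n; i++) {
            B[(size_t)i * n + j] += A[j];
        }
    }
}

/* Distributed: A is filled in its own loop, which leaves a perfect nest.
   That nest is then interchanged by hand for unit-stride access to B. */
void nest_distributed(int *A, int *B, const int *C, int n) {
    assert(n >= 0);
    for (int j = 0; j < n; j++) {
        A[j] = j + C[j];
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            B[(size_t)i * n + j] += A[j];
        }
    }
}
```

In the actual Example 2 the statement reads `B`, which the inner loop also updates, so the same rewrite would need a dependence check before it could be applied there.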
Example 3: This example illustrates the condition where data dependence prohibits loop interchange.
Example 3 |
---|
int foo (int **A, int **B, int **C, int N) { int i, j; for (j=0; j<N; j++) { for (i=0; i<N; i++) { A[i][j] = C[i][j] * 2; B[i][j] += A[i][j] * C[i][j]; } } return 1; } |
The code sample listed above will result in a report output similar to the following.
Operating System | Example 3 Report Output |
---|---|
Linux and Mac OS X | <sample3.c;-1:-1;hlo;foo;0> High Level Optimizer Report (foo) Block, Unroll, Jam Report: (loop line numbers, unroll factors and type of transformation) <sample3.c;6:6;hlo_unroll;foo;0> Loop at line 6 unrolled with remainder by 2 |
Windows | <sample3.c;-1:-1;hlo;_foo;0> High Level Optimizer Report (_foo) QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2 LINEAR HLO EXPRESSIONS: 36 / 36 ------------------------------------------------------------------------------ <sample3.c;5:5;hlo_linear_trans;_foo;0> Loop Interchange not done due to: Data Dependencies Dependencies found between following statements: [From_Line# -> (Dependency Type) To_Line#] [7 ->(Anti) 8] [7 ->(Flow) 8] [7 ->(Output) 8] [7 ->(Flow) 7] [7 ->(Anti) 7] [7 ->(Output) 7] Advice: Loop Interchange, if possible, might help Loopnest at lines: 5 6 : Suggested Permutation: (1 2 ) --> ( 2 1 ) |
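One plausible reading of these dependencies, which the report itself does not confirm, is that the compiler cannot prove that the memory reached through the three `int **` arguments does not overlap. If the caller can guarantee that, the guarantee can be stated in the source. The following is a minimal sketch of the Example 3 arithmetic using C99 `restrict` on flat row-major arrays, with the loops already interchanged by hand. The function name `scale_accumulate` and the flat layout are assumptions, not part of the sample.

```c
#include <assert.h>
#include <stddef.h>

/* Example 3 arithmetic on flat row-major n*n arrays. The restrict
   qualifiers are a promise by the caller that A, B and C do not overlap;
   breaking that promise is undefined behavior. */
void scale_accumulate(int *restrict A, int *restrict B,
                      const int *restrict C, int n) {
    assert(n >= 0);
    /* i is the outer loop, so the inner loop runs with unit stride. */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            size_t k = (size_t)i * n + j;
            A[k] = C[k] * 2;
            B[k] += A[k] * C[k];
        }
    }
}
```

Each iteration touches only its own element of each array, so once overlap is ruled out the loop order does not change the result.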
Example 4: This example illustrates the condition where the loop order was determined to be proper, but loop interchange might offer only marginal relative improvement. To compile this code, add the -restrict (Linux and Mac OS X) or /Qrestrict (Windows) option to the other options when generating the report.
Example 4 |
---|
int foo (int ** restrict A, int ** restrict B, int N) { int i, j, value; for (j=0; j<N; j++) { for (i=0; i<N; i++) { A[j][i] += B[i][j]; } } value = A[1][1]; return value; } |
The code sample listed above will result in a report output similar to the following.
Operating System | Example 4 Report Output |
---|---|
Linux and Mac OS X | <sample4.c;-1:-1;hlo;foo;0> High Level Optimizer Report (foo) Block, Unroll, Jam Report: (loop line numbers, unroll factors and type of transformation) <sample4.c;6:6;hlo_unroll;foo;0> Loop at line 6 unrolled with remainder by 2 |
Windows | <sample4.c;-1:-1;hlo;_foo;0> High Level Optimizer Report (_foo) QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2 LINEAR HLO EXPRESSIONS: 18 / 18 |
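The Linux and Mac OS X reports repeatedly contain the line "unrolled with remainder by 2". A reasonable reading is an unroll factor of 2 plus a remainder loop that handles the leftover iteration when the trip count is odd. The sketch below writes that shape by hand for a simple element-wise update; the function name `add_unrolled2` is hypothetical, and the code the compiler actually generates may differ.

```c
#include <assert.h>

/* dst[i] += src[i] for i in [0, n), written as a hand-unrolled loop. */
void add_unrolled2(int *dst, const int *src, int n) {
    assert(n >= 0);
    int i = 0;
    /* Main loop: two copies of the original body per trip. */
    for (; i + 1 < n; i += 2) {
        dst[i] += src[i];
        dst[i + 1] += src[i + 1];
    }
    /* Remainder loop: runs once when n is odd, not at all otherwise. */
    for (; i < n; i++) {
        dst[i] += src[i];
    }
}
```

The elements are still processed in their original order, so the result matches the plain loop for any `n`.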
Example 5: This example illustrates the conditions where the loop nesting was imperfect and the loop order was good, but loop interchange would offer marginal relative improvements. To compile this code, add the -restrict (Linux and Mac OS X) or /Qrestrict (Windows) option to the other options when generating the report.
Example 5 |
---|
int foo (int ** restrict A, int ** restrict B, int ** restrict C, int N) { int i, j, sum; for (j=0; j<N; j++) { sum += A[1][1]; for (i=0; i<N; i++) { sum = B[j][i] + C[i][j]; } } return sum; } |
The code sample listed above will result in a report output similar to the following.
Operating System | Example 5 Report Output |
---|---|
Linux and Mac OS X | <sample5.c;-1:-1;hlo;foo;0> High Level Optimizer Report (foo) |
Windows | <sample5.c;-1:-1;hlo;_foo;0> High Level Optimizer Report (_foo) QLOOPS 2/2 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2 LINEAR HLO EXPRESSIONS: 16 / 19 ------------------------------------------------------------------------------ <sample5.c;5:5;hlo_linear_trans;_foo;0> Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations) Advice: Loop Interchange, if possible, might help Loopnest at lines: 5 7 : Suggested Permutation: (1 2 ) --> ( 2 1 ) |
Example 6: This example illustrates the condition where perfect and imperfect loop nesting exists; however, the correctly nested loop contains a data dependency.
Example 6 |
---|
int foo (int ***A, int ***B, int **C, int N) { int q, i, j, k; q = 0; while ( A[q][0][0] != 0) { for (j=0; j<N; j++) { A[j][0][0] = j + B[j][0][0]; for (i=0; i<N; i++) { for (k=0; k<N; k++) { B[k][i][j] += A[j][0][0] + C[i][j]; } } } A[q][0][0] = B[0][q][0] + 5; } return 1; } |
The code sample listed above will result in a report output similar to the following.
Operating System | Example 6 Report Output |
---|---|
Linux and Mac OS X | <sample6.c;-1:-1;hlo;foo;0> High Level Optimizer Report (foo) Block, Unroll, Jam Report: (loop line numbers, unroll factors and type of transformation) <sample6.c;9:9;hlo_unroll;foo;0> Loop at line 9 unrolled with remainder by 2 |
Windows | <sample6.c;-1:-1;hlo;_foo;0> High Level Optimizer Report (_foo) QLOOPS 2/4 ENODE LOOPS 2 unknown 0 multi_exit_do 0 do 2 linear_do 2 LINEAR HLO EXPRESSIONS: 34 / 34 ------------------------------------------------------------------------------ <sample6.c;8:8;hlo_linear_trans;_foo;0> Loop Interchange not done due to: Data Dependencies Dependencies found between following statements: [From_Line# -> (Dependency Type) To_Line#] [10 ->(Flow) 10] [10 ->(Anti) 10] [10 ->(Output) 10] Advice: Loop Interchange, if possible, might help Loopnest at lines: 8 9 : Suggested Permutation: (1 2 ) --> ( 2 1 ) |
The HLO report tells you what loop transformations the compiler performed and provides some advice. When a given loop transformation is missing from the report, that might indicate a transformation the compiler could attempt if the source code were changed to allow it. The following list suggests some transformations you might want to apply. (Manual optimization techniques, like manual cache blocking, should be avoided or used only as a last resort.)
Loop Interchanging - Swap the execution order of two nested loops to gain a cache locality or unit-stride access performance advantage.
Distributing - Distribute or split up one large loop into two smaller loops. This strategy might provide an advantage when too many registers are being consumed in a large loop.
Fusing - Fuse two smaller loops with the same trip count together to improve data locality.
Loop Blocking - Use cache blocking to arrange a loop so it will perform as many computations as possible on data already residing in cache. (The next block of data is not read into cache until computations using the first block are finished.)
Unrolling - Unrolling is a way of partially disassembling a loop structure so that fewer iterations of the loop are required; however, each resulting loop iteration is larger. Unrolling can be used to hide instruction and data latencies, to take advantage of floating-point load-pair instructions, and to increase the ratio of real work done per memory operation.
Prefetching - Request the compiler to bring data in from relatively slow memory to a faster cache several loop iterations ahead of when the data is actually needed.
Load Pairing - Use an instruction to bring two floating point data elements in from memory in a single step.
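As an illustration of the loop blocking item in the list above, the following sketch applies blocking to the Example 4 access pattern, `A[j][i] += B[i][j]`, where no loop order gives unit stride for both arrays. It is a hand-written illustration only, in line with the caution that manual blocking is a last resort. The function name `add_transposed_blocked`, the tile edge `BLOCK`, and the flat row-major layout are assumptions; a useful tile size depends on the target cache and would need measuring.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK = 32 };  /* hypothetical tile edge; tune for the target cache */

/* A[j][i] += B[i][j] on flat row-major n*n arrays, processed tile by tile
   so that each tile of A and B is reused while it is likely still cached. */
void add_transposed_blocked(int *restrict A, const int *restrict B, int n) {
    assert(n >= 0);
    for (int jj = 0; jj < n; jj += BLOCK) {
        for (int ii = 0; ii < n; ii += BLOCK) {
            /* Clamp the tile bounds when n is not a multiple of BLOCK. */
            int jmax = (jj + BLOCK < n) ? jj + BLOCK : n;
            int imax = (ii + BLOCK < n) ? ii + BLOCK : n;
            for (int j = jj; j < jmax; j++) {
                for (int i = ii; i < imax; i++) {
                    A[(size_t)j * n + i] += B[(size_t)i * n + j];
                }
            }
        }
    }
}
```

Every (i, j) pair is visited exactly once, so the result matches the unblocked nest; only the visiting order changes.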