Software Pipelining (SWP) Report (Linux* and Windows*)

The SWP report provides detailed information about loops that currently take advantage of software pipelining, which is available on IA-64 architecture based systems. The report can also suggest reasons why loops are not being pipelined.

The following command syntax examples demonstrate how to generate an SWP report for the Itanium® Compiler Code Generator (ECG) Software Pipeliner (SWP).

Operating System	Syntax Examples
Linux*	`icpc -c -opt-report -opt-report-phase=ecg_swp sample.cpp`
Windows*	`icl /c /Qopt-report /Qopt-report-phase:ecg_swp sample.cpp`

where -c (Linux) or /c (Windows) tells the compiler to stop after generating the object code (no linking occurs), -opt-report (Linux) or /Qopt-report (Windows) invokes the report generator, and -opt-report-phase=ecg_swp (Linux) or /Qopt-report-phase:ecg_swp (Windows) selects the phase (ecg_swp) for which to generate the report.

You can use -opt-report-file (Linux) or /Qopt-report-file (Windows) to specify an output file to capture the report results. Specifying a file to capture the results can help to reduce the time you spend analyzing the results and can provide a baseline for future testing.

Typically, the report contains a line for each software-pipelined loop indicating that the compiler scheduled the loop for SWP. If the -O3 (Linux) or /O3 (Windows) option is specified, the SWP report is merged with the loop transformation summary produced by the loop optimizer.

Some loops will not software pipeline (SWP), and others will not vectorize, if function calls are embedded inside the loops. One way to get these loops to software pipeline or vectorize is to inline the functions using IPO.
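
As a minimal sketch of the inlining idea, the two functions below compute the same result. In the first, the loop body calls a helper; in the second, the helper's body is written directly into the loop, which is roughly the effect IPO inlining has. The names `scale_add`, `combine_call`, and `combine_inlined` are invented for this illustration. Whether either loop is actually software pipelined depends on the compiler version and options, so confirm with the SWP report rather than assuming.

```c
/* Hypothetical helper. Imagine it is defined in a different source file,
   so the compiler cannot see its body without IPO. */
double scale_add(double x, double y, double s)
{
    return x + s * y;
}

/* The loop body contains a function call, which can block SWP. */
void combine_call(double *restrict out, const double *x, const double *y,
                  double s, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = scale_add(x[i], y[i], s);
    }
}

/* Same computation with the call inlined by hand. */
void combine_inlined(double *restrict out, const double *x, const double *y,
                     double s, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = x[i] + s * y[i];
    }
}
```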

You can compile the following example code to generate a sample SWP report, but you must compile it with the -c -restrict (Linux) or /c /Qrestrict (Windows) options. The sample report is shown below.

Example

#define NUM 1024

void multiply_d(double a[][NUM], double b[][NUM], double c[restrict][NUM]){
  int i,j,k;
  double temp;
  for(i=0;i<NUM;i++) {
    for(j=0;j<NUM;j++) {
      for(k=0;k<NUM;k++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }
}

The following sample report shows the output that results from compiling the example code shown above with the ecg_swp phase.

Sample SWP Report

Swp report for loop at line 8 in _Z10multiply_dPA1024_dS0_S0_ in file SWP report.cpp
Resource II = 2
Recurrence II = 2
Minimum II = 2
Scheduled II = 2
Estimated GCS II = 7
Percent of Resource II needed by arithmetic ops = 100%
Percent of Resource II needed by memory ops = 50%
Percent of Resource II needed by floating point ops = 50%
Number of stages in the software pipeline = 6

Reading the Reports

To understand the SWP report results, you must know something about the terminology used and the related concepts. The following table describes some of the terminology used in the SWP report.

Term	Definition
II	Initiation Interval (II). The number of cycles between the start of one iteration and the start of the next in the software-pipelined loop. The presence of the term II in any SWP report indicates that SWP succeeded for the loop in question. If you also know the number of iterations, you can use II to quickly estimate how many cycles the loop will take: the total is approximately N * Scheduled II + Number of stages, where N is the number of iterations of the loop. This is an approximation because it does not take into account the ramp-up and ramp-down of the prolog and epilog of the SWP, and only considers the kernel of the SWP loop. As you modify your code, it is generally better to see Scheduled II go down, although N * Scheduled II + Number of stages in the software pipeline is ultimately the figure of merit.
Resource II	Resource II indicates what the Initiation Interval should be when considering the number of functional units available.
Recurrence II	Recurrence II indicates what the Initiation Interval should be when there is a recurrence relationship in the loop. A recurrence relationship is a particular kind of data dependency, called a flow dependency, such as `a[i] = a[i-1]`, where `a[i]` cannot be computed until `a[i-1]` is known. If Recurrence II is non-zero and there is no flow dependency in the code, this indicates either non-unit stride access or memory aliasing. See Helping the Compiler for more information.
Minimum II	Minimum II is the theoretical minimum Initiation Interval that could be achieved.
Scheduled II	Scheduled II is what the compiler actually scheduled for the SWP.
number of stages	Indicates the number of stages in the software pipeline. For example, a report line reading "Number of stages in the software pipeline = 3" indicates three stages of work, which appear in the assembly as a load, an FMA instruction, and a store.
loop-carried memory dependence edges	Indicates that the compiler avoided a WAR (Write After Read) dependency. Loop-carried memory dependence edges can indicate problems with memory aliasing. See Helping the Compiler.
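
The cycle estimate described for II can be worked through in a few lines of C. This is only a sketch of the approximation given in the table; the function name `estimate_swp_cycles` is made up for this example, and the trip count of 1024 assumes the innermost loop of the sample code (NUM iterations).

```c
/* Approximate cycle count for a software-pipelined loop:
   N * Scheduled II + number of stages.
   Hypothetical helper, for illustration only. */
long estimate_swp_cycles(long iterations, long scheduled_ii, long stages)
{
    return iterations * scheduled_ii + stages;
}
```

With the sample report values (Scheduled II = 2 and 6 stages) and 1024 iterations, `estimate_swp_cycles(1024, 2, 6)` gives 1024 * 2 + 6 = 2054 cycles for one execution of the innermost loop.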

Using the Report to Resolve Issues

One fast way to determine if specific loops have been software pipelined is to look for "Number of stages in the software pipeline" in the report; the phrase indicates that software pipelining for the associated loop was successfully applied.

Analyze the loops that did not software pipeline to determine how to enable SWP. If the compiler reports "Loop was not SWP because...", see the following table for suggestions about how to correct possible problems:

Message in Report	Suggested Action
acyclic global scheduler can achieve a better schedule: => loop not pipelined	Indicates that the most likely cause is memory aliasing. For memory alias problems, see memory aliasing (restrict, `#pragma ivdep`). Might also indicate that the application is accessing memory in a non-unit stride fashion. Non-unit stride issues may be indicated by an artificially high Recurrence II: if you know there is no recurrence relationship (such as `a[i] = a[i-1] + b[i]`) in the loop, then a high Recurrence II (greater than 0) is a sign that you are accessing memory with non-unit stride. Rearranging the code, perhaps with a loop interchange, might help mitigate this problem.
Loop body has a function call	Indicates inlining the function might help solve the problem.
Not enough static registers	Indicates you should distribute the loop by separating it into two or more loops. On IA-64 architecture based systems you may use `#pragma distribute point`.
Not enough rotating registers	Indicates the loop-carried values use the rotating registers. Distribute the loop. On IA-64 architecture based systems you may use `#pragma distribute point`.
Loop too large	Indicates you should distribute the loop. On IA-64 architecture based systems you may use `#pragma distribute point`.
Loop has a constant trip count < 4	Indicates unrolling was insufficient. Attempt to fully unroll the loop. However, with small loops fully unrolling the loop is not likely to affect performance significantly.
Too much flow control	Indicates complex loop structure. Attempt to simplify the loop.
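
To illustrate the loop interchange suggested for non-unit stride access, here is a sketch based on the matrix multiply example from this section. In the original i-j-k order, the innermost loop walks `b[k][j]` down a column, so consecutive iterations touch memory a full row apart. Swapping the j and k loops makes the innermost loop walk `b` and `c` along rows. The function names and the small value of `N` are assumptions for this illustration, and whether the interchanged loop pipelines better should be confirmed with the SWP report.

```c
#define N 8  /* small hypothetical size; the example in this section uses 1024 */

/* Original order: the innermost k loop reads b[k][j] with a stride of N. */
void multiply_ijk(double a[][N], double b[][N], double c[restrict][N])
{
    int i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}

/* Interchanged order: the innermost j loop reads b[k][j] and updates
   c[i][j] with unit stride, and a[i][k] is constant within that loop. */
void multiply_ikj(double a[][N], double b[][N], double c[restrict][N])
{
    int i, j, k;
    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            for (j = 0; j < N; j++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}
```

Both versions accumulate the products for each `c[i][j]` in the same order of k, so they produce the same results; only the memory access pattern of the innermost loop changes.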

The type of the loop index variable can greatly affect performance. In some cases, using loop index variables of type short or unsigned int can prevent software pipelining. If the report indicates performance problems in loops where the index variable is not of type int, and there are no other obvious causes, try changing the loop index variable to type int.
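
As a sketch of the index-type advice, the two hypothetical functions below differ only in the type of the loop index. They compute the same values; the point is that, following the guidance above, the `int` version is the one to try if the `short` version does not software pipeline. The function names are invented for this example, and the actual effect should be verified in the SWP report for your compiler version.

```c
/* Loop index of type short: in some cases this can prevent SWP. */
void double_values_short(double *restrict out, const double *in, short n)
{
    short i;
    for (i = 0; i < n; i++) {
        out[i] = 2.0 * in[i];
    }
}

/* Same loop with an int index, the form recommended above. */
void double_values_int(double *restrict out, const double *in, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = 2.0 * in[i];
    }
}
```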