The information in this topic assumes that you are using a performance optimization methodology and have analyzed the application type you are optimizing.
After profiling your application to determine where best to spend your time, attempt to discover what optimizations the compiler has applied and what limitations it has imposed. Use the compiler reports to determine what to try next.
Depending on what you discover from the reports, you may be able to help the compiler, through options, directives, and slight code modifications, take advantage of key architectural features and achieve the best performance.
The compiler reports can describe what actions have been taken and what actions cannot be taken, based on the assumptions made by the compiler. Experimenting with options and directives allows you to apply your understanding of those assumptions and try a new optimization strategy or technique.
You can help the compiler in some important ways:
Read the appropriate reports to gain an understanding of what the compiler is doing for you and the assumptions the compiler has made with respect to your code.
Use specific options, intrinsics, libraries, and directives to get the best performance from your application.
Use the Math Kernel Library (MKL) or call Fortran 90 intrinsics instead of writing your own code (see the sketch following this list).
See Applying Optimization Strategies for other suggestions.
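For example, a hand-written matrix multiply can often be replaced by the Fortran 90 MATMUL intrinsic, or by a tuned library routine such as MKL's SGEMM/DGEMM. The following minimal sketch uses illustrative names and sizes only; it is not taken from the examples later in this topic.

    ! Minimal sketch: let the MATMUL intrinsic (or an optimized MKL
    ! routine) perform the matrix product instead of hand-written
    ! nested loops. The program name, array names, and size n are
    ! illustrative only.
    program intrinsic_matmul
      implicit none
      integer, parameter :: n = 100
      real :: a(n,n), b(n,n), c(n,n)

      call random_number(a)
      call random_number(b)

      ! One intrinsic call replaces the three nested loops.
      c = matmul(a, b)

      print *, 'c(1,1) = ', c(1,1)
    end program intrinsic_matmul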
Memory aliasing is the single largest issue affecting optimization by the Intel® compiler for IA-64 architecture-based systems. Memory aliasing is writing to a given memory location through more than one pointer. The compiler is careful not to optimize too aggressively in these cases; if the compiler optimizes too aggressively, unpredictable behavior can result (for example, incorrect results or abnormal termination).
Because the compiler usually optimizes on a module-by-module, function-by-function basis, it does not have an overall view of how global variables, or variables passed into a function, are used; therefore, the compiler usually assumes that any pointers passed into a function are likely to be aliased. The compiler makes this assumption even for pointers you know are not aliased. As a result, perfectly safe loops do not get pipelined or vectorized, and performance suffers.
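As a hedged illustration, consider a routine like the hypothetical one below. Because the two POINTER dummy arguments may legally refer to overlapping storage, the compiler must assume they alias unless you tell it otherwise, and the loop may not be pipelined or vectorized.

    ! Hypothetical sketch: dst and src are POINTER dummy arguments,
    ! so they may refer to overlapping memory. Without further
    ! information the compiler assumes they alias and may not
    ! pipeline or vectorize this loop.
    subroutine scale_copy(dst, src, n)
      implicit none
      integer, intent(in) :: n
      real, pointer :: dst(:), src(:)
      integer :: i

      do i = 1, n
        dst(i) = 2.0 * src(i)
      end do
    end subroutine scale_copy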
There are several ways to instruct the compiler that pointers are not aliased:
Use a comprehensive compiler option, such as -fno-alias (Linux*) or /Oa (Windows*). These options instruct the compiler that no pointers in any module are aliased, placing the responsibility of program correctness directly with the developer.
Use a less comprehensive option, like -fno-fnalias (Linux) or /Ow (Windows). These options instruct the compiler that no pointers passed through function arguments are aliased.
Function arguments are a common example of potential aliasing that you can clarify for the compiler. You may know that the arguments passed to a function do not alias each other, but the compiler is forced to assume that they do. Using these options tells the compiler it is safe to assume that function arguments are not aliased. This is still a somewhat bold statement to make, because it affects all functions in the module(s) compiled with the -fno-fnalias (Linux) or /Ow (Windows) option.
Use the IVDEP directive. Alternatively, you might use a directive that applies to a specified loop in a function; this is more precise than an option that affects an entire function or module. The IVDEP directive asserts that, for a given loop, there are no vector dependencies. Essentially, this is the same as saying that no pointers are aliased in that loop; see the sketch that follows.
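For example, in free-form source the directive is typically written as !DIR$ IVDEP on the line immediately before the loop it applies to. The routine below is a hypothetical sketch; the accepted directive prefix can vary by compiler version, so check the compiler documentation.

    ! Hypothetical sketch: IVDEP asserts that the loop that follows
    ! has no vector dependencies, so the pointer arguments can be
    ! treated as unaliased for this one loop only.
    subroutine scaled_add(y, x, a, n)
      implicit none
      integer, intent(in) :: n
      real, intent(in) :: a
      real, pointer :: x(:), y(:)
      integer :: i

    !DIR$ IVDEP
      do i = 1, n
        y(i) = y(i) + a * x(i)
      end do
    end subroutine scaled_add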
Another issue that can have a considerable impact on performance is accessing memory in a non-unit-stride fashion. This means that as your inner loop index increments consecutively, you access memory from non-adjacent locations. For example, consider the following matrix multiplication code:
Example:

    ! Non-Unit Stride Memory Access
    subroutine non_unit_stride_memory_access(a, b, c, NUM)
      implicit none
      integer :: i, j, k, NUM
      real :: a(NUM,NUM), b(NUM,NUM), c(NUM,NUM)
      ! loop before loop interchange
      do i = 1, NUM
        do j = 1, NUM
          do k = 1, NUM
            c(j,i) = c(j,i) + a(j,k) * b(k,i)
          end do
        end do
      end do
    end subroutine non_unit_stride_memory_access
Notice that, as the innermost loop over k increments, b(k,i) accesses consecutive memory locations (Fortran stores arrays in column-major order, so the first subscript varies fastest) and c(j,i) refers to the same location throughout the inner loop. The a array, however, is indexed by k in its second subscript, so it does not access memory with unit stride. When the loop reads a(1,1) and the k loop then increments by one to a(1,2), the access jumps ahead by NUM memory locations, skipping past a(2,1), a(3,1), ..., a(NUM,1).
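To make the storage order concrete, the following small sketch (program and variable names are illustrative) numbers the elements of a 3x3 array in the order they are laid out in memory; the first subscript varies fastest, which is why stepping the second subscript in an inner loop skips over NUM locations at a time.

    ! Illustrative sketch: Fortran arrays are stored in column-major
    ! order. Numbering the elements in storage order shows that
    ! mat(1,1), mat(2,1), mat(3,1), mat(1,2), ... are adjacent in
    ! memory, while mat(1,1) -> mat(1,2) is a jump of num elements.
    program column_major_order
      implicit none
      integer, parameter :: num = 3
      integer :: mat(num,num), j, k

      ! Fill mat with 1..9 in storage (memory) order.
      mat = reshape( (/ (j, j = 1, num*num) /), (/ num, num /) )

      do k = 1, num
        print '(3i4)', (mat(j,k), j = 1, num)
      end do
    end program column_major_order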
Loop interchange (a loop transformation that changes the nesting order of the loops) helps to address this problem. While the compiler is capable of performing loop interchange automatically, it does not always recognize the opportunity.
The memory access pattern for the example code listed above is illustrated in the following figure:
Assume you modify the example code listed above by making the following changes to introduce loop interchange:
Example:

    subroutine unit_stride_memory_access(a, b, c, NUM)
      implicit none
      integer :: i, j, k, NUM
      real :: a(NUM,NUM), b(NUM,NUM), c(NUM,NUM)
      ! loop after interchange
      do i = 1, NUM
        do k = 1, NUM
          do j = 1, NUM
            c(j,i) = c(j,i) + a(j,k) * b(k,i)
          end do
        end do
      end do
    end subroutine unit_stride_memory_access
After the loop interchange, the memory access pattern might look like the following figure: