Tuning Performance

This section describes several programming guidelines that can help you improve the performance of floating-point applications, including:

Handling Floating-point Array Operations in a Loop Body

Following the guidelines in this section helps the compiler auto-vectorize such loops.

Reducing the Impact of Subnormal Exceptions

Subnormal floating-point values are those that are too small to be represented in the normal manner; that is, the mantissa cannot be left-justified. Subnormal values require hardware or operating system interventions to handle the computation, so floating-point computations that result in subnormal values may have an adverse impact on performance.
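To see what a subnormal value looks like, the following sketch (the program name and variable names are illustrative) uses the standard IEEE_ARITHMETIC module to print the smallest normal REAL value and a value just below it. If FTZ or DAZ is already in effect (see below), the subnormal value may print as zero.

PROGRAM SHOW_SUBNORMAL
  USE, INTRINSIC :: IEEE_ARITHMETIC
  IMPLICIT NONE
  REAL :: SMALLEST_NORMAL, BELOW_NORMAL

  SMALLEST_NORMAL = TINY(1.0)           ! smallest positive normal REAL
  BELOW_NORMAL = SMALLEST_NORMAL / 2.0  ! representable only as a subnormal

  PRINT *, 'Smallest normal REAL:    ', SMALLEST_NORMAL
  PRINT *, 'Value below it:          ', BELOW_NORMAL
  PRINT *, 'Classified as subnormal: ', &
           IEEE_CLASS(BELOW_NORMAL) == IEEE_POSITIVE_DENORMAL
END PROGRAM SHOW_SUBNORMAL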

There are several ways to handle subnormals to increase the performance of your application: scale the values into the normalized range, use a data type with higher precision and a greater dynamic range, or flush subnormals to zero.

For example, you can translate subnormals to normalized numbers by multiplying them by a large scale factor, doing the remaining computations in the normal range, and then scaling the results back down to the subnormal range. Consider this method when retaining the small subnormal values benefits the program design.
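A minimal sketch of this scaling approach; the array contents, the power-of-two scale factor, and the computation are illustrative placeholders rather than values taken from this document:

PROGRAM SCALE_OUT_OF_SUBNORMAL_RANGE
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  ! A power of two changes only the exponent, so scaling loses no precision.
  REAL, PARAMETER :: SCALE_UP = 2.0**24, SCALE_DOWN = 2.0**(-24)
  REAL :: X(N), Y(N)
  INTEGER :: I

  X = TINY(1.0)        ! assume the inputs are very small values

  ! Scale the inputs into the normal range, compute there,
  ! then scale the results back down to their original magnitude.
  DO I = 1, N
     Y(I) = (X(I) * SCALE_UP) * 0.5   ! computation stays in the normal range
  END DO
  Y = Y * SCALE_DOWN

  PRINT *, Y(1)
END PROGRAM SCALE_OUT_OF_SUBNORMAL_RANGE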

Another strategy that might result in increased performance is to increase the precision of intermediate values using the -fp-model [double|extended] option. However, this strategy might not eliminate all subnormal exceptions, so you must experiment with the performance of your application. You should verify that the gain in performance from eliminating subnormals is greater than the overhead of using a data type with higher precision and greater dynamic range. If you change the type declaration of a variable, you might also need to change associated library calls, unless these are generic.
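If you prefer an explicit source change over (or in addition to) the option, one way to widen an intermediate is shown in the sketch below; the variable names and magnitudes are made up for illustration. A product of two small REAL values that would underflow into the subnormal range stays normal when computed in DOUBLE PRECISION:

PROGRAM WIDER_INTERMEDIATE
  IMPLICIT NONE
  REAL :: A, B, C
  DOUBLE PRECISION :: PRODUCT_D

  A = 1.0E-20
  B = 1.0E-20
  C = 1.0E+30

  ! A * B (1.0E-40) is subnormal in REAL but a normal value
  ! in DOUBLE PRECISION.
  PRODUCT_D = DBLE(A) * DBLE(B)

  ! The final result is back in the normal REAL range.
  PRINT *, REAL(PRODUCT_D * C)
END PROGRAM WIDER_INTERMEDIATE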

In many cases, subnormal numbers can be treated safely as zero without adverse effects on program results. Depending on the target architecture, use flush-to-zero (FTZ) options.

IA-32 and Intel® 64 Architectures

IA-32 and Intel® 64 architectures take advantage of the FTZ (flush-to-zero) and DAZ (denormals-are-zero) capabilities of Intel® Streaming SIMD Extensions (Intel® SSE) instructions.

By default, the Intel® Fortran Compiler inserts code into the main routine to enable FTZ and DAZ at optimization levels higher than O0. To enable FTZ and DAZ at O0, compile the source file containing PROGRAM using compiler option [Q]ftz. When the [Q]ftz option is used on IA-32-based systems with the option -mia32 (Linux*) or /arch:IA32 (Windows*), the compiler inserts code to conditionally enable FTZ and DAZ flags based on a run-time processor check. IA-32 is not available on macOS*.
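Independently of the compiler option, standard Fortran (2003 and later) also lets a program request abrupt underflow at run time where the processor supports it. The following is a minimal sketch of that approach, not the mechanism the compiler itself inserts:

PROGRAM REQUEST_ABRUPT_UNDERFLOW
  USE, INTRINSIC :: IEEE_ARITHMETIC
  IMPLICIT NONE
  REAL :: X

  ! Gradual underflow produces subnormals; abrupt underflow flushes
  ! results that would be subnormal to zero.  Support is processor
  ! dependent, so check before switching modes.
  IF (IEEE_SUPPORT_UNDERFLOW_CONTROL(X)) THEN
     CALL IEEE_SET_UNDERFLOW_MODE(GRADUAL=.FALSE.)
  END IF

  X = TINY(1.0) / 2.0   ! subnormal under gradual underflow
  PRINT *, X            ! prints zero if abrupt underflow is in effect
                        ! and the division is performed at run time
END PROGRAM REQUEST_ABRUPT_UNDERFLOW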

Note

After using flush-to-zero, ensure that your program still gives correct results when treating subnormal values as zero.

Avoiding Mixed Data Type Arithmetic Expressions

Avoid mixing integer and floating-point (REAL) data in the same computation. Expressing all numbers in a floating-point arithmetic expression (assignment statement) as floating-point values eliminates the need to convert data between fixed and floating-point formats. Expressing all numbers in an integer arithmetic expression as integer values also achieves this. This improves run-time performance.

For example, assuming that I and J are both INTEGER variables, expressing a constant number (2.0) as an integer value (2) eliminates the need to convert the data. The following examples demonstrate inefficient and efficient code.

Inefficient code:

INTEGER I, J
  I = J / 2.0

Efficient code:

INTEGER I, J
  I = J / 2

Special Considerations for Auto-Vectorization of the Innermost Loops

Auto-vectorization of an innermost loop packs multiple data elements from consecutive loop iterations into a vector register that is 128 bits (Intel SSE) or 256 bits (Intel AVX) wide.

Consider a loop that uses data of different sizes, for example, REAL and DOUBLE PRECISION. For REAL data, the compiler tries to pack data elements from four (SSE) or eight (AVX) consecutive iterations (32 bits x 4 = 128 bits, 32 bits x 8 = 256 bits). For DOUBLE PRECISION data, the compiler tries to pack data elements from two (SSE) or four (AVX) consecutive iterations (64 bits x 2 = 128 bits, 64 bits x 4 = 256 bits). Because the number of iterations packed per register does not match, the compiler sometimes fails to auto-vectorize the loop even after trying to remedy the situation automatically.

If your attempt to auto-vectorize an innermost loop fails, it is good practice to try using same-sized data. INTEGER and REAL are considered same-sized data because both are 32 bits in size.

The following example shows code that is not auto-vectorizable:

DOUBLE PRECISION A(N), B(N) 
REAL C(N), D(N) 
DO I=1, N
   A(I)=D(I)
   C(I)=B(I) 
ENDDO

The following example shows code that is auto-vectorizable after automatic distribution into two loops:

DOUBLE PRECISION A(N), B(N) 
REAL C(N), D(N) 
DO I=1, N
   A(I)=B(I)
   C(I)=D(I) 
ENDDO

The following example shows code that is auto-vectorizable as one loop:

REAL A(N), B(N) 
REAL C(N), D(N) 
DO I=1, N
   A(I)=B(I)
   C(I)=D(I) 
ENDDO

Using Efficient Data Types

In cases where more than one data type can be used for a variable, consider selecting the data types based on the following hierarchy, listed from most to least efficient: integer, single-precision real (REAL), double-precision real (DOUBLE PRECISION), and extended-precision real.
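For example, a loop that needs only an integer counter and single-precision accumulation has no reason to use wider types throughout; the program below is an illustrative sketch:

PROGRAM EFFICIENT_TYPES
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  INTEGER :: I              ! loop counters are naturally INTEGER
  REAL    :: A(N), TOTAL    ! single precision is sufficient here

  A = 1.0
  TOTAL = 0.0
  DO I = 1, N
     TOTAL = TOTAL + A(I)   ! all-REAL arithmetic, no type conversions
  END DO
  PRINT *, TOTAL
END PROGRAM EFFICIENT_TYPES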

Note

In an arithmetic expression, you should avoid mixing integer and floating-point data.
