Performance Enhancement Strategies

Improving performance starts with identifying the characteristics of the application you are attempting to optimize. The following table lists some common application characteristics, indicates the overall potential performance impact you can expect, and provides suggested solutions to try. These strategies have been found to be helpful in many cases; experimentation is key with these strategies.

In the context of this discussion, view the potential impact categories as an indication of the possible performance increases that might be achieved when using the suggested strategy. It is possible that application or code design issues will prohibit achieving the indicated increases; however, the listed impacts are generally true. The impact categories are defined in terms of the following performance increases, when compared to the initially tested performance:

Significant: more than 50%
High: up to 50%
Medium: up to 25%
Low: up to 10%

The following table is ordered by application characteristics and then by strategy with the most significant potential impact.

Application Characteristics	Impact	Suggested Strategies
Technnical Applications
Technical applications with loopy code	High	Technical applications are those programs that have some subset of functions that consume a majority of total CPU cycles in loop nests. Target loop nests using -03 (Linux* and Mac OS* X) or /O3 (Windows*) to enable more aggressive loop transformations and prefetching. Use High-Level Optimization (HLO) reporting to determine which HLO optimizations the compiler elected to apply. See High-Level Optimization Report.
(same as above) IA-64 architecture only	High	For -O2 and -O3 (Linux) or /O2 and /O3 (Windows), use the swp report to determine if Software Pipelining occurred on key loops, and if not, why not. You might be able to change the code to allow software pipelining under the following conditions: If recurrences are listed in the report that you suspect do not exist, eliminate aliasing problems (for example, using the restrict keyword), or use ivdep pragma on the loop. If the loop is too large or runs out of registers, you might be able to distribute the loop into smaller segments; distribute the loop manually or by using the distribute pragma. If the compiler determines the Global Acyclic Scheduler can produce better results but you think the loop should still be pipelined, use the SWP pragma on the loop.
(same as above) IA-32 and Intel® 64 architectures only	High	See Vectorization Overview and the remaining topics in the Auto-Vectorization section for applicable options. See Vectorization Report for specific details about when you can change code.
(same as above)	Medium	Use PGO profile to guide other optimizations. See Profile-guided Optimizations Overview.
Applications with many denormalized floating-point value operations	Significant	Experiment with -fp-model fast=2 (Linux and Mac OS X) or /fp:fast=2 or -ftz (Linux and Mac OS X) or /Qftz (Windows). The resulting performance increases can adversely affect floating-point calculation precision and reproducibility. See Floating-point Operations for more information about using the floating point options supported in the compiler.
Sparse matrix applications	Medium	See the suggested strategy for memory pointer disambiguation (below). Use prefetch pragma or prefetch intrinsics. Experiment with different prefetching schemes on indirect arrays. See HLO Overview or Data Prefetching starting places for using prefetching.
Server Applications
Server application with branch-centric code and a fairly flat profile	Medium	Flat profile applications are those applications where no single module seems to consume CPU cycles inordinately. Use PGO to communicate typical hot paths and functions to the compiler, so the Intel® compiler can arrange code in the optimal manner. Use PGO on as much of the application as is feasible. See Profile-guided Optimizations Overview.
Very large server applications (similar to above)	Low	Use -O1 (Linux and Mac OS X) or /O1 (Windows) to optimize for this type of application. Streamlines code in the most generic manner available. This strategy reduces the amount of code being generated, disables inlining, disables speculation, and enables caching of as much of the instruction code as possible.
Database engines	Medium	Use -O1 (Linux and Mac OS X) or /O1 (Windows) and PGO to optimize the application code.
(same as above)	Medium	Use -ipo (Linux and Mac OS X) or /Qipo (Windows) on entire application. See Interprocedural Optimizations Overview.
Other Application Types
Applications with many small functions that are called from multiple locations	Low	Use -ip (Linux and Mac OS X) or /Qip (Windows) to enable inter-procedural inlining within a single source module. Streamlines code execution for simple functions by duplicating the code within the code block that originally called the function. This will increase application size. As a general rule, do not inline large, complicated functions. See Interprocedural Optimizations Overview.
(same as above)	Low	Use -ipo (Linux and Mac OS X) or /Qipo (Windows) to enable inter-procedural inlining both within and between multiple source modules. You might experience an additional increase over using -ip (Linux and Mac OS X) or /Qip (Windows). Using this option will increase link time due to the extended program flow analysis that occurs. Use Interprocedural Optimization (IPO) to attempt to perform whole program analysis, which can help memory pointer disambiguation.

Apart from application-specific suggestions listed above, there are many application-, OS/Library-, and hardware-specific recommendations that can improve performance as suggested in the following tables:

Application-specific Recommendations

Application Area	Impact	Suggested Strategies
Cache Blocking	High	Use -O3 (Linux and Mac OS X) or /O3 (Windows) to enable automatic cache blocking; use the HLO report to determine if the compiler enabled cache blocking automatically. If not consider manual cache blocking. See Cache Blocking.
Compiler pragmas for better alias analysis	Medium	Ignore vector dependencies. Use ivdep and other pragmas to increase application speed. See .
Memory pointer disambiguation compiler keywords and options	Medium	Use restrict keyword and the -restrict (Linux and Mac OS X) or /Qrestrict (Windows) option to disambiguate memory pointers. If you use the restrict keyword in your source code, you must use -restrict (Linux and Mac OS X) or /Qrestrict (Windows) option during compilation. Instead of using the restrict keyword and option, you can use the following compiler options: -fno-fnalias (Linux and Mac OS X) -ansi-alias (Linux and Mac OS X) or /Qansi-alias (Windows) /Oa (Windows) /Ow (Windows) -fargument-alias (Linux and Mac OS X) or /Qalias-args (Windows)
Light-weight volatile	Low	Some application use volatile to ensure memory operations occur, the application does not need strong memory ordering in the hardware. IA-64 architecture: Use -mno-serialize-volatile.
Math functions	Low	Use float intrinsics for single precision data type, for example, sqrtf() not sqrt(). Call Math Kernel Library (MKL) instead of user code. Call F90 intrinsics instead of user code (to enable optimizations).
Use intrinsics instead of calling a function in assembly code	Low	The Intel® compiler includes intrinsics; use these intrinsics to optimize your code. Using compiler intrinsics can help to increase application performance while helping to your code to be more portable.

Library/OS Recommendations

Area	Impact	Description
Library	Low	Systems on IA-64 architecture only. If you have been using the setjump function, you might consider using the light (_setjmp) version instead of the heavy (setjmp) version of the function to reduce the amount of floating-point state saved in the setjmp buffer.
Symbol preemption	Low	Linux has a less performance-friendly symbol preemption model than Windows. Linux uses full preemption, and Windows uses no preemption. Use -fminshared -fvisibility=protected. See Symbol Visibility Attribute Options.
Memory allocation	Low	Using third-party memory management libraries can help improve performance for applications that require extensive memory allocation.

Area

Impact

Description

Library

Low

Systems on IA-64 architecture only. If you have been using the setjump function, you might consider using the light (_setjmp) version instead of the heavy (setjmp) version of the function to reduce the amount of floating-point state saved in the setjmp buffer.

Symbol preemption

Low

Linux has a less performance-friendly symbol preemption model than Windows. Linux uses full preemption, and Windows uses no preemption. Use -fminshared -fvisibility=protected.

See Symbol Visibility Attribute Options.

Memory allocation

Low

Using third-party memory management libraries can help improve performance for applications that require extensive memory allocation.

Hardware/System Recommendations

Component	Impact	Description
Disk	Medium	Consider using more advanced hard drive storage strategies. For example, consider using SCSI instead of IDE. Consider using the appropriate RAID level. Consider increasing the number hard drives in your system.
Memory	Low	You can experience performance gains by distributing memory in a system. For example, if you have four open memory slots and only two slots are populated, populating the other two slots with memory will increase performance.
Processor		For many applications, performance scales is directly affected by processor speed, the number of processors, processor core type, and cache size.

Component

Impact

Description

Disk

Medium

Consider using more advanced hard drive storage strategies. For example, consider using SCSI instead of IDE.

Consider using the appropriate RAID level.

Consider increasing the number hard drives in your system.

Memory

Low

You can experience performance gains by distributing memory in a system. For example, if you have four open memory slots and only two slots are populated, populating the other two slots with memory will increase performance.

Processor

For many applications, performance scales is directly affected by processor speed, the number of processors, processor core type, and cache size.