Prefetching Support

Data prefetching refers to loading data from a relatively slow memory into a relatively fast cache before the data is needed by the application. Data prefetch behavior depends on the architecture:

IA-64 architecture: The Intel® compiler generally issues prefetch instructions when you specify -O1, -O2, or -O3 (Linux*) or /O1, /O2, or /O3 (Windows*).
IA-32 and Intel® 64 architectures: The processor identifies simple, regular data access patterns and performs a hardware prefetch. The compiler will only issue prefetch instructions for more complicated data access patterns where a hardware prefetch is not expected.

Issuing prefetches improves performance in most cases; however, there are cases where issuing prefetch instructions might slow application performance. Experiment with prefetching; it can be helpful to turn prefetching on or off with a compiler option while leaving all other optimizations unaffected to isolate a suspected prefetch performance issue. See Prefetching with Options for information on using compiler options for prefetching data.

There are two primary methods of issuing prefetch instructions. One is by using compiler directives and the other is by using compiler intrinsics.

prefetch and noprefetch Pragmas

The prefetch and noprefetch pragmas are supported on Itanium® processors only. They assert that data prefetches should be generated, or not generated, for particular memory references, which influences the prefetch heuristics used by the compiler. The general syntax for these pragmas is shown below:

Syntax

#pragma noprefetch
#pragma prefetch
#pragma prefetch a,b

If a loop includes the expression A(j), placing prefetch A in front of the loop instructs the compiler to insert prefetches for A(j + d) within the loop. Here d is the number of iterations ahead to prefetch the data; it is determined by the compiler. This pragma is supported at optimization levels of -O1 (Linux*) or /O1 (Windows*) and higher. Remember that -O2 or /O2 is the default optimization level.

Example

#pragma noprefetch b
#pragma prefetch a
for (i = 0; i < m; i++)
{
    a[i] = b[i] + 1;
}
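
To make the idea of prefetching d iterations ahead more concrete, the following is a minimal hand-written sketch of roughly what such a loop might look like once prefetches are inserted. It is not the compiler's actual output. The names PREFETCH, PREFETCH_DISTANCE, and add_one_with_prefetch are hypothetical, the distance of 16 is an arbitrary assumption (the compiler chooses d itself), and the sketch uses the _mm_prefetch intrinsic where Intel® SSE is available, falling back to a no-op elsewhere.

```c
#include <stddef.h>

/* Use the SSE prefetch intrinsic where available; otherwise a no-op. */
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Hypothetical prefetch distance d, in loop iterations (an assumption). */
#define PREFETCH_DISTANCE 16

/* Computes a[i] = b[i] + 1 while requesting a[i + d] ahead of its use. */
void add_one_with_prefetch(int *a, const int *b, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        /* Only form addresses that lie inside the array. */
        if (i + PREFETCH_DISTANCE < m)
            PREFETCH(&a[i + PREFETCH_DISTANCE]);
        a[i] = b[i] + 1;
    }
}
```

A prefetch is only a hint, so the results are the same with or without it; only the timing may differ. A suitable distance depends on memory latency and on the work done per iteration, which is why measuring is usually necessary.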

The following example, which is for IA-64 architecture only, demonstrates how to use the prefetch, noprefetch, and memref_control pragmas together:

Example

#include <stdio.h>

#define SIZE 10000

int prefetch(int *a, int *b)
{
    int i, sum = 0;
#pragma memref_control a:l2
#pragma noprefetch a
#pragma prefetch b
    for (i = 0; i < SIZE; i++)
        sum += a[i] * b[i];
    return sum;
}

int main()
{
    int i, arr1[SIZE], arr2[SIZE];
    for (i = 0; i < SIZE; i++) {
        arr1[i] = i;
        arr2[i] = i;
    }
    printf("Demonstrating the use of prefetch, noprefetch,\n"
           "and memref_control pragma together.\n");
    prefetch(arr1, arr2);
    return 0;
}

Intrinsics

Before inserting compiler intrinsics, experiment with all other supported compiler options and pragmas. Compiler intrinsics are less portable and less flexible than either a compiler option or compiler pragmas.

Pragmas give the compiler hints that enable optimizations, while intrinsics request specific instructions directly. As a result, programs with pragmas are more portable, because the compiler can adapt to different processors, while programs with intrinsics may have to be rewritten or ported for different processors. This is because intrinsics are closer to assembly programming.

Some prefetching intrinsics are:

Intrinsic	Description
__lfetch	Generate the lfetch.lfhint instruction.
__lfetch_fault	Generate the lfetch.fault.lfhint instruction.
__lfetch_excl	Generate the lfetch.excl.lfhint instruction.
__lfetch_fault_excl	Generate the lfetch.fault.excl.lfhint instruction.
_mm_prefetch	Loads one cache line of data from address a to a location closer to the processor.

See Operating System Related Intrinsics and Cacheability Support Using Streaming SIMD Extensions in the Compiler Reference for more information about these intrinsics.

The following example demonstrates how to generate an lfetch.nt2 instruction using prefetch intrinsics:

Example

for (i = i0; i != i1; i += is) {
    float sum = b[i];
    int ip = srow[i];
    int c = col[ip];
    for (; ip < srow[i+1]; c = col[++ip]) {
        // Hint value 2 requests the lfetch.nt2 form.
        __lfetch(2, &value[ip+40]);
        // _mm_prefetch(&value[ip+40], 2);
        sum -= value[ip] * x[c];
    }
    y[i] = sum;
}
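
The loop above is a fragment, so the following is a self-contained sketch of the same pattern for processors with Intel® SSE, using _mm_prefetch in place of __lfetch. The function name sparse_residual, the PREFETCH and AHEAD macros, and the assumption that srow, col, and value describe a matrix in compressed-row form are all illustrative and not taken from the original. The look-ahead of 40 elements mirrors the fragment, the _MM_HINT_T1 hint is an arbitrary choice, and a bounds check keeps the prefetch address inside the value array.

```c
#include <stddef.h>

/* Use the SSE prefetch intrinsic where available; otherwise a no-op. */
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T1)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Look-ahead distance in elements, mirroring the fragment above. */
#define AHEAD 40

/* For each row i: y[i] = b[i] - sum of value[ip] * x[col[ip]] over the
 * entries ip of row i. srow[i]..srow[i+1]-1 index the entries of row i. */
void sparse_residual(int nrows, const int *srow, const int *col,
                     const float *value, const float *x,
                     const float *b, float *y)
{
    int nnz = srow[nrows];               /* total number of stored entries */
    for (int i = 0; i < nrows; i++) {
        float sum = b[i];
        for (int ip = srow[i]; ip < srow[i + 1]; ip++) {
            if (ip + AHEAD < nnz)        /* stay inside the value array */
                PREFETCH(&value[ip + AHEAD]);
            sum -= value[ip] * x[col[ip]];
        }
        y[i] = sum;
    }
}
```

The value array is read sequentially, so a fixed look-ahead is a natural fit there. The accesses to x go through col and are irregular, which is the kind of pattern a hardware prefetcher is less likely to handle on its own.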

For Intel® Streaming SIMD Extensions-enabled processors you could also use the following Intel® SSE intrinsics:

_mm_prefetch
_mm_stream_pi
_mm_stream_ps
_mm_sfence
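
As an illustration of how some of these intrinsics might be combined, the following sketch copies an array with non-temporal stores. The function name stream_copy and the look-ahead of 64 elements are hypothetical choices. The sketch assumes that dst is 16-byte aligned (which _mm_stream_ps requires) and that n is a multiple of 4. _mm_stream_pi is not shown. On targets without Intel® SSE the sketch falls back to a plain copy loop.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>

/* Copies n floats from src to dst using streaming (non-temporal) stores.
 * Assumes dst is 16-byte aligned and n is a multiple of 4. */
void stream_copy(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        /* Request source data ahead of use, staying inside the array. */
        if (i + 64 < n)
            _mm_prefetch((const char *)&src[i + 64], _MM_HINT_NTA);
        __m128 v = _mm_loadu_ps(&src[i]);   /* unaligned load of 4 floats */
        _mm_stream_ps(&dst[i], v);          /* store that bypasses the cache */
    }
    _mm_sfence();  /* make the streaming stores globally visible, in order */
}
#else
/* Fallback for targets without SSE: an ordinary copy loop. */
void stream_copy(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
#endif
```

Streaming stores tend to help only when the destination will not be read again soon; for data that is reused shortly afterwards, ordinary stores are usually the better choice.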

You can find more information about IA-64 architecture instructions by referring to the hardware and software programming resources listed in Other Resources.