Worksharing Using OpenMP*

To get the maximum performance benefit from a processor with multi-core and Hyper-Threading Technology, an application needs to be executed in parallel. Parallel execution requires threads, and threading an application is not a simple thing to do; using OpenMP* can make the process a lot easier. Using the OpenMP pragmas, most loops with no loop-carried dependency can be threaded with one simple statement. This topic explains how to start using OpenMP to parallelize loops, which is also called worksharing.

Most loops can be threaded by inserting one pragma immediately prior to the loop. Further, by leaving the details to the compiler and OpenMP, you can spend more time determining which loops should be threaded and how to best restructure the algorithms for maximum performance. The maximum performance of OpenMP is realized when it is used to thread hotspots, the most time-consuming loops in your application.

The power and simplicity of OpenMP is demonstrated by looking at an example. The following loop converts a 32-bit RGB (red, green, blue) pixel to an 8-bit gray-scale pixel. One pragma, which has been inserted immediately before the loop, is all that is needed for parallel execution.

Example

#pragma omp parallel for

for (i=0; i < numPixels; i++)

{

pGrayScaleBitmap[i] = (unsigned BYTE)

(pRGBBitmap[i].red * 0.299 +

pRGBBitmap[i].green * 0.587 +

pRGBBitmap[i].blue * 0.114);

}

First, the example uses worksharing, which is the general term used in OpenMP to describe distribution of work across threads. When worksharing is used with the for construct, as shown in the example, the iterations of the loop are distributed among multiple threads so that each loop iteration is executed exactly once and in parallel by one or more threads. OpenMP determines the number of threads to create and how to best create, synchronize, and destroy them. OpenMP places the following five restrictions on which loops can be threaded:

Although these restrictions might sound somewhat limiting, non-conforming loops can easily be rewritten to follow these restrictions.

Basics of Compilation

Using the OpenMP pragmas requires an OpenMP-compatible compiler and thread-safe libraries. Adding the /Qopemp option to the compiler instructs the compiler to pay attention to the OpenMP pragmas and to insert threads. If you omit the /Qopenmp option, the compiler will ignore OpenMP pragmas, which provides a very simple way to generate a single-threaded version without changing any source code.

For conditional compilation, the compiler defines _OPENMP. If needed, the define can be tested as shown in the following example.

Example

#ifdef _OPENMP

fn();

#endif

A Few Simple Examples

The following examples illustrate how simple OpenMP is to use. In common practice, additional issues need to be addressed, but these example illustrate a good starting place.

In the first example, the following loop clips an array to the range 0...255.

Example

// clip an array to 0 <= x <= 255

for (i=0; i < numElements; i++)

{

if (array[i] < 0)

array[i] = 0;

else if (array[i] > 255)

array[i] = 255;

}

You can thread it using a single OpenMP pragma; insert the following pragma immediately prior to the loop:

Example

#pragma omp parallel for

for (i=0; i < numElements; i++)

{

if (array[i] < 0)

array[i] = 0;

else if (array[i] > 255)

array[i] = 255;

}

In the second example, the loop generates a table of square roots for the numbers 0...100.

Example

double value;

double roots[100];

for (value = 0.0; value < 100.0; value ++)

{

roots[(int)value] = sqrt(value);

}

Thread the loop by changing the loop variable to a signed integer or unsigned integer and inserting a #pragma omp parallel pragma.

Example

int value;

double roots[100];

#pragma omp parallel for

for (value = 0; value < 100; value ++)

{

roots[value] = sqrt((double)value);

}

Avoiding Data Dependencies and Race Conditions

When a loop meets all five loop restrictions (listed above) and the compiler threads the loop, the loop still might not work correctly due to the existence of data dependencies.

Data dependencies exist when different iterations of a loop-more specifically a loop iteration that is executed on a different thread-reads or writes shared memory. Consider the following example that calculates factorials.

Example

// Each loop iteration writes a value that a different iteration reads.

#pragma omp parallel for

for (i=2; i < 10; i++)

{

factorial[i] = i * factorial[i-1];

}

The compiler will thread this loop, but the threading will fail because at least one of the loop iterations is data-dependent upon a different iteration. This situation is referred to as a race condition. Race conditions can only occur when using shared resources (like memory) and parallel execution. To address this problem either rewrite the loop or pick a different algorithm, one that does not contain the race condition.

Race conditions are difficult to detect because, for a given case or system, the variables might win the race in the order that happens to make the program function correctly. Because a program works once does not mean that the program will work under all conditions. Testing your program on various machines, some with Hyper-Threading Technology and some with multiple physical processors, is a good starting point to help identify race conditions.

Traditional debuggers are useless for detecting race conditions because they cause one thread to stop the race while the other threads continue to significantly change the runtime behavior; however, thread checking tools can help.

Managing Shared and Private Data

Nearly every loop (in real applications) reads from or writes to memory; it's your responsibility, as the developer, to instruct the compiler what memory should be shared among the threads and what memory should be kept private. When memory is identified as shared, all threads access the exact same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread to access in private. When the loop ends, the private copies are destroyed. By default, all variables are shared except for the loop variable, which is private. Memory can be declared as private in the following two ways:

The following loop fails to function correctly because the variable temp is shared. It needs to be private.

Example

// Variable temp is shared among all threads, so while one thread

// is reading variable temp another thread might be writing to it

#pragma omp parallel for

for (i=0; i < 100; i++)

{

temp = array[i];

array[i] = do_something(temp);

}

The following two examples both declare the variable temp as private memory, which solves the problem.

Example

#pragma omp parallel for

for (i=0; i < 100; i++)

{

int temp; // variables declared within a parallel construct

// are, by definition, private

temp = array[i];

array[i] = do_something(temp);

>}

The temporary variable can also be made private in the following way:

Example

#pragma omp parallel for private(temp)

for (i=0; i < 100; i++)

{

temp = array[i];

array[i] = do_something(temp);

}

Every time you use OpenMP to parallelize a loop, you should carefully examine all memory references, including the references made by called functions. Variables declared within a parallel construct are defined as private except when they are declared with the static declarator, because static variables are not allocated on the stack.

Reductions

Loops that accumulate a value are fairly common, and OpenMP has a specific clause to accommodate them. Consider the following loop that calculates the sum of an array of integers.

Example

sum = 0;

for (i=0; i < 100; i++)

{

sum += array[i]; // this variable needs to be shared to generate

// the correct results, but private to avoid

// race conditions from parallel execution

}

The variable sum in the previous loop must be shared to generate the correct result, but it also must be private to permit access by multiple threads. OpenMP provides the reduction clause that is used to efficiently combine the mathematical reduction of one or more variables in a loop. The following example demonstrates how the loop can use the reduction clause to generate the correct results.

Example

sum = 0;

#pragma omp parallel for reduction(+:sum)

for (i=0; i < 100; i++)

{

sum += array[i];

}

In the case of the example listed above, the reduction provides private copies of the variable sum for each thread, and when the threads exit, it adds the values together and places the result in the one global copy of the variable.

The following table lists the possible reductions, along with the initial variables-which is also the mathematical identify value-for the temporary private variables.

Operation

Temporary private variable initialization

+ (addition)

0

- (subtraction)

0

* (multiplication)

1

& (bitwise and)

~0

| (bitwise or)

0

^ (bitwise exclusive or)

0

&& (conditional and)

1

|| (conditional or)

0

Multiple reductions in a loop are possible by specifying comma-separated variables and reductions on a given parallel construct. Reductions variables must meet the following requirements:

Load Balancing and Loop Scheduling

Load balancing, the equal division of work among threads, is among the most important attributes for parallel application performance. Load balancing is extremely important, because it ensures that the processors are busy most, if not all, of the time. Without a balanced load, some threads may finish significantly before others, leaving processor resources idle and wasting performance opportunities.

Within loop constructs, poor load balancing is usually caused by variations in compute time among loop iterations. It is usually easy to determine the variability of loop iteration compute time by examining the source code. In most cases, you will see that loop iterations consume a uniform amount of time. When that is not true, it may be possible to find a set of iterations that consume similar amounts of time. For example, sometimes the set of all even iterations consumes about as much time as the set of all odd iterations. Similarly, it might be the case that the set of the first half of the loop consumes about as much time as the second half. In contrast, it might be impossible to find sets of loop iterations that have a uniform execution time. Regardless of the case, you should provide this extra loop scheduling information to OpenMP so it can better distribute the iterations of the loop across the threads (and therefore processors) for optimum load balancing.

By default OpenMP assumes that all loop iterations consume the same amount of time. This assumption leads OpenMP to distribute the iterations of the loop among the threads in roughly equal amounts and in such a way as to minimize the chances of memory conflicts due to false sharing. This behavior is possible because loops generally touch memory sequentially, so splitting up the loop in large chunks-like the first half and second half when using two threads-will result in the least chance for overlapping memory. While this may be the best choice for memory issues, it may be bad for load balancing. Unfortunately, the reverse is also true; what might be best for load balancing may be bad for memory performance. You must strike a balance between optimal memory usage and optimal load balancing by measuring the performance to see what method produces the best results.

Use the following general on the parallel construct syntax to instruct OpenMP to loop schedule:

Example

#pragma omp parallel for schedule(kind [, chunk size])

Four different loop scheduling types (kinds) can be provided to OpenMP, as shown in the following table. The optional parameter (chunk), when specified, must be a loop-invariant positive integer.

Kind

Description

static

Divide the loop into equal-sized chunks or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, chunk size is loop count/number of threads.

Set chunk to 1 to interleave the iterations.

dynamic

Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue.

By default, the chunk size is 1. Be careful when using this scheduling hint because of the extra overhead requirement.

guided

Similar to dynamic scheduling, but the chunk size starts off large and shrinks in an effort to reduce the amount of time threads have to go to the work queue to get more work. The optional chunk parameter specifies them minimum size chunk to use.

By default the chunk size is approximately the loop count/number of threads.

auto

When schedule (auto) is specified, the decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.

runtime

Uses the OMP_SCHEDULE environment variable to specify which one of the three loop-scheduling types should be used.

OMP_SCHEDULE is a string formatted exactly the same as would appear on the parallel construct.

Assume that you want to parallelize the following loop.

Example

for (i=0; i < NumElements; i++)

{

array[i] = StartVal;

StartVal++;

}

As written, the loop contains a data dependency, making it impossible to parallelize without a quick change. The new loop, shown below, fills the array in the same manner, but without data dependencies. The new loop can also be written using the SIMD instructions.

Example

#pragma omp parallel for

for (i=0; i < NumElements; i++)

{

array[i] = StartVal + i;

}

Observe that the code is not 100% identical because the value of variable StartVal is not incremented. As a result, when the parallel loop is finished, the variable will have a value different from the one produced by the serial version. If the value of StartVal is needed after the loop, the additional statement, shown below, is needed.

Example

// This works and is identical to the serial version.

#pragma omp parallel for

for (i=0; i < NumElements; i++)

{

array[i] = StartVal + i;

}

StartVal += NumElements;

Tasking Model

The tasking model implemented by the Intel® compiler enables OpenMP* to parallelize a large range of applications. The directives used for tasking are:

#pragma omp task [clause[[,] clause] ...] new-line

structured-block

where clause is one of the following:

#pragma omp taskwait new-line

The #pragma omp task directive defines an explicit task region as follows:

Example

void test1(LIST *head){

#pragma intel omp parallel shared(head)

{

#pragma omp single

{ LIST *p = head;

while (p != NULL) {

#pragma omp task firstprivate(p)

{

do_work1(p);

}

p = p->next;

}

}

}

}

The binding thread set of the task region is the current parallel team. A task region binds to the innermost enclosing PARALLEL region. When a thread encounters a task construct, a task is generated from the structured block enclosed in the construct. The encountering thread may immediately execute the task, or defer its execution. A task construct may be nested inside an outer task, but the task region of the inner task is not a part of the task region of the outer task.

Clauses Used

The TASK directive takes an optional comma-separated list of clauses. The data environment of the task is created according to the data-sharing attribute clauses on the task construct and any defaults that apply. These clauses are in previous sections: Below example shows a way to generate N tasks with one thread and execute them with the threads in the parallel team:

Example

#pragma omp parallel shared(data)

{

#pragma omp single private(i)

{

for (i=0, i<N; i++) {

#pragma omp task firstprivate(i shared(data)

{

do_work(data(i));

}

}

}

}

Task Scheduling

When a thread reaches a task scheduling point, it may perform a task switch, beginning or resuming execution of a different task bound to the current team. Task scheduling points are implied at the following locations:

When a thread encounters a task scheduling point it may do one of the following:

If more than one of the above choices is available, it is unspecified as to which will be chosen.

Task Scheduling Constraints


  1. An explicit task whose construct contained an if clause whose if clause expression evaluated to false is executed immediately after generation of the task.

  2. Other scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread, and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant of every task in the set. A program relying on any other assumption about task scheduling is non-conforming.

Note iconNote

Task scheduling points dynamically divide task regions into parts. Each part is executed uninterruptedly from start to end. Different parts of the same task region are executed in the order in which they are encountered. In the absence of task synchronization constructs, the order in which a thread executes parts of different schedulable tasks is unspecified.

A correct program must behave correctly and consistently with all conceivable scheduling sequences that are compatible with the rules above.

TASKWAIT pragma

The TASKWAIT directive specifies a wait on the completion of child tasks generated since the beginning of the current task. A taskwait region binds to the current task region. The binding thread set of the taskwait region is the encountering thread.

The taskwait region includes an implicit task scheduling point in the current task region. The current task region is suspended at the task scheduling point until execution of all its child tasks generated before the taskwait region are completed.

Example

#pragma omp task

{ ...

#pragma omp task

{ do_work1(); }

#pragma omp task

{ ...

#pragma omp task

{ do_work2(); }

...

}

#pragma omp taskwait

...

}

For more details on these directives, see OpenMP* C++ Compiler Directives.