Elemental functions are a general language construct for expressing a data parallel algorithm. An elemental function is written as a regular C/C++ function, and the algorithm within it describes the operation on one element, using scalar syntax. The function can then be called as a regular C/C++ function to operate on a single element, or it can be called in a data parallel context, which provides many elements to operate on. In Intel® Cilk™ Plus, the data parallel context is provided as an array.
When you write an elemental function, the Intel® compiler generates a short vector form of the function, which can perform your function's operation on multiple arguments in a single invocation. The short vector version may be able to perform multiple operations as fast as the regular implementation performs a single one by utilizing the vector ISA in the CPU. In addition, upon invocation of the function, if the data set is large enough, the compiler may assign different copies of the elemental functions to different workers, executing them concurrently. The end result is that your data parallel operation executes on the CPU utilizing both the parallelism available in the multiple cores and the parallelism available in the vector ISA.
For the compiler to generate the short vector function, you must mark the function in your code. Use the __attribute__((vector(clauses))) declaration, as follows:
__attribute__((vector(clauses))) return_type elemental_function_name(arguments)
The clauses for the vector declaration take the following values:
processor(cpuid) |
Where cpuid takes one of the following values:
|
vectorlength(n) |
Where n is the vector length (vl). It must be an integer power of 2; the accepted values are 2, 4, 8, and 16. If you specify more than one n, the compiler chooses the vector length from the values specified. The vectorlength clause tells the compiler that each routine invocation at the call site should execute the computation equivalent to n executions of the scalar function. Multiple vectorlength clauses are merged as a union. |
vectorlengthfor(datatype) |
Where datatype must be one of the following built-in types; otherwise, the behavior is undefined.
When you use the vectorlengthfor clause, n is computed as the size of the vector register divided by the size of the data type, for the processor being used. For example, vectorlengthfor(float) results in n=4 for Intel® SSE2 to Intel® SSE4.2 target processors (with packed float operations available on 128-bit XMM registers), and n=8 for Intel® AVX target processors (with packed float operations available on 256-bit YMM registers). Using vectorlengthfor(int) results in n=4 for Intel® SSE2 to Intel® AVX target processors.
|
linear(param1:step1 [, param2:step2]…) |
Where
|
scalar(param [, param]…) |
Where param is a formal parameter of the specified function. The scalar clause tells the compiler that the values of the specified arguments can be broadcast to all iterations as a performance optimization. |
[no]mask |
The mask clause tells the compiler to generate a masked vector version of the routine; the nomask clause tells it to generate a vector version without a mask. |
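The vectorlengthfor arithmetic described above can be sketched in portable C. This is only an illustration of the division of register width by element size; lanes_for is a hypothetical helper, not a compiler API, and the register widths passed to it are assumptions taken from the examples in the text (128-bit XMM, 256-bit YMM).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of any compiler API):
 * number of lanes = register width in bytes / element size in bytes. */
static unsigned lanes_for(unsigned register_bits, size_t element_size)
{
    assert(element_size > 0 && register_bits % 8 == 0);
    return (unsigned)((register_bits / 8) / element_size);
}
```

On a platform where float and int are 4 bytes wide, lanes_for(128, sizeof(float)) yields 4 and lanes_for(256, sizeof(float)) yields 8, which matches the values given for vectorlengthfor(float). The n=4 stated for vectorlengthfor(int) on Intel® AVX targets corresponds to a 128-bit width in this arithmetic.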
Write the code inside your function using existing C/C++ syntax.
Typically, the invocation of an elemental function provides arrays wherever scalar arguments are specified as formal parameters. Use the array notation syntax available in Intel® Cilk™ Plus to provide the arrays succinctly. Alternatively, you can invoke the function from a _Cilk_for loop.
The following example shows how to use elemental functions to add two large arrays and store the result in a third array, taking advantage of the parallelism available in both the cores and the vectors in the CPU:
Example

```c
// Declaring the elemental function
__attribute__((vector)) double ef_add(double x, double y) {
    return x + y;
}

// Invoking the function using array notation:
// operates on the whole extent of the arrays a, b, c
a[:] = ef_add(b[:], c[:]);

// Use the full array notation construct to also specify
// n as an extent and s as a stride
a[0:n:s] = ef_add(b[0:n:s], c[0:n:s]);

// Use the _Cilk_for construct to invoke the elemental function
// in a data parallel context
_Cilk_for (j = 0; j < n; ++j) {
    a[j] = ef_add(b[j], c[j]);
}
```
Only calling code that uses the _Cilk_for syntax is able to use all available parallelism. Invoking the elemental function through the array notation syntax, or from a regular for loop, calls the short vector function in each iteration and therefore uses the vector parallelism, but the invocation happens in a serial loop and does not use multiple cores.
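The serial case can be pictured with a plain C sketch. It is not what the compiler emits: ef_add_v is a hand-written stand-in for the generated short vector function, the vector length VL = 4 is an assumption, and add_arrays is a hypothetical driver that plays the role of the serial loop.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 4 };  /* assumed vector length, for illustration only */

/* Scalar form: the function as the programmer writes it. */
static double ef_add(double x, double y) { return x + y; }

/* Stand-in for the short vector form: one call handles VL elements.
 * A real compiler would use vector instructions instead of this loop. */
static void ef_add_v(const double *x, const double *y, double *out)
{
    for (int lane = 0; lane < VL; ++lane)
        out[lane] = ef_add(x[lane], y[lane]);
}

/* Serial driver: calls the vector form once per VL-sized chunk on a
 * single core, then finishes any leftover elements with the scalar form. */
static void add_arrays(double *a, const double *b, const double *c, size_t n)
{
    assert(n == 0 || (a && b && c));
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        ef_add_v(&b[i], &c[i], &a[i]);
    for (; i < n; ++i)
        a[i] = ef_add(b[i], c[i]);
}
```

In this picture, a _Cilk_for version would additionally hand different chunks to different workers, which is the part the serial loop leaves unused.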
Limitations
The following language constructs are disallowed within elemental functions:
The goto statement
The switch statement with 16 or more case statements
Operations on classes and structs (other than member selection)
The _Cilk_spawn keyword
Expressions with array notations
Copyright © 1996-2011, Intel Corporation. All rights reserved.