We loosely define "stencil" to be iterative code were the value to be computed at the current step depends on the value(s) computed at the previous step(s).
Stencils, as defined above, clearly exhibit dependency between value to be computed and values already computed thus making parallel execution of such a code to be challenging. dco is capable to automatically perform parallelization of the serial code for certain stencils. In this article we will:
Please visit this page for information about Dalsoft's Parallel Library - dpl - a stand alone library that provides parallel execution for a number of sequential functions/algorithms ( that includes various stencils ) and implements parallel execution of the Gauss-Seidel method to solve a linear system of equations.

list the stencils that can be successfully treated

dco is capable of creating parallel code for the following sequential code sequences:
where STRIDE is an integer constant greater than 0, c(i), b(i) and a(i) are arbitrary expressions that don't depend on the values of the array x or value of acc.
Note that dco doesn't require for a stencil to be a sole code of a loop. Loop may contain another code, even more than one stencil.

requirements for the efficient execution of a parallelized stencil and performance results

requirements

Unless a stencil being parallelized is not a sole code of a loop, in order to benefit from parallelization, the dco generated parallel code of a stencil shall be executed with more than two threads available. Also, if the coefficients c(i), b(i) and a(i), explained above, reference memory locations ( e.g. are references to memory arrays ), it may be beneficial for a combined memory referenced to feet a cache of the underlying processor.

results of execution

results of stencils execution

Here we present the performance results for the set of stencils listed here.
The stencils utilized consisted of 100000 elements.
x[i] and ( if used ) the coefficients c[i] or b[i] or a[i] refer to the arrays of 100000 double precision elements initialized before the execution to a random values in the range [-1.5,1.5]. acc ( if used ) refers to a double precision value randomly initialized before the execution of a stencil.

All benchmarks were executed under Linux operating system running on the 16 cores Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.
Every stencil of a benchmark was executed 1000 times and every benchmark was run 3 times with the time reported being neither the best nor the worst. See this for explanation on how the optimization improvements were calculated.

The columns under gcc and gcc+dco headers present execution times ( in seconds ) achieved by the gcc generated code and dco optimized code respectively. The column under the gcc+dco/gcc header lists the improvements achieved by utilizing dco over the gcc generated code.
Each row represents data for a certain stencil.


gcc gcc+dco gcc+dco/gcc
x[i] = c[i]*x[i] + b[i] + a[i]*x[i-1] 0.608 0.167 73%
x[i] = b[i] + a[i]*x[i-1] 0.474 0.146 69%
x[i] = c[i]*x[i] + b[i] + a[i]/x[i-1] 0.972 0.243 75%
x[i] = b[i] + a[i]/x[i-1] 0.828 0.214 74%
acc = b[i] + a[i]*acc 0.298 0.137 54%
acc = b[i] + a[i]/acc 0.642 0.245 62%

results of execution of code that uses stencils

dco is also able to perform parallelization of the serial code that contains a stencil. For example, the following code was successfully parallelized by dco:
 for ( i = 1 ; i<100000 ; i++ )
{
x[i] = b[i] + a[i]/x[i-1];
c[i] += exp( cos( x[i] ) );
}

Parallelization of the stencil x[i] = b[i] + a[i]/x[i-1], by itself being higly beneficial, also allowed to create parallel code for the whole loop.
The following execution results were observed for this loop:

gcc 3.573
gcc+dco 0.779
dco+gcc/gcc 78%

the accuracy of the results generated by the parallelized stencil code

Stencil, being a serial code with the dependency between value to be computed and value(s) already computed, may not be executed on a parallel machine. To achieve that the stencils code shall be mathematically transformed into a code that is logically identical to the original serial code but with the cited dependencies being eliminated. Therefore, due to inexact nature of the floating point execution, the results generated by the parallel code may differ from that generated by the original code. However, for exactly the same reason, the results generated by the original serial code may differ from the precise desirable results that are attempted to be calculated.
Here we will analyze the deviation of the results produced by the dco generated parallel code from the results produced by the original serial code. We will also analyze the deviation of the both those results from the precise "theoretical" result.

how was it done

For the comprehensive description of the process of verification, see the section testing dco parallelization of a stencil in the chapter Examples of this document.

The verification utilizes number of the software packages developed by Dalsoft: Three execution results were created: dco result is generated by the parallel code and, usualy, is fast. exact result is what the standard implementation of a stencil generates. Due to the inexact nature of the double precision floating point execution, it is not clear how accurate these results are. We assume that using high precision data allows to generate more accurate - precise result.

The Dalsoft Random Testing ( drt ) package was used to establish working framework to carry out the verification repeatedly performing the calculations of the three execution results for randomly generated input data and collecting the statistics. At the end drt prints the report that, among a lot of other data, contains:

results of verification

The following are the results of the verification method that was described in the previous paragraph. The stencils utilized consisted of 100000 elements.
x[i] and ( if used ) the coefficients c[i] or b[i] or a[i] refer to the arrays of 100000 double precision elements initialized at each iteration of the verification process to a random values in the range [-1.5,1.5]. acc ( if used ) refers to a double precision value randomly initialized before the execution of a stencil.
We establish the threshold to be 1e-16 marking as an "error" every case when the relative difference between dco result and exact result exceeds this threshold. Here we present only information provider by drt that was explained in the previous section; we report numeric counts as pair of of the count itself to the total number of cases attempted - note that subtracting from total number of cases attempted dco results better and exact results better you will get the number of cases where dco result are the same as exact result. Visit this to see all the data provided by drt.