We loosely define a "stencil" as iterative code where the value to be computed at the current step depends on the value(s) computed at the previous step(s).
Please visit this site for information about Dalsoft's Parallel Library - dpl - a stand-alone library that:

provides parallel implementations of various basic serial stencils
provides parallel implementations of conditional functions and stochastic ordering
provides a parallel implementation of the Gauss–Seidel method for solving a linear system of equations
provides a parallel implementation of the general 2D 5-point stencil
shows how to apply the developed technology to parallelize code of your choice, e.g. DSP filters and codes used in computational ( quantitative ) finance

Stencils, as defined above, exhibit a dependency between the value to be computed and the values already computed, which makes parallel execution of such code challenging. dco can automatically parallelize the serial code of certain stencils. In this article we will:

list the stencils that can be successfully treated by dco
outline the requirements for efficient execution of a parallelized stencil and present the performance results observed
analyze the accuracy of the results generated by the parallelized stencil code

list the stencils that can be successfully treated

dco is capable of creating parallel code for the following sequential code sequences:

x[i] = c(i)*x[i] + b(i) + a(i)*x[i-STRIDE]
x[i] = b(i) + a(i)*x[i-STRIDE]
x[i] = c(i)*x[i] + b(i) + a(i)/x[i-STRIDE]
x[i] = b(i) + a(i)/x[i-STRIDE]
acc = b(i) + a(i)*acc
acc = b(i) + a(i)/acc

where STRIDE is an integer constant greater than 0, and c(i), b(i) and a(i) are arbitrary expressions that do not depend on the values of the array x or on the value of acc.
Note that dco does not require a stencil to be the only code in a loop. The loop may contain other code, including more than one stencil.
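
For concreteness, two of the forms listed above might appear in C source as follows. This is only an illustrative sketch: the function names, the STRIDE value of 2 and the use of plain arrays for the coefficients are assumptions made for the example, not something taken from dco.

```c
#include <assert.h>
#include <stddef.h>

#define STRIDE 2 /* illustrative value; any integer constant greater than 0 */

/* Form: x[i] = c(i)*x[i] + b(i) + a(i)*x[i-STRIDE]
   The coefficients are read from arrays and never depend on x. */
void stencil_mul(double *x, const double *a, const double *b,
                 const double *c, size_t n)
{
    assert(x && a && b && c);
    for (size_t i = STRIDE; i < n; i++)
        x[i] = c[i] * x[i] + b[i] + a[i] * x[i - STRIDE];
}

/* Form: acc = b(i) + a(i)*acc
   The value carried from one iteration to the next is a scalar. */
double stencil_acc(double acc, const double *a, const double *b, size_t n)
{
    assert(a && b);
    for (size_t i = 0; i < n; i++)
        acc = b[i] + a[i] * acc;
    return acc;
}
```

In both loops iteration i needs a value produced by an earlier iteration, which is what prevents a straightforward split of the iteration range across threads.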

requirements for the efficient execution of a parallelized stencil and performance results

requirements

If the stencil being parallelized is the only code in the loop, the dco-generated parallel code must be executed with more than two threads available in order to benefit from parallelization. Also, if the coefficients c(i), b(i) and a(i) described above reference memory locations ( e.g. are references to arrays in memory ), it may be beneficial for the combined memory they reference to fit in a cache of the underlying processor.
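
To judge the cache requirement, it can help to estimate the combined size of the arrays a stencil touches. The helper below is a rough sketch: the function names are hypothetical, and the cache size passed in is whatever the reader's processor provides, not a value stated in this article.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes referenced by one pass over a stencil that uses n_arrays
   arrays of n double precision elements each (x plus coefficients). */
size_t stencil_footprint_bytes(size_t n, size_t n_arrays)
{
    assert(n_arrays > 0);
    return n * n_arrays * sizeof(double);
}

/* Returns 1 if that footprint fits in a cache of cache_bytes, else 0. */
int fits_in_cache(size_t n, size_t n_arrays, size_t cache_bytes)
{
    return stencil_footprint_bytes(n, n_arrays) <= cache_bytes;
}
```

For example, four arrays ( x, c, b and a ) of 100000 doubles occupy 3200000 bytes when a double is 8 bytes wide.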

results of execution

results of stencils execution

Here we present the performance results for the set of stencils listed here.
The stencils utilized consisted of 100000 elements.
x[i] and ( if used ) the coefficients c[i], b[i] or a[i] refer to arrays of 100000 double precision elements initialized before the execution to random values in the range [-1.5,1.5]. acc ( if used ) refers to a double precision value randomly initialized before the execution of a stencil.

All benchmarks were executed under the Linux operating system on a 16-core Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.
Every stencil of a benchmark was executed 1000 times, and every benchmark was run 3 times, with the reported time being neither the best nor the worst. See this for an explanation of how the optimization improvements were calculated.

The columns under the gcc and gcc+dco headers present execution times ( in seconds ) achieved by the gcc-generated code and the dco-optimized code respectively. The column under the gcc+dco/gcc header lists the improvement achieved by using dco over the gcc-generated code.
Each row represents data for a certain stencil.

stencil	gcc	gcc+dco	gcc+dco/gcc
x[i] = c[i]*x[i] + b[i] + a[i]*x[i-1]	0.608	0.167	73%
x[i] = b[i] + a[i]*x[i-1]	0.474	0.146	69%
x[i] = c[i]*x[i] + b[i] + a[i]/x[i-1]	0.972	0.243	75%
x[i] = b[i] + a[i]/x[i-1]	0.828	0.214	74%
acc = b[i] + a[i]*acc	0.298	0.137	54%
acc = b[i] + a[i]/acc	0.642	0.245	62%
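
The explanation of how the improvements were calculated is not reproduced in this article. The percentages in the table are, however, consistent with the relative reduction in execution time, 100*(1 - t_dco/t_gcc). Assuming that formula, a minimal sketch ( the function name is hypothetical ) is:

```c
#include <assert.h>

/* Assumed formula: percentage by which the gcc+dco time is lower
   than the gcc time. Inferred from the table, not stated by the source. */
double improvement_percent(double t_gcc, double t_dco)
{
    assert(t_gcc > 0.0);
    return 100.0 * (1.0 - t_dco / t_gcc);
}
```

With the first row of the table, improvement_percent(0.608, 0.167) evaluates to roughly 72.5, which rounds to the reported 73%.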

results of execution of code that uses stencils

dco is also able to parallelize serial code that contains a stencil. For example, the following code was successfully parallelized by dco:

 for ( i = 1 ; i<100000 ; i++ )
 {
  x[i] = b[i] + a[i]/x[i-1];
  c[i] += exp( cos( x[i] ) );
 }

Parallelizing the stencil x[i] = b[i] + a[i]/x[i-1], which is highly beneficial by itself, also made it possible to create parallel code for the whole loop.
The following execution results were observed for this loop:

gcc	3.573
gcc+dco	0.779
gcc+dco/gcc	78%

the accuracy of the results generated by the parallelized stencil code

A stencil, being serial code with a dependency between the value to be computed and the value(s) already computed, cannot be executed in parallel as written. To make that possible, the stencil code must be mathematically transformed into code that is logically identical to the original serial code but free of those dependencies. Because floating point execution is inexact, the results generated by the parallel code may therefore differ from those generated by the original code. However, for exactly the same reason, the results generated by the original serial code may themselves differ from the precise results that the calculation is meant to produce.
Here we analyze the deviation of the results produced by the dco-generated parallel code from the results produced by the original serial code. We also analyze the deviation of both of those results from the precise "theoretical" result.
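
This article does not describe the transformation that dco applies. To illustrate the general idea of removing such a dependency, the sketch below uses a well-known textbook technique for the form x[i] = b[i] + a[i]*x[i-1]: each step is an affine map v -> a[i]*v + b[i], and a run of consecutive steps composes into a single affine map A*v + B. It should not be read as dco's algorithm; the function names, the chunk count and the MAX_CHUNKS limit are assumptions made for the example. The chunk loops are written serially here, but their iterations are independent of each other and could be given to separate threads.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHUNKS 64 /* illustrative limit for the fixed-size work arrays */

/* Serial reference: x[i] = b[i] + a[i]*x[i-1] for i = 1..n-1. */
void recurrence_serial(double *x, const double *a, const double *b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] = b[i] + a[i] * x[i - 1];
}

/* Chunked variant of the same recurrence (not dco's method). */
void recurrence_chunked(double *x, const double *a, const double *b,
                        size_t n, size_t chunks)
{
    assert(n >= 1 && chunks >= 1 && chunks <= MAX_CHUNKS);
    double A[MAX_CHUNKS], B[MAX_CHUNKS], start[MAX_CHUNKS];
    size_t len = (n - 1 + chunks - 1) / chunks; /* steps per chunk */

    /* Pass 1 (independent per chunk): compose the chunk's affine maps. */
    for (size_t k = 0; k < chunks; k++) {
        size_t lo = 1 + k * len;
        size_t hi = lo + len < n ? lo + len : n;
        A[k] = 1.0;
        B[k] = 0.0;
        for (size_t i = lo; i < hi; i++) {
            /* a[i]*(A*v + B) + b[i] = (a[i]*A)*v + (a[i]*B + b[i]) */
            A[k] = a[i] * A[k];
            B[k] = a[i] * B[k] + b[i];
        }
    }

    /* Pass 2 (serial, one step per chunk): value entering each chunk. */
    start[0] = x[0];
    for (size_t k = 1; k < chunks; k++)
        start[k] = A[k - 1] * start[k - 1] + B[k - 1];

    /* Pass 3 (independent per chunk): run the recurrence inside the chunk. */
    for (size_t k = 0; k < chunks; k++) {
        size_t lo = 1 + k * len;
        size_t hi = lo + len < n ? lo + len : n;
        double prev = start[k];
        for (size_t i = lo; i < hi; i++) {
            prev = b[i] + a[i] * prev;
            x[i] = prev;
        }
    }
}
```

The two functions are mathematically equivalent, but the chunked one evaluates products and sums in a different order, so for general double precision inputs its output can differ from the serial output in the last bits. The division forms can be handled along the same lines, because v -> b + a/v is a linear fractional map and such maps compose like 2x2 matrix products.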

how was it done

For a comprehensive description of the verification process, see the section testing dco parallelization of a stencil in the chapter Examples of this document.

The verification uses a number of software packages developed by Dalsoft:

Dalsoft High Precision library - dhp - that allows floating point calculations to be performed with unlimited ( arbitrary ) precision.
Dalsoft Random Testing package - drt - that provides a framework for automated software testing ( fuzzing ).

Three execution results were created:

dco result - result of executing the parallel code created by dco for the stencil being parallelized
exact result - result of executing the serial code of the stencil using double precision values
precise result - result of executing the serial code of the stencil using the Dalsoft High Precision ( dhp ) package

The dco result is generated by the parallel code and is usually fast to obtain. The exact result is what the standard implementation of a stencil generates. Due to the inexact nature of double precision floating point execution, it is not clear how accurate these results are. We assume that using high precision data produces a more accurate result, the precise result.

The Dalsoft Random Testing ( drt ) package was used to establish a working framework for the verification: it repeatedly calculates the three execution results for randomly generated input data and collects statistics. At the end, drt prints a report that, among a lot of other data, contains:

Max relative deviation - maximum relative difference between dco result and exact result detected during the verification process
MaxRelativeDiff ( MRD ) - reported as two values:
MRD dco-precise - maximum relative difference between dco result and precise result detected during the verification process
MRD exact-precise - maximum relative difference between exact result and precise result detected during the verification process
dco results better - number of cases during the verification process in which the relative difference between dco result and precise result was smaller than the relative difference between exact result and precise result
exact results better - number of cases during the verification process in which the relative difference between exact result and precise result was smaller than the relative difference between dco result and precise result
error count - number of cases during the verification process in which the relative difference between dco result and exact result exceeded the established threshold
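
The bookkeeping behind these report fields can be sketched as follows. This is not drt code: the type and function names are hypothetical, long double merely stands in for a dhp high precision value, and the relative difference is assumed to be |value - reference| / |reference|, which the article does not spell out.

```c
#include <assert.h>

typedef struct {
    long double max_rel_dev;       /* Max relative deviation (dco vs exact) */
    long double mrd_dco_precise;   /* MRD dco-precise */
    long double mrd_exact_precise; /* MRD exact-precise */
    long dco_better;
    long exact_better;
    long error_count;
    long cases;
} VerifyStats;

static long double abs_ld(long double v) { return v < 0 ? -v : v; }

/* Assumed definition of relative difference. Equal values give 0;
   a zero reference with a non-zero value yields infinity. */
static long double rel_diff(long double value, long double reference)
{
    if (value == reference)
        return 0.0L;
    return abs_ld(value - reference) / abs_ld(reference);
}

/* Fold one verification case into the running statistics. */
void stats_update(VerifyStats *s, double dco, double exact,
                  long double precise, long double threshold)
{
    assert(s != 0);
    long double dev = rel_diff(dco, exact);
    long double d_dco = rel_diff(dco, precise);
    long double d_exact = rel_diff(exact, precise);

    if (dev > s->max_rel_dev) s->max_rel_dev = dev;
    if (d_dco > s->mrd_dco_precise) s->mrd_dco_precise = d_dco;
    if (d_exact > s->mrd_exact_precise) s->mrd_exact_precise = d_exact;

    if (d_dco < d_exact) s->dco_better++;
    else if (d_exact < d_dco) s->exact_better++;

    if (dev > threshold) s->error_count++;
    s->cases++;
}
```

A zero-initialized VerifyStats is updated once per compared value, and the fields are then printed in the form count/cases.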

results of verification

The following are the results of the verification method described in the previous section. The stencils used consisted of 100000 elements.
x[i] and ( if used ) the coefficients c[i], b[i] or a[i] refer to arrays of 100000 double precision elements initialized at each iteration of the verification process to random values in the range [-1.5,1.5]. acc ( if used ) refers to a double precision value randomly initialized before the execution of a stencil.
We set the threshold to 1e-16 and mark as an "error" every case in which the relative difference between dco result and exact result exceeds this threshold. Here we present only the information provided by drt that was explained in the previous section. We report each count as a pair: the count itself and the total number of cases attempted. Note that subtracting dco results better and exact results better from the total number of cases attempted gives the number of cases where dco result is the same as exact result. Visit this to see all the data provided by drt.
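
The subtraction mentioned in the note can be written as a small helper. The function name is hypothetical and is used only for this illustration.

```c
#include <assert.h>

/* Cases where dco result and exact result coincide, derived from the
   reported counts as described in the note above. */
long identical_cases(long total, long dco_better, long exact_better)
{
    assert(dco_better >= 0 && exact_better >= 0);
    assert(dco_better + exact_better <= total);
    return total - dco_better - exact_better;
}
```

For counts of 86 and 311 out of 21900000 cases, this gives 21899603 cases with identical results.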

x[i] = c(i)*x[i] + b(i) + a(i)*x[i-1]

Max relative deviation	2.382528e-12
MRD dco-precise	0.00000001751523006384221755666982
MRD exact-precise	0.00000001751523006384221755666982
dco results better	86/21900000
exact results better	311/21900000
error count	106/21900000

x[i] = b(i) + a(i)*x[i-1]

Max relative deviation	3.430467e-14
MRD dco-precise	0.00000000658414747780821819245956
MRD exact-precise	0.00000000658414747780821819245956
dco results better	47/22800000
exact results better	188/22800000
error count	55/22800000

x[i] = c(i)*x[i] + b(i) + a(i)/x[i-1]

Max relative deviation	3.878528e-10
MRD dco-precise	0.00000027455823449607702741336274
MRD exact-precise	0.00000027455823449607702741336274
dco results better	1056/183500000
exact results better	3795/183500000
error count	1098/183500000

x[i] = b(i) + a(i)/x[i-1]

Max relative deviation	3.886594e-13
MRD dco-precise	0.00000000178682037959313416769158
MRD exact-precise	0.00000000178682037959313416769158
dco results better	157/19000000
exact results better	479/19000000
error count	121/19000000

acc = b(i) + a(i)*acc

Max relative deviation	0
MRD dco-precise	0.00000000000037207067653795213940
MRD exact-precise	0.00000000000037207067653795213940
dco results better	0/249600000
exact results better	0/249600000
error count	0/249600000

acc = b(i) + a(i)/acc

Max relative deviation	0
MRD dco-precise	0.00000000000060847490332735719626
MRD exact-precise	0.00000000000060847490332735719626
dco results better	0/75400000
exact results better	0/75400000
error count	0/75400000