Modern processors offer multiple execution units ( "cores" ) which allow simultaneous ( "parallel" ) execution of multiple code sequences. dco utilizes this functionality and offers powerful automatic parallelization that is capable of identifying code patterns suitable for parallelization and of creating optimized code that is executed by all the available cores. The part of dco that implements the auto-parallelizer is called PARALLator ( pronounced paral-a-ter ). In this article we explain how the PARALLator operates, compare it with OpenMP, describe the generated parallel code and present the optimizations it achieves.
The current implementation provides parallelization for x86 processors. Work on the LLVM version of the auto-parallelizer is in progress.
Please refer to the following article and accompanying slide show for a brief discussion of our auto-parallelization technology.
Please visit this page for information about the parallelization of stencils by dco, the results of the execution of parallelized stencil code and a study of the accuracy of the results generated by the parallelized stencil code.
Please visit this page for information about Dalsoft's Parallel Library - dpl - a stand-alone library that provides parallel execution for a number of sequential functions/algorithms ( including various stencils ) and implements parallel execution of the Gauss-Seidel method for solving a linear system of equations.

how the PARALLator operates

The automatic parallelization implemented by dco is suitable for shared-memory multi-core systems with uniform memory access ( UMA ), meaning multiple identical processors ( cores ) on a single chip that have equal access, and access times, to memory.

dco utilizes a threads model, with a single process creating and managing multiple threads that are executed in parallel by different cores. This is implemented by calling subroutines from the OpenMP library, thus requiring this library to be linked ( one way to achieve that is to specify the -fopenmp option during linking ).

The PARALLator is enabled by the -parallel command line option of dco. It identifies code sequences suitable for parallelization and, while generating the parallel code, performs all the necessary actions: resource saving/restoration, memory redirection, code splitting, code modification, thread startup/termination etc.
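
For illustration, the following is a minimal C sketch of such a split ( using OpenMP directives for brevity; this is not dco's output - dco emits the equivalent code at the assembly level, calling the OpenMP runtime directly ):

    #include <omp.h>

    #define N 1000000
    static double a[N], b[N], c[N];

    /* sequential original:  for( i = 0; i < N; i++ ) c[i] = a[i] + b[i]; */
    void add_parallel( void )
    {
        #pragma omp parallel
        {
            int nthreads = omp_get_num_threads();
            int tid      = omp_get_thread_num();
            /* split the index space into contiguous chunks, one per core */
            int chunk = ( N + nthreads - 1 ) / nthreads;
            int lo    = tid * chunk;
            int hi    = lo + chunk < N ? lo + chunk : N;
            for( int i = lo; i < hi; i++ )
                c[i] = a[i] + b[i];
        }
    }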

comparison with OpenMP

OpenMP is an application program interface ( API ) that may be used to explicitly direct multi-threaded, shared-memory parallelism. It has become a successful user model for developing parallel applications and has emerged as the de facto standard for shared-memory parallel programming. Therefore it is important to distinguish dco from OpenMP and to establish the advantages of using dco over OpenMP.

dco and OpenMP are used very differently: OpenMP requires the user to explicitly identify the code to be parallelized and to direct its parallel execution, whereas dco's parallel code generation is fully automated and does not require any involvement from the user.

Besides the differences in use, dco is capable of parallelizing code that OpenMP cannot; see this for an example.

parallel code explained

The parallel program generated by dco for a sequential code consists of two parts:
  1. master thread
  2. parallel thread

master thread

The master thread is the thread of the original program: it performs the thread startup, executes its share of the work and, upon termination of the parallel threads, continues the sequential execution of the program.

parallel thread

A parallel thread acts as a separate subroutine that executes the portion of the split code assigned to it by the master thread.

overhead of the parallel code

Both the master and the parallel threads use subroutines from the OpenMP library, some of which come at a considerable cost. This overhead may significantly impact the performance of the parallelized code. While analysing the code to be parallelized, dco considers the overhead of the use of OpenMP subroutines, avoiding parallelization if it determines that the cost of parallel execution exceeds its benefits.

It was estimated that on the Core2 system the overhead of utilizing OpenMP subroutines is 3500 clocks. This means that, for example, on a dual-core system parallelization may be beneficial only for sequential loops running for at least 7000 clocks:

if the execution time of a sequential loop is ExecTime clocks, the execution time of the parallel code on a dual-core system will be ( roughly ) ExecTime/2 + Overhead.
for parallelization to be beneficial, the execution time of the parallel code should not exceed the execution time of the sequential code:
Overhead + ExecTime/2 < ExecTime
thus
ExecTime > 2*Overhead
( in general, on a system with n cores the same condition, Overhead + ExecTime/n < ExecTime, yields ExecTime > n/(n-1)*Overhead )
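
The following is a minimal C sketch of such a profitability test ( the constant and the function are illustrative; this is not dco's internal code ):

    /* overhead of the OpenMP runtime calls, as estimated on Core2 */
    #define OMP_OVERHEAD_CLOCKS 3500L

    /* parallel execution time is ( roughly ) exec_time/ncores + overhead;
       parallelize only when that beats the sequential exec_time */
    static int worth_parallelizing( long exec_time_clocks, int ncores )
    {
        return exec_time_clocks / ncores + OMP_OVERHEAD_CLOCKS
               < exec_time_clocks;
    }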

Note that the cost of the OpenMP library routines utilized by dco is only one of the factors that affect the performance of the parallel code.

peculiarities of parallel programming

Parallel code often behaves in a manner that is not intuitive or predictable. The following articles we wrote may help you gain a deeper understanding of parallel programming.

optimizations achieved by the PARALLator

Here we explain the evaluation process and present the performance results for the set of selected benchmarks.

evaluation of the results of optimization achieved by the PARALLator

The evaluation was performed on 64-bit Linux operating systems running on different x86 CPUs. The gcc version 8.x compiler, used to process the benchmarks, was invoked ( in most cases; see this for an exception and the explanation of why it was done differently ) with the following command line options:

-S -O2 -ffast-math -mfpmath=sse -msse4.1 -fomit-frame-pointer -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables -fno-reorder-blocks

The compiler-generated assembly file was optimized by dco using the -parallel command line option, and the optimized assembly code was processed ( assembled/linked ) by gcc using the -fopenmp command line option.
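
For example, for a hypothetical source file bench.c ( the exact dco invocation syntax is assumed here for illustration ):

    gcc -S -O2 -ffast-math -mfpmath=sse -msse4.1 -fomit-frame-pointer \
        -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables \
        -fno-reorder-blocks bench.c          # produces bench.s
    dco -parallel bench.s                    # assumed invocation syntax
    gcc -fopenmp bench.s -o bench            # assemble/link with OpenMP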

The generated executable was executed on systems with the minimal possible load. The execution times reported are the elapsed real time between invocation and termination of the program.


results of optimizations achieved by the PARALLator

Here we present the performance results for the set of selected benchmarks. We only present cases that are interesting and/or important and/or difficult, and avoid providing an endless list of kernels ( dot-product etc. ) that were successfully parallelized by dco. When possible, a sample of the sequential code that was successfully parallelized is shown.

Visit this page for information about the parallelization of stencils by dco, the results of the execution of parallelized stencil code and a study of the accuracy of the results generated by the parallelized stencil code.

Optimization results for the following benchmarks, each successfully parallelized by dco, are listed below.

The following table presents the execution data collected so far for the selected benchmarks on different hardware. Each row represents the data for a certain x86 CPU. For each benchmark the execution times ( in seconds ) of the compiler-generated code and of the dco-parallelized code are listed, followed by the improvement achieved by dco:

                              not parallel     serial version of   alvinn          Molecular
                              on OpenMP(1)     the NPB's EP code                   Dynamics

  E8600                       4.96 / 3.14      1.85 / 1.35         0.44 / 0.27     54.44 / 32.49
  @3.33GHz, 2 cores           37%              27%                 39%             40%

  i5-4440                     2.13 / 1.77      (2)                 (3)             38.64 / 14.39
  @3.10GHz, 4 cores           17%                                                  63%

  Intel(R) Xeon(R) E5-2690    1.96 / 1.64      1.42 / 0.57         (3)             33.99 / 12.55
  v4 CPU @2.60GHz,            16%              60%                                 63%
  28 cores(4)

  Intel(R) Xeon(R) Silver     2.15 / 1.68      1.87 / 0.44         (3)             50.28 / 10.56
  4110 CPU @2.10GHz,          22%              76%                                 79%
  16 cores(4)

  Intel(R) Xeon(R) Gold       1.93 / 1.2       1.47 / 0.69         (3)             38.84 / 7.22
  6132 CPU @2.60GHz,          38%              53%                                 81%
  28 cores(4)

  AMD EPYC 7351               2.04 / 2.63      2.47 / 0.81         (3)             43.81 / 14.86
  Processor(5),               -29%             67%                                 66%
  32 cores(4)

(1)  see this for more discussion regarding this benchmark
(2)  no FORTRAN shared libraries were installed, so execution of this benchmark was not possible
(3)  execution time is too small to allow a meaningful comparison
(4)  the benchmarks were executed on Microway's compute cluster; we are grateful for the HPC time provided by Microway, Inc.
(5)  the executables were compiled and optimized for Intel CPUs

not parallel on OpenMP

There are many reasons why OpenMP may fail to parallelize code; in many such cases dco is capable of parallelizing it. The following are examples of code samples that cannot be parallelized by OpenMP but were successfully parallelized by dco. See this for more discussion of this subject.

not parallel on OpenMP - data recurrence

The following is an example of a recurrence, which may not be directly parallelized by OpenMP:
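
( The original code sample does not appear in this copy of the page; the following stand-in, with assumed array names and the ARRAY_DIM bound referenced below, shows the kind of recurrence meant - each iteration reads the value written by the previous one. )

    #define ARRAY_DIM 100000000

    double a[ARRAY_DIM], b[ARRAY_DIM];

    void recurrence( void )
    {
        int i;

        for( i = 1; i < ARRAY_DIM; i++ )
            a[i] = a[i-1] + b[i];   /* a[i-1] was written by the previous iteration */
    }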


It was successfully parallelized by dco, achieving the expected execution time improvement: the times reported here are for ARRAY_DIM set to 100000000 ( which makes the overhead of the parallel code negligible ) and, considering the cost of executing the split code, are in the range of the improvement predicted by Amdahl's law.
See this for more discussion regarding this benchmark.

not parallel on OpenMP - not a countable loop

Consider the following code for parallelization:
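
( The original sample is likewise missing from this copy; the following stand-in, with assumed names, has the two properties discussed below: the trip count is unknown at loop entry, and the iterations form a recurrence. )

    /* the exit depends on the data, so the trip count is unknown at loop entry */
    double damped_sum( const double *a, double eps )
    {
        double s = 0.;
        int    i = 0;

        while( a[i] > eps )       /* not "countable" */
        {
            s = s * 0.5 + a[i];   /* recurrence: s depends on its previous value */
            i++;
        }
        return s;
    }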


Note that this loop does not conform to a requirement of OpenMP - the loop must be "countable" - and therefore may not be processed by OpenMP; it also contains a data recurrence, which is not handled by OpenMP.
The code was successfully parallelized by dco.
See this for additional information regarding parallelization of this code. See this, section "Utilized technology", for implementation details.

not parallel on OpenMP - not structured block

OpenMP treats only "structured blocks" ( quote from the OpenMP spec: The block of code must be structured in the sense that it must have a single point of entry at the top and a single point of exit at the bottom; i.e., branching in or out of a structured block is not allowed. )
The following is an example of a non-structured block ( thus not treatable by OpenMP ) that was successfully parallelized by dco:
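
( The original sample is missing from this copy; the following stand-in, with assumed names, illustrates a block with a second exit. )

    double sum_until_negative( const double *a, int n )
    {
        double s = 0.;
        int    i;

        for( i = 0; i < n; i++ )
        {
            if( a[i] < 0. )
                goto done;      /* a second exit: branching out of the block */
            s += a[i];
        }
    done:
        return s;
    }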



serial version of the NPB's EP code

EP - "Embarrassingly Parallel" suite from the NAS Parallel Benchmarks ( NPB ). Actually it appears to be one of the toughest cases of NPB and, definitely, there is nothing embarrassing in successfully parallelizing this benchmark.



The code, as presented in the NPB serial suite, is not parallelizable. Line 10, q(l) = q(l) + 1.d0, introduces a data dependency - the way NAS solves this in the NPB OpenMP suite is by significantly altering the source of the benchmark. dco doesn't require the user to change the source code and performs all the necessary alterations "automatically". The only change we applied was to replace common/storage/ x(2*nk), q(0:nq-1) with common/storage/ x(2*nk);common/storage1/ q(0:nq-1), creating a separate memory segment for q and thus establishing the disambiguation of this memory segment from the other memory segments; the data dependency outlined above, however, was not eliminated by this alteration.

We run this benchmark for class W - running other classes produces different timings, but the ratio between the time of the compiler-generated code and the time of the dco-generated code remains the same.
See this for execution data achieved.

alvinn

alvinn is a CPU-intensive single-precision application from the SPEC floating-point benchmark suite. It represents neural network and robotic applications. Most of the loops of this benchmark are small and execute for only 30 iterations ( which makes the overhead of the parallel code significant ). See this for the execution data achieved.

All of the important loops of this benchmark were fully parallelized, for example the following one ( the body of the routine "input_hidden", which accounts for 64% of the execution time of the sequential version of the code ):
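
( The original listing is missing from this copy; the following sketch, with approximate identifiers and assumed dimensions rather than the verbatim SPEC source, shows the kind of loop nest input_hidden contains - a weighted sum of the input units feeding a sigmoid for each hidden unit. )

    #include <math.h>

    #define NUM_INPUT  1220   /* assumed dimension, for illustration only */
    #define NUM_HIDDEN 30     /* small trip count, per the text above */

    float input_units[NUM_INPUT + 1];
    float input_weights[NUM_INPUT + 1][NUM_HIDDEN];
    float hidden_units[NUM_HIDDEN];

    void input_hidden_sketch( void )
    {
        int i, j;

        for( j = 0; j < NUM_HIDDEN; j++ )
        {
            float sum = 0.0;
            /* weighted sum of the input units */
            for( i = 0; i < NUM_INPUT + 1; i++ )
                sum += input_weights[i][j] * input_units[i];
            hidden_units[j] = 1.0 / ( 1.0 + exp( -sum ) );   /* sigmoid */
        }
    }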



Molecular Dynamics

This benchmark is a CPU-intensive double-precision application that implements a molecular dynamics simulation using the velocity Verlet time integration scheme. In order to demonstrate the abilities and power of dco, while compiling the code for this benchmark we added the -fno-inline-functions compiler option ( although this does not seem to be necessary when compiling with -O2 ) - this ensured that the parallelization of the function compute would require full data-flow analysis, thus making the parallelization task more difficult and challenging.
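
For reference, a minimal sketch of one velocity Verlet time step ( this illustrates the integration scheme named above, not the benchmark's source; compute_forces and all the names are assumed ):

    void compute_forces( const double *pos, double *force, int np );  /* assumed */

    /* one velocity Verlet time step for np particles of mass m */
    void verlet_step( int np, double dt, double m,
                      double *pos, double *vel, double *acc, double *force )
    {
        int i;

        /* advance positions and half-step the velocities using the
           current accelerations */
        for( i = 0; i < np; i++ )
        {
            pos[i] += dt * vel[i] + 0.5 * dt * dt * acc[i];
            vel[i] += 0.5 * dt * acc[i];
        }

        compute_forces( pos, force, np );   /* forces at the updated positions */

        /* new accelerations; finish the velocity update */
        for( i = 0; i < np; i++ )
        {
            acc[i]  = force[i] / m;
            vel[i] += 0.5 * dt * acc[i];
        }
    }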
We run this benchmark with the following parameters:
2 500 500 0.01
See this for execution data achieved.