Modern x86 processors offer multiple execution units ( "cores" ) which allow simultaneous ( "parallel" ) execution of multiple code sequences. dco utilizes this functionality and offers powerful automatic parallelization that is capable of identifying code patterns suitable for parallelization and creating optimized code that will be executed by all available cores. The part of dco that implements the auto-parallelizer is called PARALLator ( pronounced paral-a-ter ).
In this article we will:
- describe dco's automatic parallelization technology
- compare it with OpenMP and explain the differences
- discuss the overhead of parallel execution
- present performance results for a set of selected benchmarks
Please refer to the following article and
accompanying slide show for a brief discussion
about our auto parallelization technology.
The implemented automatic parallelization is suitable for shared-memory multi-core systems with uniform memory access ( UMA ), meaning multiple identical processors ( cores ) on a single chip that have equal access, and equal access times, to memory.
dco utilizes the threads model: a single process creates and manages multiple threads that are executed in parallel by different cores. This is implemented by calling subroutines from the OpenMP library, thus requiring this library to be linked ( one way to achieve that is to specify the corresponding option during linking ).
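As a minimal illustration of this threads model ( a hand-written C sketch, not dco-generated code ), a single process forks a team of threads through the OpenMP runtime, all sharing the same memory, each potentially running on its own core; the OpenMP library must be linked for the runtime calls to resolve:

    #include <stdio.h>
    #include <omp.h>   /* OpenMP runtime; the library providing these subroutines must be linked */

    int main(void)
    {
        /* Fork-join threads model: one process creates a team of threads,
           all sharing the same memory, each executed by a different core. */
        #pragma omp parallel
        {
            printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }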
The PARALLator is enabled by a command line option of dco. PARALLator identifies code sequences suitable for parallelization, performing, if necessary, the following actions:
- code transformation to ensure that the resulting code can be parallelized using the threads model
- memory disambiguation for memory accesses that require it, generating both parallel and sequential versions of the code and choosing which one to execute at run time depending on the topology of the memory being used; combined with the following step
- conditional code generation to ensure that execution of the parallel code will be faster than execution of the sequential version of the code, generating both parallel and sequential versions of the code and choosing which one to execute at run time depending on a cost analysis performed at the time of execution; combined with the previous step ( a sketch of this runtime choice is shown below )

While generating parallel code dco performs all necessary resource saving/restoration, memory redirection, code splitting and related code transformations.
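The combined memory-disambiguation and cost-check steps amount to keeping both versions of the code and selecting between them at run time. The following hand-written C sketch illustrates the idea ( the function, the overlap test and the threshold are illustrative assumptions, not dco's generated code ):

    void scale_add(double *a, const double *b, long n)
    {
        const long MIN_PARALLEL_N = 100000;        /* assumed cost threshold          */
        int overlap = ( b >= a && b < a + n ) ||   /* runtime memory disambiguation:  */
                      ( a >= b && a < b + n );     /* do the two arrays overlap?      */

        if (!overlap && n >= MIN_PARALLEL_N) {
            /* parallel version: safe, and large enough to amortize the threading overhead */
            #pragma omp parallel for
            for (long i = 0; i < n; i++)
                a[i] = a[i] + 2.0 * b[i];
        } else {
            /* sequential version: used when parallel execution is unsafe or too costly */
            for (long i = 0; i < n; i++)
                a[i] = a[i] + 2.0 * b[i];
        }
    }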
OpenMP is an application program
interface (API) that may be used to explicitly direct multi-threaded,
shared memory parallelism. It has become a successful user model for developing parallel applications and has emerged as the de facto standard for shared-memory parallel programming. Therefore it is important to distinguish dco from OpenMP and to establish the advantages of using dco over OpenMP.
dco and OpenMP are used very differently. OpenMP requires the user to:
- identify the code sequences to be parallelized, ensuring there are no ( resource/memory ) conflicts and verifying that parallel execution will be faster
- modify the source code with the appropriate directives for parallel code generation
dco's parallel code generation, in contrast, is fully automated and does not require any involvement from the user.
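For example, to parallelize a simple reduction with OpenMP the user must identify the loop, verify its safety and annotate the source by hand ( a minimal hand-written sketch; the loop and names are illustrative ):

    double dot(const double *a, const double *b, long n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)   /* manual source annotation */
        for (long i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

dco derives an equivalent parallel structure from the unannotated sequential loop.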
Besides the differences in use, dco is capable of parallelizing code that OpenMP is not capable of parallelizing; see this for an example.
The parallel program generated by dco for a sequential code consists of two parts:
- master thread
- parallel thread
The master thread:
- performs preprocessing: saving necessary resources, calculating the loop count, establishing the data to be used by the parallel threads
- establishes memory regions that are used for communication with the parallel threads
- creates a team of parallel threads and dispatches them to different cores for execution
- synchronizes and terminates the parallel threads on their completion
- performs postprocessing: combining the results of the parallel threads' execution, restoring necessary resources etc.

A parallel thread acts as a separate subroutine that:
- calculates the parallel thread's loop count
- calculates resources at the entrance to the portion of the sequential code executed by the parallel thread ( performing code splitting, if necessary )
- executes the properly modified sequential code
- saves the necessary data for postprocessing by the master thread
( A C sketch of this master/parallel thread structure is shown below. )

Both master and parallel threads use subroutines from the OpenMP library, some of which come at a considerable cost. This overhead may significantly impact the performance of the parallelized code. While analysing the code to be parallelized, dco takes the overhead of the OpenMP subroutines into account and avoids parallelization if it determines that the cost of parallel execution exceeds its benefits.
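The master/parallel thread structure described above can be pictured with the following hand-written C sketch ( a simple sum reduction stands in for the sequential code; the names and the chunking scheme are assumptions, not dco's actual generated code ):

    #include <stdlib.h>
    #include <omp.h>

    double sum_array(const double *x, long n)
    {
        int nthreads = omp_get_max_threads();
        /* memory region used for communication between master and parallel threads */
        double *partial = calloc(nthreads, sizeof *partial);
        if (partial == NULL)
            return 0.0;

        #pragma omp parallel                  /* master creates and dispatches the team */
        {
            int  nt    = omp_get_num_threads();
            int  tid   = omp_get_thread_num();
            long chunk = (n + nt - 1) / nt;   /* per-thread loop count                  */
            long lo    = (long)tid * chunk;
            long hi    = lo + chunk < n ? lo + chunk : n;

            double s = 0.0;
            for (long i = lo; i < hi; i++)    /* properly modified sequential code      */
                s += x[i];
            partial[tid] = s;                 /* data saved for the master's postprocessing */
        }                                     /* implicit barrier: synchronize/terminate */

        double total = 0.0;                   /* postprocessing: combine partial results */
        for (int t = 0; t < nthreads; t++)
            total += partial[t];
        free(partial);
        return total;
    }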
It was estimated that on the Core2 system the overhead of utilizing the OpenMP subroutines is 3500 clocks. This means that, for example, on a dual core system parallelization may be beneficial only for sequential loops running for at least 7000 clocks:
if the execution time of a sequential loop is ExecTime clocks, the execution time of the parallel code on the dual core system will be ( roughly ) ExecTime/2 + Overhead;
for parallelization to be beneficial, the execution time of the parallel code should not exceed the execution time of the sequential code:
Overhead + ExecTime/2 < ExecTime
which gives
ExecTime > 2*Overhead
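The same reasoning extends to a system with N cores ( assuming the same per-invocation overhead; this generalization is not part of the original estimate ):
Overhead + ExecTime/N < ExecTime
ExecTime > Overhead*N/(N-1)
For example, with the 3500-clock overhead above, the break-even point on a 4 core system is roughly 4667 clocks.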
Note that the cost of the OpenMP library routines utilized by dco is only one ( but not the only ) factor that affects the performance of the parallel code.
Parallel code often behaves in a manner that is not intuitive or predictable.
The following articles we wrote may help you to gain a deeper understanding of parallel programming.
Here we explain the evaluation process and present the performance results for the set of selected benchmarks.
The evaluation was performed on 64-bit Linux operating systems running on different x86 CPUs. The gcc version 8.x compiler, used to process the benchmarks, was invoked ( in most cases; see this for an exception and the explanation of why it was done differently ) with the following command line options:
The compiler generated assembly file was optimized by dco using the corresponding command line option, and the optimized assembly code was processed ( assembled/linked ) by gcc using the corresponding command line option.
The generated executables were run on systems with the minimal possible load. The execution times reported are the elapsed real time between invocation and termination of the program.
Here we present the performance results for the set of selected benchmarks. We will only present cases that are interesting and/or important and/or difficult, and avoid providing an endless list of kernels ( dot-product etc. ) that were successfully parallelized by dco.
Where possible, a sample of the sequential code that was successfully parallelized is shown.
Optimization results for the following benchmarks, which were successfully parallelized by dco, are listed below:
The following table presents the execution data collected so far for the selected benchmarks on different hardware.
Each row represents the data for a certain x86 CPU. For each benchmark, the data listed consists of two columns:
- the first column lists execution times and consists of two rows:
  - the execution time of the gcc generated serial code, in seconds
  - the execution time of the dco parallelized code, in seconds
- the second column shows how much faster the dco generated parallel code is than the gcc generated serial code.
[ Table: for each benchmark and CPU, execution times ( gcc serial vs. dco parallel, in seconds ) and the resulting speedup. Systems listed include a 2 core CPU @3.33GHz, a 4 core CPU @3.10GHz, an Intel(R) Xeon(R) E5-2690 v4 CPU @2.60GHz with 28 cores(4), an Intel(R) Xeon(R) Silver CPU @2.10GHz with 16 cores(4) and an Intel(R) Xeon(R) Gold CPU @2.60GHz with 28 cores(4). Some entries are marked "not parallel on OpenMP"(1) or "serial version of the NPB's EP code". ]
(1) see this for more discussion regarding this benchmark
(2) no FORTRAN shared libraries were installed, so execution of this benchmark was not possible
(3) the execution time is too small to allow a meaningful comparison
(4) these benchmarks were executed on Microway's compute cluster; we are grateful for the HPC time provided by Microway, Inc.
(5) the executables were compiled and optimized for Intel's CPUs
The following is an example of a recurrence that may not be directly parallelized by OpenMP:
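An illustrative loop of this kind ( a hand-written C sketch using the ARRAY_DIM value mentioned below, not necessarily the benchmark's exact source ) is:

    #include <stdio.h>
    #include <stdlib.h>

    #define ARRAY_DIM 100000000   /* value used for the timings reported below */

    int main(void)
    {
        double *a = malloc(ARRAY_DIM * sizeof *a);
        double *b = malloc(ARRAY_DIM * sizeof *b);
        if (!a || !b) return 1;

        for (long i = 0; i < ARRAY_DIM; i++)
            b[i] = (double)i;

        /* First-order linear recurrence: iteration i depends on iteration i-1,
           so a plain "#pragma omp parallel for" cannot be applied to this loop. */
        a[0] = b[0];
        for (long i = 1; i < ARRAY_DIM; i++)
            a[i] = a[i - 1] + b[i];

        printf("%f\n", a[ARRAY_DIM - 1]);
        free(a); free(b);
        return 0;
    }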
It was successfully parallelized by dco, achieving the expected execution time improvement:
the times reported here are for ARRAY_DIM set to 100000000 ( which made the overhead of the parallel code negligible ) and, considering the cost of executing the split code, are in the range of the improvement derived from Amdahl's law.
See this for more discussion regarding this benchmark.
EP - "Embarrassingly Parallel" suite from the NAS Parallel Benchmarks ( NPB ).
Actually it appears to be one of the toughest cases of NPB and, definitely, there is nothing embarrassing
in successfully parallelizing this benchmark.
The code, as it is presented in NPB serial suite, is not parallelizable.
The line 10: q(l) = q(l) + 1.d0 introduces data dependency -
the way it is solved by NAS in NPB OpenMP suite is by significantly altering the source of the benchmark.
dco doesnt require user to change the source code and performs all necessary
alterations "automatically". The only change we did apply is to replace
common/storage/ x(2*nk), q(0:nq-1) by
common/storage/ x(2*nk);common/storage1/ q(0:nq-1) creating different memory segment for q
thus establishing disambiquation of this memory segment from other memory segments; however data dependency, outlined above, wasn't eliminated by this alteration.
We run this benchmark for class W - running for other classes produces different timing but the ratio between time
of the compiler generated code and time of the dco generated code remains the same.
See this for execution data achieved.
alvinn is a CPU intensive single-precision application from the SPEC floating-point benchmark suite. It represents neural network and robotic applications. Most of the loops of this benchmark are small and executed for only 30 iterations ( which made the overhead of the parallel code significant ). See this for the execution data achieved.
All of the important loops of this benchmark were fully parallelized, for example the following one ( the body of the routine "input_hidden", which accounts for 64% of the execution time of the sequential version of the code ):
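The kind of loop nest involved is sketched below in hand-written C; the array names and dimensions are illustrative assumptions, not the SPEC benchmark's verbatim source:

    #define NUM_INPUT  1220   /* illustrative sizes only */
    #define NUM_HIDDEN 30

    void input_hidden_like(const float input_units[NUM_INPUT],
                           const float input_weights[NUM_INPUT][NUM_HIDDEN],
                           float hidden_sum[NUM_HIDDEN])
    {
        for (int h = 0; h < NUM_HIDDEN; h++)       /* clear the accumulators            */
            hidden_sum[h] = 0.0f;

        for (int i = 0; i < NUM_INPUT; i++)        /* accumulate weighted sums of the   */
            for (int h = 0; h < NUM_HIDDEN; h++)   /* input units into the hidden units */
                hidden_sum[h] += input_weights[i][h] * input_units[i];
    }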
This benchmark is a CPU intensive double-precision application that implements a molecular dynamics simulation using the velocity Verlet time integration scheme.
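For reference, a generic velocity Verlet update step looks roughly as follows ( a hand-written C sketch; the benchmark's actual compute function and data layout are not reproduced here ):

    /* One velocity Verlet step for n particles in one dimension: positions are
       advanced with the current accelerations; after the forces are recomputed
       from the new positions, velocities are advanced with the average of the
       old and new accelerations. */
    void verlet_positions(double *x, const double *v, const double *a, long n, double dt)
    {
        for (long i = 0; i < n; i++)                        /* advance positions  */
            x[i] += v[i] * dt + 0.5 * a[i] * dt * dt;
    }

    /* ... forces/accelerations are recomputed from the new positions here ... */

    void verlet_velocities(double *v, const double *a_old, const double *a_new, long n, double dt)
    {
        for (long i = 0; i < n; i++)                        /* advance velocities */
            v[i] += 0.5 * ( a_old[i] + a_new[i] ) * dt;
    }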
In order to demonstrate the abilities and power of dco, while compiling the code for this benchmark we added the compiler options
( although this doesn't seem to be necessary when compiling with ) -
this ensured that parallelization of the function compute requires full data flow analysis, thus making the parallelization task more difficult and challenging.
We ran this benchmark with the following parameters:
See this for the execution data achieved.