Modern x86 processors offer multiple execution units ( "cores" ) which allow simultaneous ( "parallel" ) execution of multiple code sequences. dco utilizes this functionality and offers powerful automatic parallelization that is capable of identifying code patterns suitable for parallelization and of creating optimized code that will be executed by all the available cores. The part of dco that implements the auto-parallelizer is called the PARALLator ( pronounced paral-a-ter ). In this article we will describe how the PARALLator operates, compare it with OpenMP, explain the structure and overhead of the generated parallel code, and present the optimizations it achieves.

how the PARALLator operates

The automatic parallelization implemented by dco is suitable for shared-memory multi-core systems with uniform memory access ( UMA ), meaning multiple identical processors ( cores ) on a single chip that have equal access, and equal access times, to memory.

dco utilizes a threads model: a single process creates and manages multiple threads that are executed in parallel by different cores. This is implemented by calling subroutines from the OpenMP library, thus requiring this library to be linked ( one way to achieve that is to specify the -fopenmp option during linking ).

The PARALLator is enabled by the -parallel command line option of dco. It identifies code sequences suitable for parallelization and, while generating the parallel code, performs all the necessary resource saving/restoration, memory redirection, code splitting, code modification, thread startup/termination etc.

compare with OpenMP

OpenMP is an application program interface ( API ) that may be used to explicitly direct multi-threaded, shared-memory parallelism. It has become a successful user model for developing parallel applications and has emerged as the de facto standard for shared-memory parallel programming. It is therefore important to distinguish dco from OpenMP and to establish the advantages of using dco over OpenMP.

dco and OpenMP are used very differently. OpenMP requires the user to analyse the code, identify the regions that can be parallelized and annotate them with the appropriate compiler directives. dco's parallel code generation is fully automated and does not require any involvement from the user.
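
As an illustration, a simple reduction loop annotated for OpenMP might look as follows ( a hypothetical example written for this comparison, not taken from the benchmarks discussed below ):

manually annotated OpenMP loop
#define N 1000000

double dot( const double *a, const double *b )
{
    double sum = 0.0;
    long i;

    /* the programmer must recognize that the loop is parallelizable and
       state how the per-thread partial sums are to be combined          */
    #pragma omp parallel for reduction(+:sum)
    for ( i = 0; i < N; i++ )
        sum += a[i] * b[i];

    return sum;
}

dco performs the equivalent analysis on the compiler generated code without any such annotations.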

Besides the differences in use, dco is capable of parallelizing code that OpenMP cannot parallelize directly; see the recurrence example in the "not parallel on OpenMP" section below.

parallel code explained

The parallel program generated by dco for a sequential code consists of two parts:
  1. master thread
  2. parallel thread

master thread

The master thread executes the sequential parts of the program; when a parallelized code sequence is reached it starts the parallel thread, executes its own share of the work and waits for the parallel thread to complete before continuing.

parallel thread

The parallel thread acts as a separate subroutine that executes its assigned share of the parallelized code.
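
To make this structure concrete, below is a minimal hand-written sketch in C of a loop split into a master thread and a parallel thread, assuming the GOMP_parallel_start/GOMP_parallel_end entry points of GCC's OpenMP library ( libgomp ); the names, the loop and the data are invented for the illustration, and this shows only the general scheme, not the code dco actually emits:

sketch of a master thread and a parallel thread
/* entry points of GCC's OpenMP runtime library ( libgomp ) */
extern void GOMP_parallel_start( void (*fn)(void *), void *data,
                                 unsigned num_threads );
extern void GOMP_parallel_end( void );

#define N 1000000
static double a[N], b[N], c[N];

struct chunk { long lo, hi; };     /* iteration range assigned to a thread */

/* the parallel thread: a separate subroutine executing its share of the loop */
static void parallel_thread( void *arg )
{
    struct chunk *p = arg;
    long i;
    for ( i = p->lo; i < p->hi; i++ )
        c[i] = a[i] * b[i];
}

/* the master thread: the original code with the loop split in two */
void master_thread( void )
{
    struct chunk first_half  = { 0, N/2 };
    struct chunk second_half = { N/2, N };

    GOMP_parallel_start( parallel_thread, &second_half, 2 ); /* start the worker   */
    parallel_thread( &first_half );                          /* master's own share */
    GOMP_parallel_end();                                     /* wait, clean up     */
}

The thread startup in GOMP_parallel_start and the wait in GOMP_parallel_end are where the overhead discussed in the next section is incurred; the program must be linked with -fopenmp for these routines to be resolved.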

overhead of the parallel code

Both the master and the parallel threads use subroutines from the OpenMP library, which comes at a considerable cost. This overhead may significantly impact the performance of the parallelized code. While analysing the code to be parallelized, dco takes the overhead of the OpenMP subroutines into account and avoids parallelization if it determines that the cost of parallel execution exceeds its benefits.

It was estimated that on the Core2 system the overhead of utilizing the OpenMP subroutines is about 3500 clocks. This means that, for example, on a dual core system parallelization may be beneficial only for sequential loops running for at least 7000 clocks:

If the execution time of a sequential loop is ExecTime clocks, the execution time of the parallel code on a dual core system will be ( roughly ) ExecTime/2 + Overhead.
For parallelization to be beneficial, the execution time of the parallel code must be less than the execution time of the sequential code:
Overhead + ExecTime/2 < ExecTime
thus
ExecTime > 2*Overhead
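
Plugging in the estimated overhead of 3500 clocks gives ExecTime > 7000 clocks, the threshold quoted above. The same reasoning extends to a system with N cores ( our extrapolation, not a figure published for dco ):
Overhead + ExecTime/N < ExecTime
thus
ExecTime > Overhead*N/( N-1 )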

Note that the cost of the OpenMP library routines utilized by dco is only one of the factors ( though not the only one ) that affect the performance of the parallel code.

optimizations achieved by the PARALLator

Here we present the performance results for a set of selected benchmarks. We present only the cases that are interesting, important or difficult, and avoid listing the endless kernels ( dot-product etc. ) that were successfully parallelized by dco. When possible, a sample of the sequential code that was successfully parallelized is shown.

The evaluation was performed on a 64-bit Linux operating system running on a 3.33GHz Core2 dual core computer. The gcc version 4.3.2 compiler, used to process the benchmarks, was invoked with the following command line options:

-S -O3 -fomit-frame-pointer -funroll-all-loops -ffast-math -march=core2 -mfpmath=sse -msse4.1 -mmmx

The compiler generated assembly file was optimized by dco using the -parallel command line option, and the optimized assembly code was processed ( assembled/linked ) by gcc using the -fopenmp command line option. The generated executable was run on the system under the minimal possible load.

The execution times reported are the elapsed real times between invocation and termination of the program.

not parallel on OpenMP

The following is an example of a recurrence that may not be directly parallelized by OpenMP:

code to be parallelized ( the inner loop )
double a[ARRAY_DIM], c[ARRAY_DIM];
.
double wrap;
wrap = 0.;
for ( i = 0; i < ARRAY_DIM; i++ ) {
    c[i] = exp( wrap );
    wrap += a[i];
}

It was successfully parallelized by dco, achieving the expected execution time improvement: with ARRAY_DIM set to 100000000 ( which made the overhead of the parallel code negligible ), the gcc generated serial code ran for 5.327 seconds and the dco parallelized code ran for 3.053 seconds which, considering the cost of executing the split code, is in the range of the improvement derived from Amdahl's law.
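
For comparison, parallelizing this recurrence by hand requires restructuring rather than a simple work-sharing directive: each thread needs the value of wrap at the start of its slice, i.e. the prefix sum of a[] up to that point. The following sketch shows one such decomposition using OpenMP directives ( the routine name is invented, and this is not necessarily the transformation dco applies ):

hand-parallelized version of the recurrence ( illustration only )
#include <math.h>
#include <omp.h>

#define ARRAY_DIM 100000000
double a[ARRAY_DIM], c[ARRAY_DIM];

void exp_of_prefix_sums( void )
{
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();
        long lo = (long)ARRAY_DIM * tid / nthreads;
        long hi = (long)ARRAY_DIM * ( tid + 1 ) / nthreads;
        double wrap = 0.;
        long i;

        /* each thread first recomputes the cheap prefix sum a[0..lo-1]
           to obtain the value of wrap at the start of its slice ...    */
        for ( i = 0; i < lo; i++ )
            wrap += a[i];

        /* ... and then executes its own slice of the original loop,
           so the expensive exp() calls are divided among the cores     */
        for ( i = lo; i < hi; i++ ) {
            c[i] = exp( wrap );
            wrap += a[i];
        }
    }
}

The redundant additions are cheap compared to the exp() calls, which is why a decomposition of this kind can still approach the improvement predicted by Amdahl's law, as the measurements above show.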

tomcatv

tomcatv is a mesh generation program from the SPEC floating-point benchmark suite. It represents computer-aided design ( CAD ) and similar engineering applications. The dimension of the matrix was set to 127 ( which made the overhead of the parallel code noticeable ). The gcc generated serial code ran for 0.061 seconds and the dco parallelized code ran for 0.041 seconds, thus a 33% improvement.

Some of the loops of this benchmark have memory dependencies between different iterations and were not optimized. However, most of the loops were fully parallelized ( fully meaning "the outermost and all nested" ), for example, the following one:


fully parallelized nested loops
      DO 310 J = J1P,J2M
         JP = J+1
         JM = J-1
         M = M+1
C
C     I-LOOP
C
         DO 250 I = I1P,I2M
            IP = I+1
            IM = I-1
            XX = X(IP,J)-X(IM,J)
            YX = Y(IP,J)-Y(IM,J)
            XY = X(I,JP)-X(I,JM)
            YY = Y(I,JP)-Y(I,JM)
            A = 0.25*(XY*XY+YY*YY)
            B = 0.25*(XX*XX+YX*YX)
            C = 0.125*(XX*XY+YX*YY)
            QI = 0.
            QJ = 0.
            AA(I,M) = -B
            DD(I,M) = B+B+A*REL
            PXX = X(IP,J)-2.*X(I,J)+X(IM,J)
            QXX = Y(IP,J)-2.*Y(I,J)+Y(IM,J)
            PYY = X(I,JP)-2.*X(I,J)+X(I,JM)
            QYY = Y(I,JP)-2.*Y(I,J)+Y(I,JM)
            PXY = X(IP,JP)-X(IP,JM)-X(IM,JP)+X(IM,JM)
            QXY = Y(IP,JP)-Y(IP,JM)-Y(IM,JP)+Y(IM,JM)
C
C     CALCULATE RESIDUALS ( EQUIVALENT TO RIGHT HAND SIDES OF EQUS. )
C
            RX(I,M) = A*PXX+B*PYY-C*PXY+XX*QI+XY*QJ
            RY(I,M) = A*QXX+B*QYY-C*QXY+YX*QI+YY*QJ
  250    CONTINUE
  310 CONTINUE

alvinn

alvinn is a CPU intensive single-precision application from the SPEC floating-point benchmark suite. It represents neural network and robotic applications. Most of the loops of this benchmark are small and execute for 30 iterations only ( which made the overhead of the parallel code significant ). The gcc generated serial code ran for 0.428 seconds and the dco parallelized code ran for 0.361 seconds, thus a 16% improvement.

Most of the loops of this benchmark were fully parallelized, for example, the following one ( the body of the routine "input_hidden", which accounts for 65% of the execution time of the sequential version of the code ):


fully parallelized nested loops
for ( ; receiver <= end_receiver; ) {
    *receiver = 0.0;
    sender = &input_act[0];
    end_sender = &input_act[NIU];
    for ( ; sender <= end_sender; )
        *receiver += (*sender++) * (*weight++);
    *receiver = (1.0 / (1.0 + exp((double)-(*receiver))));
    *receiver++;
}
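
A hand-parallelized version of this loop illustrates what has to be done to remove the pointer-carried dependence: the weight pointer advances continuously across the outer iterations, so each outer iteration must compute its own starting offset into the weight array. The sketch below uses OpenMP directives and invented names and dimensions ( NHU, weights, hidden_act and the array sizes are placeholders, and this is not necessarily the transformation dco performs ):

hand-parallelized version of the loop ( illustration only )
#include <math.h>

/* placeholder declarations standing in for the benchmark's data */
#define NIU 1024
#define NHU 30
float input_act[NIU+1], hidden_act[NHU], weights[NHU*(NIU+1)];

void input_hidden( void )
{
    long k;

    /* the outer loop is distributed among the threads; every iteration
       derives its own position in the weight array from k              */
    #pragma omp parallel for
    for ( k = 0; k < NHU; k++ ) {
        const float *sender = &input_act[0];
        const float *weight = &weights[ k*(NIU+1) ];
        double sum = 0.0;
        long j;

        for ( j = 0; j <= NIU; j++ )
            sum += sender[j] * weight[j];

        hidden_act[k] = (float)( 1.0 / (1.0 + exp( -sum )) );
    }
}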