Generating optimal code for the x86 architecture requires techniques that were not needed for previous hardware architectures. These techniques are therefore unfamiliar to most programmers and are not incorporated into the x86 tools currently available on the market. As a result, coding for the x86 is inefficient and time-consuming, and the generated code is, in most cases, far from optimal.
We can help you solve this problem. We have developed an optimizing/parallelizing code system for the x86 family of processors.
dco is a software package specifically designed to optimize x86 assembly code by taking full advantage of the options and features
provided by the x86 processor.
dco shall be used to optimize compiler-generated code. The programmer uses a compiler ( C, Fortran etc. ) to translate source code into
x86 assembly code, which serves as the input to dco. The output generated by dco is highly optimized
x86 assembly code that is logically identical to the original; dco rearranges the existing code, performing optimizations
that take full advantage of the functionality offered by the x86 processor. To create the final object file, the generated code must be assembled.
Note that dco does not require preprocessing or any other involvement from the user. It is fully automated and may be incorporated into makefiles
or other product generation tools.
Use of dco will greatly improve the quality of the generated code. It, therefore, may prove to be a vital contribution to the production
of a winning x86 solution.
- Takes full advantage of the x86 functionality
dco optimizes x86 code by taking full advantage of the options and features provided by the x86 processor.
- Fully automated
dco does not require preprocessing or any other user involvement. It is fully automated and may be incorporated into makefiles or other product generation tools.
- Flexible
dco has many input options that allow the user to choose the optimization techniques to be performed as well as the level of optimization to be applied.
The currently available implementation of the optimizer shall be used only to optimize code generated by the gcc compiler.
It accepts x86 assembly code in the so-called AT&T assembler syntax ( see here for some explanation ), in which the destination operand is written last; so
addsd %xmm1,%xmm2
is interpreted as
%xmm2 = %xmm2 + %xmm1
dco currently supports the IA-32 and x86-64 architectures featuring the SSE, SSE2, SSE3 and SSE4 extensions.
dco is currently available for Linux. An experimental version of the product for 64-bit Windows is also provided.
See this for more details.
dco is a software package that optimizes x86 code by taking full advantage of the options and features provided by the processor.
It implements a great number of optimization techniques, among which the following seem to be of particular importance:
- Auto-Parallelization: dco utilizes the multiple execution units ( "cores" ) found on modern x86 processors and offers powerful auto-parallelization that is capable of identifying code patterns suitable for parallelization and of creating optimized code that will be executed by all the available cores.
- SSE utilization: dco takes full advantage of the SSE/SSE2/SSE3/SSE4 functionality, being capable, among many other things, of utilizing SIMD instructions.
- Memory optimization: Memory optimization performs static and dynamic memory alias analysis, attempting to resolve dependencies caused by memory-accessing instructions. dco is capable of maintaining an accurate address representation, which allows it to compare different addresses as well as to determine address alignment.
- Instruction-Level Parallelization: Instruction-level parallelization attempts to rearrange instructions to take advantage of the numerous functional units available on an x86 microprocessor, so that more than one instruction can be executed in each cycle.
By default dco will perform most of the available optimizations. It is possible to enable or disable any number of the optimization techniques.
A basic block is a sequence of non-branching x86 instructions in which the flow of control enters at the top.
dco treats as a basic block any sequence of instructions preceded and/or followed by a label or a branch instruction.
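The definition above can be illustrated with a short C sketch. It is not dco's implementation: the function name count_basic_blocks and the rules used to recognize a label ( a line ending in ':' ) and a branch ( a mnemonic starting with 'j' ) are assumptions made only for this example. Directives, comment lines and call/ret instructions are not treated specially here.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the line is a label such as ".L1:". */
static int is_label(const char *line)
{
    size_t len = strlen(line);

    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;                          /* ignore trailing whitespace */
    return len > 0 && line[len - 1] == ':';
}

/* Hypothetical helper: nonzero for jump mnemonics (jmp, jne, jnb, ...). */
static int is_branch(const char *line)
{
    while (*line != '\0' && isspace((unsigned char)*line))
        line++;                         /* ignore leading whitespace */
    return *line == 'j';
}

/* Counts maximal runs of instructions delimited by labels or branches. */
size_t count_basic_blocks(const char *lines[], size_t n)
{
    size_t blocks = 0;
    int in_block = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_label(lines[i]) || is_branch(lines[i])) {
            in_block = 0;               /* a delimiter ends the current block */
        } else if (!in_block) {
            blocks++;                   /* first instruction of a new block */
            in_block = 1;
        }
    }
    return blocks;
}
```

Under these simplified rules, a conditional jump followed by a label splits the surrounding instructions into separate blocks, each of which would be compared against the -bbs limit on its own.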
dco determines the resource usage of subroutines defined in the code to be optimized and assumes the following x86 calling conventions for subroutine calls:
for the IA-32 ( 32-bit mode ) Linux code:
- saved by a call
EBP, EBX, ESI, EDI
- used by a call ( parameters etc. )
ESP, EDX, ECX, EAX, EBX, EBP, ESI, EDI
- updated by a call
EAX, EDX, ECX, MM0-MM7, XMM0-XMM7, all flags
for the x86-64 ( 64-bit mode ) Linux code:
- saved by a call
RBP, RBX, R12-R15
- used by a call ( parameters etc. )
RSP, RDI, RSI, RDX, RCX, R8, R9, XMM0-XMM7
- updated by a call
RAX, RCX, RDX, RSI, RDI, R8-R11, MM0-MM7, XMM0-XMM15, all flags
for the x86-64 ( 64-bit mode ) Windows code:
- saved by a call
RBX, RSI, RDI, RBP, R12-R15, XMM6-XMM15
- used by a call ( parameters etc. )
RSP, RCX, RDX, R8, R9, XMM0-XMM3
- updated by a call
RAX, RCX, RDX, R8-R11, MM0-MM7, XMM0-XMM5, all flags
Although the stack word size is 8 bytes, it is assumed that the stack pointer is aligned to 16 bytes at ( before ) any call instruction.
Consequently we rely on
RSP = 8 modulo 16
at every function entry. It was observed that when compiling with the option -O0 ( and,
likely, when using -Os or -mpreferred-stack-boundary=2 )
this assumption does not hold, and therefore such code may not be optimized by dco.
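The arithmetic behind this assumption can be sketched in C. The function names below are hypothetical and serve only as an illustration: a call instruction in 64-bit mode pushes an 8-byte return address, so a stack pointer that is a multiple of 16 before the call becomes 8 modulo 16 at function entry.

```c
#include <stdint.h>

/* Size of the return address pushed by a call in 64-bit mode. */
#define RETURN_ADDRESS_SIZE 8u

/* Stack pointer seen at function entry, given its value just before the call. */
uint64_t rsp_at_entry(uint64_t rsp_before_call)
{
    return rsp_before_call - RETURN_ADDRESS_SIZE;
}

/* Nonzero if the value satisfies the assumption RSP = 8 modulo 16. */
int entry_assumption_holds(uint64_t rsp)
{
    return rsp % 16u == 8u;
}
```

A caller whose stack pointer is only 8-byte aligned before the call ( for example an address ending in 0x8 ) yields an entry value that is 0 modulo 16, which violates the assumption.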
It is assumed that the direction flag DF is cleared by default.
If the direction flag is set, it is assumed that it will be cleared again before any call or return.
If the direction flag is set by dco, it will be cleared right after its use.
The invocation of dco has the form:
dco <parameter list>
The following is a list of the parameters and descriptions of their functionality:
[-i] file_name
- specifies the input source file, which contains the x86 assembly code to be optimized. It is the only parameter that must be specified.
-o file_name
- specifies the output file, which will contain the x86 assembly code generated by dco. stdout is used if this parameter is not specified.
-32
- processes code for 32-bit environment ( IA-32 ); this option is available only on Linux.
-64
- processes code for 64-bit environment ( x86-64 ); this option is available only on Linux.
-er
- exact results - ensures that the optimized code will generate results identical to those of the original. See this for more explanation.
-noer
- allows optimizations that may generate results not identical to those of the original. See this for more explanation.
-packing
- enables the SIMDinator, which attempts to pack SIMD instructions.
-nopacking
- disables the SIMDinator, which attempts to pack SIMD instructions; note that SIMD instructions may still be used.
-parallel
- enables auto-parallelization. See this for more explanation.
-noparallel
- disables auto-parallelization.
-lu #
- specifies the loop unrolling parameter ( the number of times the loop will be unrolled ). The default is 1 ( no loop unrolling ). See this for more explanation.
-bbs #
- basic block size - specifies the maximum number of instructions in a basic block to be processed. The default is 100.
-space
- space optimizations - improves code size rather than the execution time of the code ( that is a default ).
-nospace
- speed optimizations - improves the execution time of the code, possibly making it larger.
-quick
- quick optimizations - performs optimizations only for basic blocks that are bodies of loops.
-noquick
- performs optimizations for all basic blocks of the code.
-slct
- causes only selected areas of code to be optimized. See this for more explanation.
The default invocation parameters are:
-64
-noer
-lu 1
-bbs 100
-noparallel
-noquick
-packing
To obtain help information from dco, invoke it without any parameters.
To obtain a brief description of the input options, invoke: dco -h.
To obtain a brief description of the product, invoke: dco -about.
dco accepts as input x86 assembly files in the so-called AT&T assembler syntax ( see here for details ).
All comment lines are ignored, with the exceptions described in the following two sections.
dco allows you to optimize only selected portions of your code. To do
this you must specify the -slct option on the command line and mark the portions of the code
to be optimized. Code selection is done by enclosing the desired portion of the code between a start marker and an end marker.
The markers may be specified in the following ways.
Code selection can be specified by enclosing the desired portion of the code between calls to the following functions:
[text]dco_start[text]()
and
[text]dco_end[text]()
The [text] preceding/following dco_start and dco_end may be used to comment on the portion of the code being selectively optimized and does not have to be the same for the start and end function names.
Note that [text]dco_start[text] and [text]dco_end[text] shall be valid function names.
See this for an example of how to use this specification.
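As an illustration, a C source file could mark a loop in the following way. This is only a sketch: the function dot_product is hypothetical, the DCO macro follows the conditional-compilation convention used for the Fortran example in this document, and the prototypes are an assumption made so that the marker calls compile. Whether definitions of the marker functions must also be supplied at link time is not covered here.

```c
#include <stddef.h>

#ifdef DCO
/* Marker functions recognized by dco when -slct is given; declared here
   only so that the calls compile (assumed prototypes). */
void dco_start(void);
void dco_end(void);
#endif

/* Hypothetical hot loop selected for optimization. */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;

#ifdef DCO
    dco_start();                /* beginning of the selected portion */
#endif
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
#ifdef DCO
    dco_end();                  /* end of the selected portion */
#endif

    return sum;
}
```

Compiled without -DDCO, the file contains no marker calls and behaves as the unmodified original; compiled with -DDCO and -S, the resulting assembly carries the two calls that delimit the selected code.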
Code selection can also be specified by enclosing the desired portion of the code between the comment lines:
#.dco_start <option list>
and
#.dco_end
#.dco_start <option list> indicates the beginning of the portion of the code that should be selectively optimized. The selectively optimized code extends to the comment line #.dco_end or, if there is none, to the end of the input file.
<option list> is a list of options that will be in effect while optimizing the selected portion of the code; these options, if specified, override the default parameters or the parameters specified on the invocation line.
The options that may be specified include all the options listed here except -slct, -i and -o.
Note that the selection of the code portion shall be written exactly as shown above. For example,
# .dco_start
will be considered as just a comment line ( # followed by a space ).
See this for more on selected code optimization.
dco allows you to change the options that were in effect at invocation. This is done by specifying the comment line:
#.dco_options <option list>
<option list> is a list of options that will be in effect while optimizing the following portion of the code; these options, if specified, override the default parameters or the parameters specified on the invocation line.
Specifying #.dco_options without <option list> will cause the options to be restored to those of the invocation of dco.
The options that may be specified include all the options listed here except -slct, -i and -o.
Note that the comment for option specification should look exactly as written above. For example,
# .dco_options
will be considered as just a comment line.
dco offers powerful auto-parallelization that is capable of identifying code patterns suitable for parallelization and of creating optimized code that will be executed by all the available cores.
See this for additional information about the auto-parallelization provided by dco.
To enable auto-parallelization you must specify the -parallel
option and request the use of the OpenMP library during linkage ( one way to achieve that is to specify the -fopenmp
option while linking ). For example, to parallelize 'test.c',
creating the executable 'test', do the following:
gcc -S test.c
dco -i test.s -o otest.s -parallel
gcc -o test -fopenmp otest.s
rm otest.s test.s
dco auto-parallelizes code sequences spanning numerous basic blocks that may include function calls.
It is assumed that functions from the standard libraries are ISO C and POSIX compliant and satisfy the requirements specified here.
Do not use the auto-parallelizer if that is not the case on your development system.
The auto-parallelizer shall not be applied to code that is already parallelized, e.g. by OpenMP or dco.
This section contains hints and suggestions on using the features provided by dco.
It should not be considered a comprehensive guide to the usage of dco.
As you gain experience using the product, you will develop other techniques that suit your needs and professional habits.
dco is designed to work with the gcc compiler on Linux or a port of the gcc compiler on Windows ( e.g. mingw-w64 )
that generates assembly output of the compiled code; this is achieved by specifying the -S option during compilation.
Compiler-generated code shall conform to the mode in which dco is invoked:
For Linux:
if -32 is specified - compiler shall generate 32-bit code
if -64 is specified - compiler shall generate 64-bit code
For Windows:
compiler shall always generate 64-bit code
In 64-bit mode on Linux, dco was fully verified to work with the clang compiler version 9.0.1; in the current version
of dco, support for clang is provided as "experimental".
Assume that the compiler driver gcc is available on your system. To optimize the file 'test.c' do the following:
gcc -S test.c
dco -i test.s -o otest.s
mv otest.s test.s
gcc -c test.s
rm test.s
gcc -S test.c compiles the input file 'test.c' and generates the assembly output file 'test.s'. You may specify other compiler options ( to perform optimization etc. ); see this for information about that.
dco -i test.s -o otest.s optimizes the input file 'test.s', generating the output file 'otest.s'.
mv otest.s test.s renames the file 'otest.s' to 'test.s'.
gcc -c test.s assembles the file 'test.s', producing the object file 'test.o'.
rm test.s deletes the file 'test.s'.
The described procedure may be easily incorporated into makefiles, batch
files or other product generation tools. For example, makefiles are often used to generate object files by specifying rules that translate
a C source file into an object file, e.g.
.c.o:
	$(CC) $(CFLAGS) -c $<
In order to incorporate dco the rule may be rewritten as:
.c.o:
	$(CC) $(CFLAGS) -S $<
	dco -i $*.s -o $*.so $(DCO_OPT)
	mv $*.so $*.s
	$(CC) $(CFLAGS) -c $*.s
	rm $*.s
This paragraph explains the usage of the basic command options. Note that,
in most cases, disabling an optimization option will decrease the quality of resulting code.
dco always produces code that is mathematically equivalent to the
original. However, due to the inexact nature of floating-point arithmetic, the results of the optimized code may differ
from those of the original code. For example, the original code:
addsd %xmm2,%xmm6
addsd %xmm3,%xmm6
addsd %xmm4,%xmm6
may be substituted by:
addsd %xmm2,%xmm3
addsd %xmm4,%xmm6
addsd %xmm3,%xmm6
which, although mathematically equivalent to the original, may produce a
different value in the register xmm6 because the additions are performed in a different order.
Optimizations that may cause such behaviour can be disabled by using the parameter -er.
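The effect can be reproduced in plain C. The sketch below mirrors the two instruction sequences above with double variables named after the registers; the function names are chosen only for this illustration.

```c
/* Original order: xmm6 = ((xmm6 + xmm2) + xmm3) + xmm4 */
double sum_original(double x6, double x2, double x3, double x4)
{
    x6 += x2;                   /* addsd %xmm2,%xmm6 */
    x6 += x3;                   /* addsd %xmm3,%xmm6 */
    x6 += x4;                   /* addsd %xmm4,%xmm6 */
    return x6;
}

/* Reordered: xmm3 = xmm3 + xmm2; xmm6 = (xmm6 + xmm4) + xmm3 */
double sum_reordered(double x6, double x2, double x3, double x4)
{
    x3 += x2;                   /* addsd %xmm2,%xmm3 */
    x6 += x4;                   /* addsd %xmm4,%xmm6 */
    x6 += x3;                   /* addsd %xmm3,%xmm6 */
    return x6;
}
```

With x6 = 1e20, x2 = x3 = 1.0 and x4 = -1e20, sum_original returns 0.0 because each 1.0 is lost when added to 1e20, while sum_reordered returns 2.0. This holds for IEEE 754 double arithmetic, provided the compiler itself is not permitted to reassociate the additions ( e.g. the file is compiled without -ffast-math ).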
Choosing this option ( -quick ) may significantly reduce the CPU time of the package execution without having a great impact on the quality of the produced code.
No special preparations are necessary for the source code to be optimized by dco.
However, it is strongly suggested to optimize only the portions of the code in which the program spends most of its execution time; to do that, use
selected code optimization.
To use selected code optimization and/or to change the optimizer's options ( as specified here ), the source of the program shall be altered before compilation and/or the assembly input to dco
shall be changed before optimization.
The following shows how to prepare a block of Fortran code for optimization by dco.
Note that the calls to the special functions dco_start and dco_end
are compiled conditionally, which allows the original code to be compiled without any modification. If dco is to be used,
-DDCO shall be specified as a compiler option during compilation.
#ifdef DCO
call dco_start
#endif
do 140 i = 1, nk
x1 = 2.d0 * x(2*i-1) - 1.d0
x2 = 2.d0 * x(2*i) - 1.d0
t1 = x1 ** 2 + x2 ** 2
if (t1 .le. 1.d0) then
t2 = sqrt(-2.d0 * log(t1) / t1)
t3 = (x1 * t2)
t4 = (x2 * t2)
l = max(abs(t3), abs(t4))
q(l) = q(l) + 1.d0
sx = sx + t3
sy = sy + t4
endif
140 continue
#ifdef DCO
call dco_end
#endif
gcc provides the asm statement, which allows the selection comment lines to be placed in the C source before compilation as follows:
.
asm( "#.dco_start" );
Code to be optimized by dco
asm( "#.dco_end" );
.
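A complete function using these markers might look as follows. This is a sketch under stated assumptions: sum_of_squares is a hypothetical function, and the __asm__ spelling is used instead of asm only because it is also accepted by gcc in strict ISO C modes. Each marker string is copied by the compiler into the generated assembly, where it appears as a comment line.

```c
#include <stddef.h>

/* Hypothetical function; only the loop between the markers is selected. */
double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;

    __asm__("#.dco_start");     /* start of the selected portion */
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    __asm__("#.dco_end");       /* end of the selected portion */

    return sum;
}
```

Because the compiler is free to move or transform code around asm statements, it is worth inspecting the generated .s file to confirm that the instructions of interest actually lie between the two comment lines before running dco with -slct.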
dco expects high-quality optimized code as its input. Therefore it is necessary to use
appropriate compiler options to generate such code. To further facilitate the optimization, it is also recommended
to include the -fomit-frame-pointer and -fno-optimize-sibling-calls options.
The following are compiler options we used to evaluate dco:
-S -O2 -fomit-frame-pointer -fcf-protection=none -ffast-math -march=x86-64 -m64 -mfpmath=sse -msse2 -msse3 -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables -fno-optimize-sibling-calls -freorder-blocks-algorithm=simple
Use the following options to disable debugging information from being generated and/or code from being prepared for debugging:
-fno-asynchronous-unwind-tables
-fno-dwarf2-cfi-asm
-fomit-frame-pointer
-fcf-protection=none
It is recommended to use the following options to generate appropriate code:
-ffast-math
-march=x86-64
-m64
-mfpmath=sse
-msse2
-msse3
-fno-optimize-sibling-calls
It was observed that some gcc optimizations generate code of dubious quality and uncertain merit ( see the example below ).
It is strongly recommended to disable some optimizations, particularly those that affect data flow: use -freorder-blocks-algorithm=simple
or, better, -fno-reorder-blocks.
General optimizations should still be used, and it is strongly recommended to enable them with the -O2 compiler option.
Avoid using -O3 or higher.
The following example demonstrates the reason for that. C source:
if ( a[i] > 0.01 ) { ret = c[i]; }
Code generated with -O2:
comisd a(%rdx), %xmm1
jnb .L
movsd c(%rdx), %xmm0
.L:
Code generated with -O3:
movapd a(%rax), %xmm6
cmpltpd %xmm6, %xmm7
movapd %xmm6, %xmm3
movapd %xmm1, %xmm6
movapd %xmm9, %xmm10
cmplepd %xmm1, %xmm3
cmplepd %xmm1, %xmm10
movapd %xmm13, %xmm14
cmplepd %xmm1, %xmm14
pand %xmm7, %xmm8
pandn %xmm12, %xmm7
movdqa %xmm7, %xmm4
por %xmm8, %xmm4
andpd %xmm3, %xmm15
movdqa %xmm0, %xmm12
andnpd c-80(%rax), %xmm3
orpd %xmm3, %xmm15
andpd %xmm10, %xmm15
Although the code generated with -O3 uses packed data and eliminates the conditional jump,
it does not appear to be more efficient and is much more difficult to process than the code generated with -O2.
dco makes certain assumptions about the code it processes ( see Stack Alignment ).
Avoid using compiler options that may cause the generated code to violate these assumptions. In general, avoid esoteric and poorly understood
compiler options, such as:
-O0
-Os
-mpreferred-stack-boundary=
If necessary, however, do use options that disable code generation for unsupported extensions ( e.g. AVX extensions ); see this for clarification.
It was observed that in certain cases it is necessary to specify the -no-pie option when linking object files
created from dco-optimized code.