The following is a product proposal describing a port of our optimizer to a new processor (click here for a description of the existing adaptations of our optimization technology). Please contact us if you would like to further discuss porting our optimization technology to a processor of your choice.
Modern DSPs/CPUs require a tight interface between the compiler and the hardware architecture in order to achieve high utilization of the available resources. Compilers have not shown the ability to meet the needs of programmers on critical code. As architectures scale up, they are becoming so complex that human programmers cannot deal with the scheduling and tracking of so many registers and execution units. The result may be an architecture that cannot be programmed to apply most of its resources to real-life algorithms.
The goal of this project is to develop a Code Optimizer ( dco ) - an optimizing code system for a Digital Signal Processor or CPU ( referred to as the target ). This will be a software package specifically designed to optimize the target code by taking full advantage of the options and features provided by the target. Note that acquiring our technology will save you tremendous effort in developing an in-house solution, as well as ensure quick availability of the product.
The dco will be used to optimize code generated by a compiler. The programmer uses a compiler ( C/C++, FORTRAN, etc. ) to translate source code into the target's assembly code. This code is then used as input to dco. The output generated by dco will be highly optimized target assembly code that is logically identical to the original: dco will rearrange the existing code, performing multi-issue optimization, loop unrolling and vectorization, reassigning the available registers, etc. To create the final object file, the generated code should be assembled.
Note that dco will not require preprocessing or any other involvement from the user. It will be fully automated, and it will be possible to incorporate dco into makefiles or other product generation tools.
The use of dco will greatly improve the quality of the generated code. It, therefore, may prove to be a vital contribution to the production of a winning solution for your Digital Signal Processor(s) or CPU.
dco will be a software package that optimizes a target code by taking full advantage of the options and features provided by the target microprocessor. The following is a description of some of the optimization techniques that will be provided by the package.
By default dco will perform most of the optimizations that are available. It will be possible to enable or disable any number of the optimization techniques.
This paragraph describes the way dco will treat the input code.
dco will operate on basic blocks, although most optimizations and scheduling will be done across blocks. A basic block is a sequence of the target's non-branching instructions in which flow of control enters at the top. dco will treat as a basic block any sequence of instructions preceded and/or followed by a label or a branch instruction.
Optimization of the code will be done as follows:
perform data flow analysis
extract basic block
perform static optimization
build data flow graph
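The basic-block extraction step above can be sketched in C. This is a hypothetical, simplified illustration (the instruction record and function names are invented for this sketch; a real dco front end would parse the target's actual assembly syntax):

```c
#include <assert.h>

/* Hypothetical, simplified instruction record. */
typedef struct {
    const char *text;
    int is_label;   /* line defines a label             */
    int is_branch;  /* instruction may transfer control */
} Insn;

/* A basic block is a run of non-branching instructions: it starts
 * after a label or branch and ends before the next label or branch.
 * Returns 1 and sets [*first, *last] if a block is found after
 * position 'start', 0 if the code is exhausted. */
static int next_basic_block(const Insn *code, int n, int start,
                            int *first, int *last)
{
    int i = start;
    while (i < n && (code[i].is_label || code[i].is_branch))
        i++;                            /* skip block boundaries   */
    if (i >= n)
        return 0;                       /* no more blocks          */
    *first = i;
    while (i < n && !code[i].is_label && !code[i].is_branch)
        i++;                            /* extend to next boundary */
    *last = i - 1;
    return 1;
}
```

Calling this repeatedly, with each call starting just past the previous block, walks the whole source one basic block at a time.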
After scanning the input target assembly source, dco will perform comprehensive data flow analysis, calculating the resources needed and available for execution of the instructions of the code. It will then extract the basic block(s) to be optimized.
When a basic block is selected, static optimization will be performed and a dynamic memory map will be built. The code generated by the static optimization will then be used to build the data flow graph.
From the data flow graph, dco will perform substitutions and resource-dependency reduction and generate patterns of code that are organized into multi-instruction units. dco will attempt to generate the fastest assembly code that is logically identical to the input code.
At this stage it is difficult to come up with reasonable estimates of dco performance. However, this package would be based on the technology developed and implemented for Intel's i860 RISC processor ( dco860 ), DEC's Alpha AXP RISC processor ( ago ), Analog Devices' ADSP-2106x ( SHARC ) family of DSPs ( compactor ), Analog Devices' TS00x ( tigerSHARC ) DSP ( dco ), Freescale's StarCore DSP ( sco ), and recently for the x86 family of processors ( dco ). Modern DSPs/CPUs ( as powerful as they are ) contain little that has not been successfully implemented in the series of optimizers already built ( i860, Alpha and SHARC ). All of the supported CPUs provide multifunctional units that allow more than one instruction to be executed per cycle ( i860 - 2, Alpha and SHARC - 4 ); Alpha, SHARC and x86 support conditional instruction execution; the i860 supports data types spanning two registers; tigerSHARC and x86 support SIMD; etc.
The implemented code optimizers achieve significant code improvements on a variety of applications under different optimizing compilers, as summarized in the table. Based on this, there is little doubt about the potential of this technology to significantly improve code for your DSPs/CPUs.
The development of the product will be done using our resources. You will provide the target development package ( compiler, assembler, linker, documentation, etc. ) and a target platform and/or simulator on which to execute code.
The product is usually delivered within 6 months of the signing of the agreement.
Please contact us for more information.
This section contains some code samples that illustrate various optimization techniques supported by dco and shows code improvements achieved on different code patterns using various compilers for distinct processors.
Dynamic memory disambiguation is one of the most powerful optimizations of the inner loop body supported by dco. This technique allows memory conflicts in the code to be resolved at program execution time ( dynamically ). To achieve that, dco generates two versions of the code: one assuming that the memory conflicts are not resolved, and a second assuming that the memory conflicts are resolved ( which is usually much more efficient ). At run time, depending on the actual settings of the memory pointers, the appropriate version is executed.
As an example, consider the kernel of the linpack benchmark suite ( called daxpy ):
for ( i = 0; i < n; i++ )
    dy[i] = dy[i] + da*dx[i];
When compiled by the Alpha Axp compiler, the following code is generated ( fully optimized ):
Unrolling this code by 2 and performing dynamic memory disambiguation, dco produces the following result:
The unrolled loop:
has a memory conflict:
The generated code solves this conflict by producing one version of the loop under the assumption that the memory conflict does not exist ( the loop labeled .wlalpha_49 - note that all the memory reads ( ldt ) are performed before the memory writes ( stt ) ) and one version of the loop without such an assumption ( the loop labeled .wlalpha_61 - note that the order of the conflicting memory instructions is preserved: ldt $f1,8($18) follows stt $f11,0($20) ).
In most instances of the linpack benchmark execution, the code labeled .wlalpha_49: will be executed, bringing the performance improvement over the original code to 40%.
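At the source level, the two-version scheme can be sketched in C as follows. This is only an illustrative analogue ( dco itself generates the two versions in assembly ); the overlap test on the pointers is an assumption about how such a run-time check can be expressed:

```c
#include <assert.h>

/* daxpy with dynamic memory disambiguation, sketched in C.
 * At run time we test whether the dx and dy regions overlap; if
 * not, the "fast" version - whose loads may all be issued ahead
 * of the stores of earlier iterations - is safe to execute. */
static void daxpy(int n, double da, const double *dx, double *dy)
{
    /* Conflict-free iff the two n-element regions do not overlap. */
    if (dx + n <= dy || dy + n <= dx) {
        /* Fast version, unrolled by 2: both loads precede both
         * stores, mirroring the .wlalpha_49-style loop. */
        int i = 0;
        for (; i + 1 < n; i += 2) {
            double x0 = dx[i], x1 = dx[i + 1];  /* reads first  */
            dy[i]     += da * x0;               /* writes after */
            dy[i + 1] += da * x1;
        }
        for (; i < n; i++)
            dy[i] += da * dx[i];
    } else {
        /* Conservative version: the original load/store order of
         * each iteration is preserved. */
        for (int i = 0; i < n; i++)
            dy[i] += da * dx[i];
    }
}
```

With non-overlapping arrays the fast path runs; if dy aliases part of dx, the conservative path keeps the original semantics.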
Software pipelining is another powerful optimization of the inner loop body. It overlaps epilog code execution for the current loop iteration with prolog code execution for the next loop iteration. Essentially, two consecutive loop iterations are fitted into a block of the size of one loop iteration ( of size n ): m instructions are chosen from the bottom of the first iteration and combined with n - m instructions from the top of the second iteration. The resulting m + ( n - m ) = n instructions are optimized. This is done for all m from 1 to n-1, and the best resulting code is chosen. Of course, all the checks necessary to preserve the logic of the code are performed by dco on the resulting code.
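The overlap can be illustrated at the source level. In this hypothetical C sketch ( dco performs the transformation on assembly instructions, not C ), the load belonging to iteration i+1 is issued in the same loop body as the add belonging to iteration i, so a multi-issue target can execute the two in parallel:

```c
#include <assert.h>

/* Software-pipelined accumulation sketch: each trip through the
 * loop executes the "epilog" of iteration i (the add) together
 * with the "prolog" of iteration i+1 (the load). */
static double pipelined_sum(const double *a, int n)
{
    if (n <= 0)
        return 0.0;
    double sum = 0.0;
    double cur = a[0];              /* prolog: first load          */
    for (int i = 0; i < n - 1; i++) {
        double next = a[i + 1];     /* next iteration's load ...   */
        sum += cur;                 /* ... overlaps this add       */
        cur = next;
    }
    sum += cur;                     /* epilog: last add            */
    return sum;
}
```

The prolog and epilog outside the loop are exactly the instructions peeled off the first and last iterations to make the overlap possible.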
As an example, consider the kernel of the dot product
for ( i = 0; i < n; i++ )
    s += a[i] * b[i];
as generated by the SHARC compiler ( fully optimized ):
Execution of this loop takes 3 clocks/point.
Running vectorization on this code, dco produces the following result:
Its execution takes 2 clocks/point - a 33% improvement over the compiler-generated code. Better results may be achieved by combining vectorization with loop unrolling ( also supported by dco ).
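A C analogue of combining unrolling with vectorization is shown below ( hypothetical source-level sketch; dco applies the transformation to the target assembly ). Unrolling by 2 with two independent accumulators breaks the single dependency chain of the dot product, so a SIMD or multi-issue target can execute the two multiply-accumulate chains in parallel:

```c
#include <assert.h>

/* Dot product unrolled by 2 with two independent partial sums.
 * s0 and s1 carry no dependency on each other, so their
 * multiply-adds can issue in the same cycle on a parallel target. */
static double dot(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];          /* even-index chain */
        s1 += a[i + 1] * b[i + 1];  /* odd-index chain  */
    }
    if (i < n)                      /* leftover element for odd n */
        s0 += a[i] * b[i];
    return s0 + s1;
}
```

Higher unroll factors add more independent chains at the cost of code size - the trade-off the next section addresses.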
Often reducing the size of the code is as important as improving its execution time. dco provides support for code-size reduction.
The following code is generated by the tigerSHARC compiler ( fully optimized ) and represents a kernel of a DSP benchmark ( code size - 39 instructions ):
When instructed to reduce code size, dco generates the following code ( time - 25 clocks, size - 25 instructions - a 36% code-size reduction ):
When instructed to produce the fastest code, dco generates the following code ( time 23 clocks, size 33 instructions ):