The following is a product proposal describing a port of our optimizer to a new processor ( a description of the existing adaptations of our optimization technology is available upon request ). Please contact us if you would like to further discuss the possibilities of porting our optimization technology to a processor of your choice.

Table of Contents:

1.      Introduction
1.1.        Introduction
1.2.        General Overview
1.2.1.          Features and Description
2.      General Information
2.1.        Source of optimization
2.2.        dco Code Optimization
2.3.        Target Code Improvement
3.      Hardware/Software Requirements and Development Plan
3.1.        Hardware/Software Requirements
3.2.        Development Plan
4.      Code samples
4.1.        Dynamic Memory Disambiguation
4.2.        Software Pipelining
4.3.        Space optimization

1 Introduction

1.1 Introduction

Modern DSPs/CPUs require a tight interface between the compiler and the hardware architecture in order to achieve high utilization of the available resources. Compilers have not shown the ability to meet the needs of programmers on critical code. As architectures scale up, they are becoming so complex that human programmers cannot deal with the scheduling and tracking of so many registers and execution units. The result may be an architecture that cannot be programmed to apply most of its resources to real-life algorithms.

The goal of this project is to develop Code Optimizer ( dco ) - an optimizing code system for the Digital Signal Processor(s) or CPU ( referred to as the target ). This will be a software package specifically designed to optimize the target code by taking full advantage of the options and features provided by the target microprocessor.

Note that acquiring our technology will save you the tremendous effort of developing an in-house solution, as well as ensure quick availability of the product.

1.2 General Overview

The dco will be used to optimize code generated by a compiler. The programmer uses a compiler ( C/C++, FORTRAN, etc. ) to translate the code into the target's assembly code. This code is used as the input to dco. The output generated by dco will be highly optimized target assembly code that is logically identical to the original; dco will rearrange the existing code, performing multi-issue optimization, loop unrolling and vectorization, reassigning available registers, etc. To create the final object file, the generated code should be assembled.

Note that dco will not require preprocessing or any other involvement from the user. It will be fully automated, and it will be possible to incorporate dco into makefiles or other product generation tools.

The use of dco will greatly improve the quality of the generated code. It may therefore prove to be a vital contribution to the production of a winning solution for your Digital Signal Processor(s) or CPU.

1.2.1 Features and Description

  •  Takes full advantage of the target
    dco will optimize target machine code by taking full advantage of the options and features provided by the target microprocessor.
  •  Fully automated
    dco will not require preprocessing or any other user involvement. It will be fully automated and may be incorporated into makefiles or other product generation tools.
  •  Flexible
    dco will have many input options allowing the user to choose the right optimization techniques to be performed as well as the level of optimization to be applied.

2 General Information

2.1 Source of optimization

dco will be a software package that optimizes a target code by taking full advantage of the options and features provided by the target microprocessor.  The following is a description of some of the optimization techniques that will be provided by the package.

  • Static optimization: performs dead-code elimination, copy propagation, code compression and simplification, common subexpression elimination, use of algebraic identities, etc.
  • Memory optimization: performs static and dynamic memory aliasing, attempting to resolve dependencies caused by the memory accessing instructions.
  • Auto parallelization: utilizes the multiple execution units ( "cores" ) found on many modern processors; dco offers powerful auto-parallelization that is capable of identifying code patterns suitable for parallelization and creating optimized code that will be executed by all the available cores.
  • Loop unrolling: duplicates the body of a loop a specified number of times, creating a piece of code with more opportunities for optimization ( see the sketch below ).
  • Software pipelining: overlaps epilog code execution for the current loop iteration with prolog code execution for the next loop iteration.
  • Multi-issuing: attempts to utilize the Instruction Level Parallelism ( ILP ) of the target processor, leading to more than one instruction being executed per instruction cycle.
  • Art substitution: replaces parts of the code by logically equivalent code sequences. The generated code is then evaluated in order to find the optimal one.

By default dco will perform most of the optimizations that are available. It will be possible to enable or disable any number of the optimization techniques.
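
As a source-level illustration of one of these techniques, the following C sketch ( ours, not dco output; dco performs the equivalent transformation on the target assembly code ) shows a loop unrolled by a factor of 2:

/* Before unrolling: one multiply-add per iteration. */
void scale_add ( int n, float *a, const float *b, float c, float d )
{
    for ( int i = 0; i < n; i++ )
        a[i] = b[i] * c + d;
}

/* After unrolling by 2: each iteration now contains two independent
   multiply-adds, giving the scheduler more opportunities for
   multi-issuing; a final copy handles the remainder when n is odd. */
void scale_add_unrolled ( int n, float *a, const float *b, float c, float d )
{
    int i;
    for ( i = 0; i + 1 < n; i += 2 )
    {
        a[i]     = b[i]     * c + d;
        a[i + 1] = b[i + 1] * c + d;
    }
    if ( i < n )
        a[i] = b[i] * c + d;
}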

2.2 dco Code Optimization

This section describes the way dco will treat the input code.

dco will operate on basic blocks, although most optimizations and scheduling will be done across blocks. A basic block is a sequence of the target's non-branching instructions in which the flow of control enters at the top. dco will treat as a basic block any sequence of instructions preceded and/or followed by a label or a branch instruction.
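
For illustration, the following C fragment ( ours, with hypothetical names ) marks where such block boundaries would fall; in the target assembly each boundary corresponds to a label or a branch instruction:

int clamp_double ( int x )
{
    int y = x * 2;     /* block 1: flow of control enters at the top */
    if ( y > 100 )     /* ... and ends at this conditional branch    */
        y = 100;       /* block 2: the taken path                    */
    return y;          /* block 3: the join point ( a label in the   */
                       /* assembly ) starts a new basic block        */
}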

Optimization of the code will be done as follows:

    perform data flow analysis
    extract basic block
    perform static optimization
    build data flow graph
    perform art substitution
    perform resource reduction
    extract patterns for scheduler
    perform instruction multi-issuing

After scanning the input target assembly source, dco will perform a comprehensive data flow analysis, calculating the resources needed and available for execution of the code's instructions. It will then extract the basic block(s) to be optimized.

When the basic block is selected, static optimization will be performed and a dynamic memory map will be built. The code generated by the static optimization will then be used to build the data flow graph.

From the data flow graph, dco will perform art substitutions and resource dependency reduction and will generate patterns of code organized into multi-instruction units. dco will attempt to generate the fastest assembly code that is logically identical to the input code.

2.3 Target Code Improvement

At this stage it is difficult to come up with reasonable estimates of dco's performance. However, this package would be based on the technology developed and implemented for Intel's i860 RISC processor ( dco860 ), DEC's Alpha AXP RISC processor ( ago ), Analog Devices' ADSP-2106x ( SHARC ) family of DSPs ( compactor ), Analog Devices' TS00x ( tigerSHARC ) DSP ( dco ), Freescale's StarCore DSP ( sco ) and, recently, the x86 family of processors ( dco ). Modern DSPs/CPUs ( as powerful as they are ) contain little that has not already been successfully implemented in the series of optimizers already built ( i860, Alpha and SHARC ). All of the supported CPUs provide multifunctional units that allow more than one instruction to be executed per cycle ( i860 - 2, Alpha and SHARC - 4 ); Alpha, SHARC and x86 support conditional instruction execution; i860 supports data types contained across two registers; tigerSHARC and x86 support SIMD; etc.

All of the implemented code optimizers achieve significant code improvements on a variety of applications under different optimizing compilers, as summarized in the following table.
 

    dco860 ( i860 )          30 to 40% on Linpack, Fourier Transform, Convolution
    ago ( Alpha )            20 to 70% on Linpack, Fourier Transform, Convolution, Alvinn
    compactor ( SHARC )      20 to 60% on Fourier Transform, Convolution, Matrix Multiply, Filters
    dco ( tigerSHARC )       30 to 80% on Convolution, Matrix Multiply, Filters
    sco ( StarCore )         30 to 50% on Convolution, Matrix Multiply, Filters
    dco ( x86 )              up to 80% on numeric double precision applications

Note that all code improvements are over the performance of code maximally optimized by the corresponding compiler. More data is available upon request.

Based on this, there is little doubt about the potential of this technology to significantly improve code for your DSPs/CPUs.

3 Hardware/Software Requirements and Development Plan

3.1 Hardware/Software Requirements

The development of the product will be done using our resources. You will provide the target development package ( compiler, assembler, linker, documentation, etc. ) and a target platform and/or simulator on which to execute code.

3.2 Development Plan

The product is usually delivered within 6 months of the signing of the agreement.

Please contact us for more information.

4 Code samples

This section contains code samples that illustrate various optimization techniques supported by dco and show the code improvements achieved on different code patterns using various compilers for distinct processors.

Samples of code and discussion regarding the new optimization techniques utilized in our x86 optimizer are available upon request.

Samples of code and discussion regarding the optimization techniques utilized in our StarCore optimizer are also available upon request.

4.1 Dynamic Memory Disambiguation

Dynamic memory disambiguation is one of the most powerful optimizations of the inner loop body supported by dco. This technique allows memory conflicts in the code to be resolved at program execution time ( dynamically ). To achieve that, dco generates two versions of the code: one assuming that the memory conflicts are not resolved, and a second assuming that the memory conflicts are resolved ( which is usually much more efficient ). At run time, depending on the actual settings of the memory pointers, the appropriate version is executed.
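
The following C sketch ( ours, not dco output; dco performs this transformation on the assembly code ) shows the idea for the daxpy kernel used in the example below - a run-time overlap test dispatches between an aggressive version of the loop and a conservative one:

#include <stddef.h>
#include <stdint.h>

void daxpy ( size_t n, double da, const double *dx, double *dy )
{
    uintptr_t x = (uintptr_t)dx, y = (uintptr_t)dy;

    if ( x + n * sizeof *dx <= y || y + n * sizeof *dy <= x )
    {
        /* No conflict: unroll by 2 and issue both reads before the writes. */
        size_t i = 0;
        for ( ; i + 1 < n; i += 2 )
        {
            double a = dx[i], b = dx[i + 1];   /* all reads first */
            double c = dy[i], d = dy[i + 1];
            dy[i]     = c + da * a;            /* then the writes */
            dy[i + 1] = d + da * b;
        }
        if ( i < n )
            dy[i] = dy[i] + da * dx[i];
    }
    else
    {
        /* Possible conflict: keep the original read/write order. */
        for ( size_t i = 0; i < n; i++ )
            dy[i] = dy[i] + da * dx[i];
    }
}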

As an example, consider the kernel of the Linpack benchmark suite ( called daxpy ):

for ( i = 0; i < n; i++ )
 {
  dy[i] = dy[i] + da*dx[i];
 }

When compiled by the Alpha AXP compiler, the following code is generated ( fully optimized ):

$43:
      ldt $f1,0($18)
      mult $f17,$f1,$f1
      ldt $f10,0($20)
      addt $f10,$f1,$f10
      addl $2,1,$2
      addq $18,8,$18
      cmplt $2,$16,$1
      stt $f10,0($20)
      addq $20,8,$20
      bne $1,$43
$41:

Unrolling this code by 2 and performing dynamic memory disambiguation, dco produces the following result:

$43:
      lda      $1, 1
      addl      $2, $1, $1
      subl      $1, $16, $1
      blt      $1, .lualpha_41
 .align 4
.lualpha_42:
      ldt      $f1, 0($18)
      mult      $f17, $f1, $f1
      ldt      $f10, 0($20)
      addt      $f10, $f1, $f10
      addl      $2, 1, $2
      addq      $18, 8, $18
      cmplt      $2, $16, $1
      stt      $f10, 0($20)
      addq      $20, 8, $20
      bne      $1, .lualpha_42
      br      $31, .lualpha_45
.lualpha_41:
      lda      $1, 15
      addq      $18, $1, $1
      subq      $1, $20, $1
      bge      $1, .mcalpha_48
 .align 4
.wlalpha_49:
      ldt      $f12, 0($18)
      ldt      $f1, 8($18)
      mult      $f17, $f12, $f12
      mult      $f17, $f1, $f1
      ldt      $f11, 0($20)
      ldt      $f10, 8($20)
      addt      $f11, $f12, $f11
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f11, 0($20)
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_49
      br      $31, .mlalpha_54
.mcalpha_48:
      lda      $1, 1
      addq      $18, $1, $1
      subq      $1, $20, $1
      ble      $1, .mcalpha_55
 .align 4
.wlalpha_56:
      ldt      $f12, 0($18)
      ldt      $f1, 8($18)
      mult      $f17, $f12, $f12
      mult      $f17, $f1, $f1
      ldt      $f11, 0($20)
      ldt      $f10, 8($20)
      addt      $f11, $f12, $f11
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f11, 0($20)
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_56
      br      $31, .mlalpha_54
.mcalpha_55:
 .align 4
.wlalpha_61:
      ldt      $f12, 0($18)
      mult      $f17, $f12, $f12
      ldt      $f11, 0($20)
      addt      $f11, $f12, $f11
      stt      $f11, 0($20)
      ldt      $f1, 8($18)
      mult      $f17, $f1, $f1
      ldt      $f10, 8($20)
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_61
      br      $31, .mlalpha_54
.mlalpha_54:
      addl      $2, $31, $1
      subl      $1, $16, $1
      blt      $1, .lualpha_42
.lualpha_45:
$41: 

The unrolled loop:

.wlalpha_61:
      ldt      $f12, 0($18)
      mult      $f17, $f12, $f12
      ldt      $f11, 0($20)
      addt      $f11, $f12, $f11
      stt      $f11, 0($20)
      ldt      $f1, 8($18)
      mult      $f17, $f1, $f1
      ldt      $f10, 8($20)
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_61

has a memory conflict:

    stt      $f11, 0($20)
    ldt      $f1, 8($18)

The generated code resolves this conflict by producing one version of the loop under the assumption that the memory conflict does not exist ( the loop labeled .wlalpha_49 - note that all the memory reads ( ldt ) are performed before the memory writes ( stt ) ) and one version without this assumption ( the loop labeled .wlalpha_61 - note that the order of the conflicting memory instructions is preserved: ldt $f1,8($18) follows stt $f11,0($20) ).

In most instances of the Linpack benchmark execution, the code labeled .wlalpha_49 will be executed, bringing the performance improvement over the original code to 40%.

4.2 Software Pipelining

Software pipelining is another powerful optimization of the inner loop body. It overlaps epilog code execution for the current loop iteration with prolog code execution for the next loop iteration. Essentially, two consecutive loop iterations are fitted into a block the size of one loop iteration ( of size n ): m instructions are chosen from the bottom of the first iteration and combined with n - m instructions from the top of the second iteration. The resulting m + ( n - m ) = n instructions are optimized. This is done for all m from 1 to n - 1, and the best resulting code is chosen. Of course, dco performs all the checks necessary to preserve the logic of the resulting code.
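
The following C sketch ( ours, not dco output; dco performs this transformation on the assembly code ) shows the idea for the dot product used in the example below - the multiply of iteration i is overlapped with the accumulation of iteration i - 1:

#include <stddef.h>

double dot ( size_t n, const double *x, const double *y )
{
    double s = 0.0;
    if ( n == 0 )
        return s;

    double p = x[0] * y[0];             /* prolog: first multiply       */
    for ( size_t i = 1; i < n; i++ )
    {
        double next = x[i] * y[i];      /* multiply for iteration i     */
        s += p;                         /* accumulate for iteration i-1 */
        p = next;
    }
    s += p;                             /* epilog: last accumulation    */
    return s;
}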

As an example, consider the kernel of the dot product

for ( i = 0; i < n; i++ )
  {
    s += x[i]*y[i];
  }

as generated by the SHARC compiler ( fully optimized ):

_L$5:
        lcntr = r1,  do _L$3-1 until lce;
        f4=dm(i4,m6),f2=pm(i12,m14);
        f8=f2*f4;
        f12=f12+f8;
_L$6:
_L$3:

Execution of this loop takes 3 clocks/point.

Running vectorization on this code, dco produces the following result:

        r0=r1-1;
        if gt jump (pc,__vr$dco_29);
        f12=f8+f12,  f4=dm(i4,m6),  f2=PM(i12,m14);
        f8=f2*f4;
        f12=f8+f12;
        jump (PC,__vr$dco_30);
__vr$dco_29:
        f4=dm(i4,m6);
        f2=PM(i12,m14);
        f12=f8+f12;
        f8=f2*f4;
__vr$dco_31:
        lcntr=r0, do __vr$dco_32-1 until lce ;
        f12=f8+f12,  f4=dm(i4,m6),  f2=PM(i12,m14);
        f8=f2*f4,  f12=f8+f12;
__vr$dco_32:
        f12=f8+f12;
__vr$dco_30:
__dole$dco_0:
_L$3:

Optimized loop:

__vr$dco_31:
        lcntr=r0, do __vr$dco_32-1 until lce ;
        f12=f8+f12,  f4=dm(i4,m6),  f2=PM(i12,m14);
        f8=f2*f4,  f12=f8+f12;
__vr$dco_32:

Execution of this loop takes 2 clocks/point - a 33% improvement over the compiler-generated code. Better results may be achieved by combining vectorization with loop unrolling ( also supported by dco ).

4.3 Space optimization

Often reducing the size of the code is as important as improving its execution time. dco provides support for code-size reduction.
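
At the source level, the classical way to trade a little speed for space is to reroll straight-line code into a loop, as in the hypothetical C sketch below; dco achieves its reduction on the assembly code itself, for example by packing operations onto both compute blocks, as in the tigerSHARC sample that follows:

/* Straight-line version: four multiply-store statements in the body. */
void scale4 ( float *out, const float *in, float k )
{
    out[0] = in[0] * k;
    out[1] = in[1] * k;
    out[2] = in[2] * k;
    out[3] = in[3] * k;
}

/* Rerolled version: the same work expressed as a single loop body,
   occupying a fraction of the code space. */
void scale4_small ( float *out, const float *in, float k )
{
    for ( int i = 0; i < 4; i++ )
        out[i] = in[i] * k;
}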

The following code is generated by the tigerSHARC compiler ( fully optimized ) and represents a kernel of a DSP benchmark ( code size - 39 instructions ):

_L$750011:
        XR15 = [J26 + 60];;
        XR0 = [J26 + 72];;
        l[J26 + 58] = XR5:4;;
        XR21 = [J26 + 74];;
        XR12 = [J26 + 58];;
        XR1 = PASS R21; XR21 = ASHIFT R21 BY -16; l[J26 + 60] = XR7:6;;
        XR22 = [J26 + 61]; XR17 = PASS R12; XR12 = ASHIFT R12 BY -16;;
        XR13 = [J26 + 60]; XR1 = LSHIFT R1 BY 16;;
        XR2 = XR22; XR17 = LSHIFT R17 BY 16;;
        XR18 = PASS R13; XR22 = ASHIFT R22 BY -16; l[J26 + 72] = XR5:4;;
        XR13 = ASHIFT R13 BY -16; l[J26 + 74] = XR7:6;;
        XR23 = R21 * R22 (I); XR2 = LSHIFT R2 BY 16;;
        XR14 = R12 * R13 (I); XR18 = LSHIFT R18 BY 16;;
        XR0 += FDEP R23 BY R16;;
        XR15 += FDEP R14 BY R16;;
        XR1 = ASHIFT R1 BY -16; [J26 + 63] = XR0;;
        XR2 = ASHIFT R2 BY -16; [J26 + 62] = XR15;;
        XR17 = ASHIFT R17 BY -16;;
        XR18 = ASHIFT R18 BY -16; XR3 = R1 * R2 (I);;
        XR19 = R17 * R18 (I);;
        XR0 += FDEP R3 BY R20;;
        XR15 += FDEP R19 BY R20; [J1 += 1] = XR0;;
        [J1 += 1] = XR15;;
        if nlc0e, jump _L$750011 ;;

When instructed to reduce code size, dco generates the following code ( time - 25 clocks, size - 25 instructions: a 36% code-size reduction ):

        xR8=pass R20; yR16=xR16;;
        yR8=xR20;;
_L$750011:
        yR15=[J26 + 72];;
        xR15=[J26 + 60];;
        L[J26 + 58]=xR5:4;;
        yR12=[J26 + 74];;
        xR12=[J26 + 58];;
        L[J26 + 60]=xR7:6;;
        yR13=[J26 + 61]; xyR10= ashift R12 by -16 ;;
        xR13=[J26 + 60]; xyR17= lshift R12 by 16 ;;
        xyR0= ashift R13 by -16 ; L[J26 + 72]=xR5:4;;
        xyR18= lshift R13 by 16 ; L[J26 + 74]=xR7:6;;
        xyR14=  R10 * R0 (I); xyR2= ashift R17 by -16 ;;
        xyR0= ashift R18 by -16 ;;
        xyR15+= fdep R14 by R16 ;;
        xyR19=  R2 * R0 (I); [J26 + 63]=yR15;;
        [J26 + 62]=xR15;;
        xyR15+= fdep R19 by R8 ;;
        [J1 += 1]=yR15;;
        if nLC0E, jump _L$750011; [J1 += 1]=xR15;;

When instructed to produce the fastest code, dco generates the following code ( time - 23 clocks, size - 33 instructions ):

        yR7=xR16;;
        yR6=xR20;;
_L$750011:
        yR4=[J26 + 72];;
        xR15=[J26 + 60];;
        L[J26 + 58]=xR5:4;;
        yR12=[J26 + 74];;
        xR12=[J26 + 58];;
        L[J26 + 60]=xR7:6;;
        yR3=[J26 + 61]; xyR17= lshift R12 by 16 ;;
        xR13=[J26 + 60]; xyR2= ashift R12 by -16 ;;
        yR1= ashift R3 by -16 ; xR17= ashift R17 by -16 ; L[J26 + 72]=xR5:4;;
        yR5= lshift R3 by 16 ; xR8= ashift R13 by -16 ; L[J26 + 74]=xR7:6;;
        xR18= lshift R13 by 16 ; yR3=  R2 * R1 (I); yR1= ashift R17 by -16 ;;
        yR2= ashift R5 by -16 ; xR14=  R2 * R8 (I);;
        xR18= ashift R18 by -16 ; yR4+= fdep R3 by R7 ;;
        yR0=  R1 * R2 (I); xR15+= fdep R14 by R16 ; [J26 + 63]=yR4;;
        xR19=  R17 * R18 (I); [J26 + 62]=xR15;;
        yR4+= fdep R0 by R6 ;;
        xR15+= fdep R19 by R20 ; [J1 += 1]=yR4;;
        if nLC0E, jump _L$750011; [J1 += 1]=xR15;;