Optimizing Livermore loops

This page contains dco's optimization results for the Livermore loops benchmark, obtained by optimizing code generated by gcc version 4.2.2 on x86-64 and IA-32 systems. See this for the optimization results of the previous version of dco (version 1.0.1) on the same benchmark, for code generated by gcc version 4.1.2.

Preparing the benchmarks

We used the C version of the Livermore loops benchmark. The code was modified to eliminate calibration, ensuring that every run executes the same number of iterations on the same input data. This makes it possible to compare the execution times of the program (rather than the estimated MFlops figure reported by the original implementation).

For every kernel of the benchmark, dco was invoked twice: first without any options (default mode) and then with the -no-packing option; the better of the two execution times is reported. Note that the x86 assembly source that dco optimized is the one generated by gcc.

Read this to understand how the benchmarks were executed and code optimization results were calculated.

Timing of the Livermore loops kernels and Results of optimization

The following tables present the execution data collected while benchmarking the Livermore loops.

The two columns under the gcc and gcc+dco headers present the execution times (in seconds) achieved by the compiler-generated code and the dco-optimized code, respectively. The column under the gcc+dco/gcc header lists the improvement achieved by dco over the compiler-generated code, that is, the reduction in execution time relative to the gcc time. For example, the compiler-generated code executed kernel #1 in 3.06 seconds; after optimization by dco the resulting code ran for 2.22 seconds, which is a 27.45% improvement.
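The arithmetic behind the improvement column can be reproduced with a minimal sketch. The function name `improvement_percent` is hypothetical and not part of dco or the benchmark. The formula is inferred from the kernel #1 example and is consistent with the other rows, including the negative ones.

```python
def improvement_percent(t_gcc: float, t_dco: float) -> float:
    """Percentage reduction in execution time relative to the gcc time.

    A negative result means the dco-optimized code ran slower.
    """
    if t_gcc <= 0:
        raise ValueError("baseline time must be positive")
    return (t_gcc - t_dco) / t_gcc * 100.0


# Kernel #1, x86-64 table: 3.06 s with gcc, 2.22 s with gcc+dco.
print(f"{improvement_percent(3.06, 2.22):.2f}%")  # prints 27.45%
```

A slowdown shows up as a negative value: kernel #12 in the x86-64 table (2.99 s versus 3 s) yields about -0.33%.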

Results for x86-64 64-bit code

The following are the optimization results achieved on a 64-bit Linux operating system running on a 2.66GHz Core2 computer.

The gcc version 4.2.2 compiler, used to process the benchmarks, was invoked with the following command line options:

-S -O3 -fomit-frame-pointer -funroll-all-loops -ffast-math -march=nocona -mfpmath=sse -msse3

dco version 1.1.0 was used to optimize the compiler-generated code.

Kernel#	gcc 4.2.2	gcc+dco	gcc+dco/gcc
1	3.06	2.22	27.45%
2	2.21	2.01	9.05%
3	3.82	1.46	61.78%
4	3.15	2.04	35.24%
5	2.52	1.45	42.46%
6	5.78	2.66	53.98%
7	2.6	1.71	34.23%
8	1.72	1.41	18.02%
9	2.21	1.89	14.48%
10	1.8	1.41	21.67%
11	1.89	0.64	66.14%
12	2.99	3	-0.33%
13	1.46	1.47	-0.68%
14	1.53	1.46	4.58%
15	1.98	1.99	-0.51%
16	3.11	2.91	6.43%
17	2.8	2.5	10.71%
18	2.8	2.38	15%
19	3.9	3.21	17.69%
20	2.78	2.68	3.6%
21	2.2	1.88	14.55%
22	2.58	2.52	2.33%
23	2.18	2.17	0.46%
24	2.05	0.63	69.27%
Geometric Mean	2.5	1.85	25.96%

On average, dco achieved an improvement of 26% over the 64-bit code generated and optimized by gcc version 4.2.2.
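The Geometric Mean row of the table above can be cross-checked with a short sketch. It assumes (the page does not say so explicitly) that each time column is averaged with a geometric mean and that the summary percentage is then derived from the two means rather than from the per-kernel percentages. The helper name `geometric_mean` is illustrative.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of logarithms (all values must be > 0)."""
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))


# Execution times in seconds, copied from the x86-64 table (kernels 1..24).
gcc_times = [3.06, 2.21, 3.82, 3.15, 2.52, 5.78, 2.6, 1.72, 2.21, 1.8, 1.89, 2.99,
             1.46, 1.53, 1.98, 3.11, 2.8, 2.8, 3.9, 2.78, 2.2, 2.58, 2.18, 2.05]
dco_times = [2.22, 2.01, 1.46, 2.04, 1.45, 2.66, 1.71, 1.41, 1.89, 1.41, 0.64, 3.0,
             1.47, 1.46, 1.99, 2.91, 2.5, 2.38, 3.21, 2.68, 1.88, 2.52, 2.17, 0.63]

gm_gcc = geometric_mean(gcc_times)
gm_dco = geometric_mean(dco_times)

# Assumption: the summary improvement is computed from the two means.
summary = (gm_gcc - gm_dco) / gm_gcc * 100.0
print(f"gcc: {gm_gcc:.2f} s, gcc+dco: {gm_dco:.2f} s, improvement: {summary:.2f}%")
```

Under these assumptions the results should land close to the 2.5, 1.85 and 25.96% reported in the table. A geometric mean is a reasonable choice here because it keeps a single slow kernel, such as kernel #6, from dominating the summary.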

Results for IA-32 32-bit code

The following are the optimization results achieved on a 32-bit Linux operating system running on a 2.8GHz Pentium4 computer.

The gcc version 4.2.2 compiler, used to process the benchmarks, was invoked with the following command line options:

-S -O3 -fomit-frame-pointer -funroll-all-loops -ffast-math -march=pentium4 -mfpmath=sse -msse2

dco version 1.1.1 was used to optimize the compiler-generated code. Note that dco's -32 command line option was used during optimization.

Kernel#	gcc 4.2.2	gcc+dco	gcc+dco/gcc
1	5.03	3.24	35.59%
2	2.46	2.36	4.07%
3	4.99	2.49	50.1%
4	5.04	3.84	23.81%
5	5.33	1.76	66.98%
6	16.24	5.06	68.84%
7	5.16	4.16	19.38%
8	3.87	3.91	-1.03%
9	4.95	4.01	18.99%
10	4.94	3.38	31.58%
11	4.93	0.85	82.76%
12	5.02	5.2	-3.59%
13	4.63	4.66	-0.65%
14	4.41	4.23	4.08%
15	5.44	4.47	17.83%
16	4.86	4.52	7%
17	4.87	4.17	14.37%
18	4.57	3.61	21.01%
19	5.82	4.1	29.55%
20	4.53	4.43	2.21%
21	7.16	8.85	-23.6%
22	4.79	4.8	-0.21%
23	3.67	2.82	23.16%
24	4.85	0.84	82.68%
Geometric Mean	5.01	3.43	31.59%

On average, dco achieved an improvement of 32% over the 32-bit code generated and optimized by gcc version 4.2.2.