Project – Improving lrzip; Phase One

So now frost slowly thaws and the winds of blooming spring carry to us tidings of final projects. Our project for the SPO600 course centers on the improvement of a selected open-source checksum or compression package. I chose lrzip, and I hope to alter the code in such a way that we gain a non-negligible improvement in execution time. That, at least, is what I will be aiming for.

So, what even is lrzip?

lrzip – Long Range ZIP or LZMA RZIP

lrzip is a compression program. It’s optimized for compressing large files on systems with large amounts of RAM.

lrzip operates in two basic steps, and on a single file.

First, it uses an extended version of rzip to perform a long-distance redundancy reduction (i.e., it encodes duplicate or redundant chunks of data even when they are far apart). This is done with an LZ77-based algorithm: a sliding-window technique keeps data in memory in order to find matches, and larger amounts of RAM allow the use of a larger window, which leads to better compression. Where rzip is limited to a maximum window size of 900MB, lrzip’s window scales with available RAM. Duplicate data is replaced with a reference giving the distance to the original data and the size of the original chunk.

A visual representation of the LZ77 algorithm:
[Figure: LZ77 sliding-window matching]
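
To make the idea concrete, here is a toy LZ77-style matcher in C (my own illustrative sketch; the real rzip stage uses hashing and enormous windows rather than this brute-force search):

#include <stdio.h>
#include <string.h>

#define WINDOW 4096     /* look-back distance; rzip/lrzip scale theirs with RAM */
#define MIN_MATCH 3     /* matches shorter than this aren't worth a reference */

void lz77_demo(const unsigned char *buf, size_t len) {
        size_t i = 0;
        while (i < len) {
                size_t best_len = 0, best_dist = 0;
                size_t start = i > WINDOW ? i - WINDOW : 0;
                size_t j, k;
                /* search the window for the longest match at position i */
                for (j = start; j < i; j++) {
                        k = 0;
                        while (i + k < len && buf[j + k] == buf[i + k])
                                k++;
                        if (k > best_len) {
                                best_len = k;
                                best_dist = i - j;
                        }
                }
                if (best_len >= MIN_MATCH) {
                        /* duplicate data: emit a (distance, length) reference */
                        printf("(dist=%zu, len=%zu) ", best_dist, best_len);
                        i += best_len;
                } else {
                        /* no useful match: emit the literal byte */
                        printf("'%c' ", buf[i]);
                        i++;
                }
        }
        printf("\n");
}

int main() {
        const char *s = "abcabcabcabc";
        lz77_demo((const unsigned char *)s, strlen(s));
        return 0;
}

For the input abcabcabcabc this prints three literals followed by a single (dist=3, len=9) reference; note that the match length may exceed the distance, which is how long runs of identical data collapse into one reference.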

Second, after the initial reduction, lrzip compresses the file further by running one of LZMA (the default), ZPAQ, LZO, GZIP, or BZIP2 on the reduced data file.

So, I built the package on my x86_64 computer, which required the installation of a few dependencies but otherwise went fine, and on our AArch64 aarchie server, which again needed some dependencies (though some were already installed). The program ran fine on both systems.

Before we begin tinkering with the code, let’s run some benchmarks on aarchie. We’ll use three files. The first is a text file containing 500MB of text downloaded from Project Gutenberg. The second is a 500MB file composed entirely of random data. The third is a 500MB file composed entirely of zero data.
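
In case anyone wants to reproduce the setup, something like the following generates the random and zero files (a hypothetical sketch: any method of producing 500MB of pseudorandom bytes or zero bytes will do, and the glibc rand() stream used here is effectively incompressible for our purposes; the text file, of course, has to come from real Project Gutenberg data):

#include <stdio.h>
#include <stdlib.h>

#define MB (1024 * 1024)

void write_file(const char *name, int random_data) {
        static unsigned char buf[MB];
        FILE *f = fopen(name, "wb");
        int m;
        size_t i;
        if (!f) {
                perror(name);
                exit(1);
        }
        for (m = 0; m < 500; m++) {     /* 500 x 1MB = 500MB */
                for (i = 0; i < sizeof buf; i++)
                        buf[i] = random_data ? (unsigned char)rand() : 0;
                fwrite(buf, 1, sizeof buf, f);
        }
        fclose(f);
}

int main() {
        write_file("zero", 0);          /* 500MB of zero bytes */
        write_file("random", 1);        /* 500MB of pseudorandom bytes */
        return 0;
}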

[atopala@aarchie lrzip-0.621]$ time ./lrzip text
Output filename is: text.lrz
text - Compression Ratio: 3.918. Average Compression Speed:  1.217MB/s.
Total time: 00:06:50.91

real	6m50.924s
user	49m2.940s
sys		0m9.290s
[atopala@aarchie lrzip-0.621]$ time ./lrzip random 
Output filename is: random.lrz
random - Compression Ratio: 1.000. Average Compression Speed:  6.329MB/s.
Total time: 00:01:18.78

real	1m18.796s
user	1m34.520s
sys		0m2.960s
[atopala@aarchie lrzip-0.621]$ time ./lrzip zero
Output filename is: zero.lrz
zero - Compression Ratio: 7003.299. Average Compression Speed:  6.024MB/s.
Total time: 00:01:22.92

real	1m22.933s
user	5m17.620s
sys		0m2.490s

The text file took much longer than the others, and compressing it used 23.9% of the memory on aarchie. Compressing the random file used only about 7% of the memory. Compressing the zero file, which has far more duplicate data, used much more memory, peaking at about 34.5%. The text file also consumed far more user time relative to real time than the other files did, so its compression was taking greater advantage of parallel processing; the zero file, too, was compressed using multiple threads. In every case the vast majority of the time was spent in user mode; time spent in the kernel was negligible.

So, the goal is to beat these benchmarks. Maybe a way to recognize incompressible data could speed up the processing of the random file; maybe a method of quickly detecting extremely long runs of identical data could help with the zero file. The important file, though, is the text file, and ideally we would improve the runtime of its compression. Setting the lzma level to 9 (the default is 7) brings the compression ratio up to 4.029 (a small but not insignificant improvement) but takes twice as long; maybe that could be sped up.

So, apart from improving the algorithm itself (which, I think, is very unlikely: lrzip has been receiving regular updates as recently as last year, so, barring potential oversights, the algorithm should be pretty close to optimal) and changing build options (which might not prove fruitful, since autogen sets the default optimization level to -O2 and further optimization might have a negative impact; still, this is probably a good area to work in, because I don’t think lrzip has been built with AArch64 in mind), these are the potential areas of improvement:

- lrzip uses x86 and x86_64 assembly language files that optimize CRC computation. Maybe we could rewrite these for AArch64. As of now, “if you don’t have the nasm assembler or have a ppc or other non-x86 system, the standard C CRC routines will be compiled and linked in” (a sketch of the generic table-driven C technique follows this list). The TODO file suggests that the x86_64 version of the assembly code doesn’t work, so that might need fixing. However, the TODO file also suggests that this is of low priority.
- the program crashes with the -z flag (ZPAQ compression) on aarchie, reporting “Illegal instruction” after starting the first thread to compress a piece of data. This occurs on betty as well, but not on x86_64. I am not yet sure of the cause, or whether it’s something that happens on AArch64 systems in general, but if it is a bug on AArch64 then I think this should become my main focus.
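
For reference, here’s a minimal sketch of the kind of table-driven loop that generic C CRC routines use (I’m assuming the common reflected CRC-32 polynomial 0xEDB88320; lrzip’s actual fallback differs in its details, so treat this as an illustration of the technique, not its code):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* precompute the CRC of every possible byte value */
void crc32_init(void) {
        uint32_t i, c;
        int k;
        for (i = 0; i < 256; i++) {
                c = i;
                for (k = 0; k < 8; k++)
                        c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
                crc_table[i] = c;
        }
}

/* fold each input byte into the running CRC via one table lookup */
uint32_t crc32_update(uint32_t crc, const void *buf, size_t len) {
        const uint8_t *p = buf;
        crc = ~crc;
        while (len--)
                crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
        return ~crc;
}

int main() {
        const char msg[] = "hello, world";
        crc32_init();
        printf("crc32 = 0x%08x\n", crc32_update(0, msg, sizeof msg - 1));
        return 0;
}

The x86 assembly versions speed up exactly this inner loop, which is why rewriting it for AArch64 is one of the candidate improvements.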

I’ll begin with the ZPAQ bug (or what I think might be a bug, at least), and if that doesn’t pan out then I’ll try to edit some of the build options and code to optimize it for AArch64 (since I figure that there may be other people working on lrzip, but most might not have access to an AArch64 machine). If that doesn’t work out either, I’ll try to optimize the C code, and if I can’t do that I’ll work on the (low-priority) assembly language files.

GCC Compiler Options: -funsafe-math-optimizations and -ftracer

GCC contains several flags that can be set in order to guide the optimization of a file during compilation.

Let’s look at two of them:

-funsafe-math-optimizations

The gcc manual says that this option “allows optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards.”

Essentially, this means that the compiler will take certain shortcuts when calculating the results of floating-point operations, which may result in rounding errors or, potentially, in the program malfunctioning (hence the “unsafe” in the name).

-funsafe-math-optimizations enables the compiler options -fno-signed-zeros (which lets the compiler ignore the sign of zeros), -fno-trapping-math (which assumes the code won’t generate user-visible traps such as division by zero), -fassociative-math (which allows re-association of operands in floating-point expressions), and -freciprocal-math (which allows division to be replaced with multiplication by the reciprocal).

To test this, we’ll use a simple program that adds some fractions together lots of times and prints the result:

#include <stdio.h>
#include <stdlib.h>

int main() {
        double sum = 0.0;
        double d = 0.0;

        int i;

        for (i=1; i<=10000000; i++) {
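                /* each division here is done in single precision because of the casts */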
                d = (float)i/3 + (float)i/7;
                sum += d;
        }

        printf("%.4f\n", sum);

        return 0;
}

We’ll compile it with the -funsafe-math-optimizations flag and without:

[atopala@aarchie presentation]$ gcc -O0 test.c -o test
[atopala@aarchie presentation]$ gcc -O0 -funsafe-math-optimizations test.c -o test-math

I used the time command to compare the execution times of each compiled binary. Here is a screenshot of the results:

[Screenshot: time output for test and test-math]

The binary that had been compiled with the -funsafe-math-optimizations flag was much faster, but its result was off by quite a bit.

Let’s use objdump -d to examine the disassembly of each binary file.

[atopala@aarchie presentation]$ objdump -d test
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       580006c0        ldr     x0, 4006b0 <main+0xe0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000680        ldr     x0, 4006b0 <main+0xe0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000023        b       40067c <main+0xac>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180005c0        ldr     w0, 4006b8 <main+0xe8>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e201821        fdiv    s1, s1, s0
  400610:       1e260021        fmov    w1, s1
  400614:       b94027a0        ldr     w0, [x29,#36]
  400618:       1e220001        scvtf   s1, w0
  40061c:       1e260022        fmov    w2, s1
  400620:       180004e0        ldr     w0, 4006bc <main+0xec>
  400624:       1e270040        fmov    s0, w2
  400628:       1e270001        fmov    s1, w0
  40062c:       1e211800        fdiv    s0, s0, s1
  400630:       1e260000        fmov    w0, s0
  400634:       1e270020        fmov    s0, w1
  400638:       1e270001        fmov    s1, w0
  40063c:       1e212800        fadd    s0, s0, s1
  400640:       1e260000        fmov    w0, s0
  400644:       1e270000        fmov    s0, w0
  400648:       1e22c000        fcvt    d0, s0
  40064c:       9e660000        fmov    x0, d0
  400650:       f9000fa0        str     x0, [x29,#24]
  400654:       f94017a1        ldr     x1, [x29,#40]
  400658:       f9400fa0        ldr     x0, [x29,#24]
  40065c:       9e670021        fmov    d1, x1
  400660:       9e670000        fmov    d0, x0
  400664:       1e602821        fadd    d1, d1, d0
  400668:       9e660020        fmov    x0, d1
  40066c:       f90017a0        str     x0, [x29,#40]
  400670:       b94027a0        ldr     w0, [x29,#36]
  400674:       11000400        add     w0, w0, #0x1
  400678:       b90027a0        str     w0, [x29,#36]
  40067c:       b94027a1        ldr     w1, [x29,#36]
  400680:       5292d000        mov     w0, #0x9680                     // #38528
  400684:       72a01300        movk    w0, #0x98, lsl #16
  400688:       6b00003f        cmp     w1, w0
  40068c:       54fffb4d        b.le    4005f4 <main+0x24>
  400690:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400694:       911d8000        add     x0, x0, #0x760
  400698:       fd4017a0        ldr     d0, [x29,#40]
  40069c:       97ffff71        bl      400460 <printf@plt>
  4006a0:       52800000        mov     w0, #0x0                        // #0
  4006a4:       a8c37bfd        ldp     x29, x30, [sp],#48
  4006a8:       d65f03c0        ret
  4006ac:       d503201f        nop
        ...
  4006b8:       40400000        .word   0x40400000
  4006bc:       40e00000        .word   0x40e00000
[atopala@aarchie presentation]$ objdump -d test-math
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       58000540        ldr     x0, 400680 <main+0xb0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000500        ldr     x0, 400680 <main+0xb0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000017        b       40064c <main+0x7c>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180003e0        ldr     w0, 40067c <main+0xac>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e200821        fmul    s1, s1, s0
  400610:       1e260020        fmov    w0, s1
  400614:       1e270001        fmov    s1, w0
  400618:       1e22c021        fcvt    d1, s1
  40061c:       9e660020        fmov    x0, d1
  400620:       f9000fa0        str     x0, [x29,#24]
  400624:       f94017a1        ldr     x1, [x29,#40]
  400628:       f9400fa0        ldr     x0, [x29,#24]
  40062c:       9e670020        fmov    d0, x1
  400630:       9e670001        fmov    d1, x0
  400634:       1e612800        fadd    d0, d0, d1
  400638:       9e660000        fmov    x0, d0
  40063c:       f90017a0        str     x0, [x29,#40]
  400640:       b94027a0        ldr     w0, [x29,#36]
  400644:       11000400        add     w0, w0, #0x1
  400648:       b90027a0        str     w0, [x29,#36]
  40064c:       b94027a1        ldr     w1, [x29,#36]
  400650:       5292d000        mov     w0, #0x9680                     // #38528
  400654:       72a01300        movk    w0, #0x98, lsl #16
  400658:       6b00003f        cmp     w1, w0
  40065c:       54fffccd        b.le    4005f4 <main+0x24>
  400660:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400664:       911ca000        add     x0, x0, #0x728
  400668:       fd4017a0        ldr     d0, [x29,#40]
  40066c:       97ffff7d        bl      400460 <printf@plt>
  400670:       52800000        mov     w0, #0x0                        // #0
  400674:       a8c37bfd        ldp     x29, x30, [sp],#48
  400678:       d65f03c0        ret
  40067c:       3ef3cf3d        .word   0x3ef3cf3d
        ...

The only thing that’s changed is that the original expression, which contained two division operations, has been replaced with one containing a single multiplication. In the first listing, the constants 0x40400000 and 0x40e00000 are the single-precision encodings of 3.0 and 7.0, each used in an fdiv; in the second, the compiler has folded 1/3 + 1/7 into the single constant 0x3ef3cf3d (approximately 0.4761905, i.e. 10/21) and uses one fmul instead.
Due to the nature of floating-point arithmetic, that constant is only an approximation of the algebraic value; there is a tiny difference between the result of the two divisions and the result of the single multiplication, and, as evidenced by our program, this tiny difference can add up.
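
We can sanity-check that reading of the constant with a few lines of C (my own quick verification, reusing the word from the listing above):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main() {
        uint32_t bits = 0x3ef3cf3d;     /* the .word from the test-math disassembly */
        float f;
        memcpy(&f, &bits, sizeof f);    /* reinterpret the bits as a float */
        printf("0x3ef3cf3d as float: %.9f\n", f);
        printf("10/21 as a double:   %.9f\n", 10.0 / 21.0);
        return 0;
}

The two printed values agree only to about seven significant digits, which is all single precision carries; repeated across ten million accumulations, that helps explain the discrepancy in our program’s output.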

The gcc manual says of -funsafe-math-optimizations that “this option is not turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.”

We’ve seen with our program that the code produced with this compiler option can be significantly faster. If precision is not an issue, and the data being processed is either small enough or distributed in such a way that large variations in the result don’t accumulate (unlike in our program, where a tiny error was added to the total 10 million times), it might be okay to use -funsafe-math-optimizations for that extra speed boost; but, again, remember that it is unsafe, and might give unexpected results.

-ftracer

-ftracer, according to the gcc manual, “performs tail duplication to enlarge superblock size. This transformation simplifies the control flow of the function allowing other optimizations to do better job.”
Superblocks are used during trace scheduling, an optimization technique that predicts frequent execution paths in a program and tries to arrange the code so that the processor can take advantage of parallelism, queuing up blocks of instructions that don’t depend on each other (i.e., one block can begin executing without waiting for a branch or a result from another). A superblock is a set of instructions that control can enter only at the top: you can’t begin executing instructions in the middle of a superblock. The larger the superblock, the less time the processor has to wait for the completion of other instructions before proceeding.
Tail duplication is the process of copying common parts of code in order to create larger superblocks. For example, consider the following program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
        srand(time(NULL));
        int x = rand() % 100;
        int y = rand() % 100;

        if (x == 10) {
                x++;
        }
        else {
                y++;
        }
        x += 10;
        y += 10;
        printf("%d, %d\n", x, y);

        return 0;
}

What tail duplication should do is copy the statements that follow the conditional (the x += 10, y += 10, and printf statements) into both branches of the conditional. This creates larger straight-line blocks of code that can, ideally, be scheduled for parallel execution.
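
Written out by hand in C, the transformed control flow would look something like this (a conceptual sketch, not actual gcc output):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
        srand(time(NULL));
        int x = rand() % 100;
        int y = rand() % 100;

        if (x == 10) {
                /* the tail has been duplicated into this branch... */
                x++;
                x += 10;
                y += 10;
                printf("%d, %d\n", x, y);
        }
        else {
                /* ...and into this one */
                y++;
                x += 10;
                y += 10;
                printf("%d, %d\n", x, y);
        }

        return 0;
}

The code grows, but each branch is now a single block with one entry point at the top, which is what gives the scheduler room to work.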

As the gcc manual says, this is especially useful when performing additional optimizations because it allows the compiler to move code blocks around more freely.

Compiling the program with only the -ftracer option doesn’t change it: the binaries produced by gcc test.c and gcc -ftracer test.c are identical. Compiling with -O2 and -ftracer, however, does. In the objdump listing, the file that had been compiled with -ftracer has an extra call to printf in its main section. Moreover, the program has extra instructions inside what would be the section of code that executes when x==10. This seems to imply that tail duplication has taken place.

I did try to build longer, more complicated programs to see what -ftracer does in their compilation, but the results were inconclusive, to say the least. -ftracer seemed not to kick in except at higher optimization levels, and those mangled the machine code beyond my ability to analyze it well enough to say anything important about it 😦 But the theory, I think, makes sense, and I imagine that -ftracer could produce faster code (certainly not smaller, on account of the tail duplication) when dealing with much larger and more complicated source files.

Update 18/03/16: So I presented my findings to the class. The professor said that -ftracer is one of the optimization flags that actually makes the program worse on its own, but does so in such a way that the intermediate representation is more amenable to other optimizations; so, by itself, it is not very useful, but combined with further optimizations it may make a significant difference.
It was also suggested that tail merging, which occurs by default at certain optimization levels, might have been thwarting my attempts to obtain a useful result from -ftracer (tail merging is essentially the inverse transformation: identical code at the end of both branches is merged back into one shared copy, undoing exactly what tail duplication creates). This, I think, is why it is so difficult to analyze optimized machine code: you’ve not really a clear idea of which optimizations have been applied, and where, and why (also when and how). Anyway, I think (hope) most people understood, if not the exact intricacies of the flag’s effects, then at least the basic idea behind it.

Here is a link to the presentation slides:
Presentation Slides

Further Reading

GCC documentation – optimizations
Semantics of Floating Point Math in GCC
Floating Point Representation
Superblock Scheduling and Tail Duplication
The Superblock: An Effective Technique for VLIW and Superscalar Computation
VLIW Processors and Trace Scheduling: Super Block Scheduling