Project: Improving lrzip; Phase Two

In my last post, I mentioned that lrzip, when given the -z option, complains of an illegal instruction and crashes on AArch64. I did some debugging with GDB, hoping to find the cause of this, but the situation seemed to be that the program just sort of lost itself during execution:

Program received signal SIGILL, Illegal instruction.
0x000003ff7b000000 in ?? ()

Even though the package had been compiled with -g (to produce debugging information) and we’d downloaded all the separate debuginfos, the program still lost its way, and it couldn’t tell us where it was.

The reason for this turned out to be very simple. I took a look at the files in the libzpaq folder and found, in libzpaq.h, the following comment:

“By default, LIBZPAQ uses JIT (just in time) acceleration. This only
works on x86-32 and x86-64 processors that support the SSE2 instruction
set. To disable JIT, compile with -DNOJIT. To enable run time checks,
compile with -DDEBUG. Both options will decrease speed.”

JIT code is created during execution time and then executed. libzpaq.cpp contained several functions that filled various arrays with hex numbers representing machine instructions. Those arrays were then written to memory and executed as code. The problem is that those machine instructions are x86 instructions. So, on our AArch64 machines, the program was trying to execute data it had no business executing, and that’s why it crashed.

libzpaq.cpp has a check to make sure that the platform it’s being compiled on is x86, but it isn’t a very good check:

// x86? (not foolproof)
  const int S=sizeof(char*);      // 4 = x86, 8 = x86-64
  U32 t=0x12345678;
  if (*(char*)&t!=0x78 || (S!=4 && S!=8))
    error("JIT supported only for x86-32 and x86-64");

That snippet of code checks that the system uses little-endian format (which x86 does) and that a pointer is either 4 or 8 bytes long. AArch64 also uses little-endian format (AArch64 is bi-endian in regards to reading data, but by default it is little-endian) and has same-sized pointers, so it passes the test, which is something that shouldn’t happen.

It seems that this is a known problem. An issue was submitted to lrzip’s GitHub’s page a little over a month ago, but no solution was proposed beyond that mentioned in the libzpaq.h file (ie. passing to the C++ compiler the -DNOJIT flag when compiling the code).

I first thought to translate the encoded x86 instructions into the equivalent AArch64 instructions . . . but there are a lot of instructions, and I don’t think I could get it done in time (and I am sure I would run into trouble concerning the handling of things like x86’s variable-length vs AArch64’s 32-bit instructions, etc.)

I wanted to see how big of a difference the JIT code makes, so I tried running the ZPAQ algorithm on x86_64 with and without the -DNOJIT flag.

Here is the result of running lrzip -z normally:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  1.355MB/s. 
Total time: 00:06:08.42

To compile the program with the -DNOJIT flag, we just need to add it to the C++ flags when we run the configure script:

[andrei@localhost lrzip-master]$ ./configure CXXFLAGS="-g -O2 -DNOJIT"

The -g and -O2 flags are set by automake (which lrzip uses to create its configure script and makefiles) if CXXFLAGS isn’t defined by the user, so if we are defining CXXFLAGS we need to set them as well, for consistency’s sake.

Now, here’s the same operation without the JIT code:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  0.919MB/s. 
Total time: 00:09:04.74

Without the JIT code, the program takes 3 extra minutes, or slightly over 30% more time.

Chris suggested that such a large discrepancy between the JIT code and the compiled C code might indicate that the C code isn’t quite optimized as well as it could be. I looked at the C code and, while there are some small pieces I could try to optimize, I do not really understand the algorithm at all, and the code is difficult to read. I’ll try to keep going but I don’t have anything to show for my efforts here.

Anyway, at least I edited libzpaq.cpp. I added the following preprocessor directives near the top of the file:

// GNU
#ifdef __arm__
#define NOJIT 1
#endif
#ifdef __aarch64__
#define NOJIT 1
#endif

NOJIT, which is supposed to be set with the -DNOJIT flag, is checked in every function that uses JIT code, and if it is defined the program uses regular C code instead. So, with this change, if the preprocessor detects that we’re on an ARM machine, it just sets NOJIT and the other conditionals take care of the rest. It’s an inelegant solution (and I suppose it would make more sense to check instead that the program is running on an x86 architecture and enable NOJIT otherwise) but it works on aarchie and betty, and the ZPAQ algorithm defaults to using the C code. I’ve only tested this on gcc, though; other compilers have different (or no?) macros to check for ARM architectures, so this is not the best solution. I’ll try to refine it but I don’t know if it’ll be worth submitting to the community, since it is such a simple thing and could be achieved just as easily by compiling the code with the -DNOJIT flag.
Modifying the configure.ac script to check the architecture and add -DNOJIT to CXXFLAGS accordingly also works, but I think it’s better just to have a safeguard in the libzpaq.cpp source file itself, because adding an extra C++ flag by default on non-x86 machines is not really expected behavior.

I took a break from the ZPAQ algorithm and tried to find something to optimize in the regular LZMA code (which is the path taken by default when lrzip is called without specifying an algorithm), but this, too, proved fruitless (although the code is much easier to read). Two functions take up 90% of the execution time, and I couldn’t improve either of them.

Two pieces of code in particular (the first from LzmaEnc.c and the second from LzFind.c) I tried to wrest some improvement in performance from.

p->reps[3] = p->reps[2];
p->reps[2] = p->reps[1];
p->reps[1] = p->reps[0];
p->reps[0] = pos;

Here I tried to use memmove to shift the array (it’s an UInt32 array) forward all at once, but it didn’t have any effect on performance.

if (++len != lenLimit && pb[len] == cur[len])
          while (++len != lenLimit)
            if (pb[len] != cur[len])
              break;

This I tried to condense to a single while loop (or at least one while loop and a single if statement) in order to reduce the number of branches and operations, but that actually made the code slower. It might be the case here that the compiler optimizes this particular logic layout best.

So, that’s where things stand at the moment. It still looks like the ZPAQ C code might be the most promising venture, if only because I can’t seem to optimize the LZMA code at all. I’ll keep trying both options, though. I also haven’t looked into other compiler options yet, so that’s also still a possibility.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s