SPO600 Project: Phase Three and Conclusion

There is a saying attributed to Edison that goes something along the lines of “I have not failed–I have just found 10,000 ways that don’t work.”

The goal of the project was to optimize a selected open-source compression or checksum package and produce a meaningful improvement in performance. I chose lrzip, partly because it seemed interesting and partly because its creator is a physician, a career I myself had considered at length back in high school before deciding that hospitals and life-and-death situations are scary. This might not have been the best choice (nor, I guess, the best reason to make a choice) because, well, lrzip looks pretty good. I could find no area of optimization that would amount to a substantive improvement. Whenever I thought I had come across an opportunity, it turned out not to be one at all.

Optimization, I’ve come to realize, is pretty hard. Conversely, making a program run worse, even unintentionally, is very easy.

In phases one and two of this project I noted that lrzip crashed on AArch64 platforms when compressing a file with ZPAQ due to ZPAQ’s use of x86-specific JIT code. I thought to translate the x86 instructions into ARM instructions, so that JIT code could be used on both architectures, but that was too tall an order. I chose instead to focus on the C++ code that is used when JIT is unavailable, to try to optimize it and bring its runtime closer to that of the JIT version of the program. Also, I examined the LZMA code in the package (as LZMA is the default compression algorithm lrzip uses) and tried to optimize that as well.

Now, let me count the 10,000 ways.

The first order of business was to profile the program in order to see which functions took up the majority of the execution time. gprof was the tool for this. First I built the package with the -pg flag, which generates extra code to write profile information. This I did by setting it as one of the C, C++, and linker flags when calling the configure script:

[andrei@localhost lrzip-master]$ ./configure CFLAGS="-pg -O2" CXXFLAGS="-pg -O2" LDFLAGS="-pg"

Then I built the package using make (made the package?) and then ran lrzip on a 500MB text file.

[andrei@localhost lrzip-master]$ ./lrzip text
Output filename is: text.lrz
text - Compression Ratio: 3.937. Average Compression Speed:  2.041MB/s.
Total time: 00:04:04.53

This generated a gmon.out file in the directory, which lets us examine lrzip using the gprof tool. I used the following command to create an image from the profile information:

[andrei@localhost lrzip-master]$ gprof lrzip | gprof2dot | dot -Tpng > image.png

[gprof call graph for lrzip: GetMatchesSpec1 dominates the execution time]

We can see from the graph that the program spent the vast majority of its time in the function GetMatchesSpec1, which is defined in the file lzma/C/LzFind.c. The graph tells us that the function was called 200 million times, so I figured that any marginal improvement I could make to the function would result in a significant gain in performance.
The function uses two Byte pointers to move along the stream of data read from the text file, and two UInt32 pointers (typedef’d as CLzRef) to keep track of matches from the data in a circular buffer.

CLzRef *ptr0 = son + (_cyclicBufferPos << 1) + 1;
CLzRef *ptr1 = son + (_cyclicBufferPos << 1);

Since the two pointers point to adjacent data locations, I tried to define one in terms of the other (removing the “+1” from ptr0’s definition and then setting ptr1 = ptr0++), but it had no effect; I suspect the machine code generated is the same either way.
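
In concrete terms, the attempt looked something like this (reconstructed from memory rather than copied from my patch):

CLzRef *ptr0 = son + (_cyclicBufferPos << 1);  /* the "+ 1" removed */
CLzRef *ptr1 = ptr0++;  /* ptr1 keeps the low slot; ptr0 ends up one above it */

Either way the compiler has to compute the same two addresses, so it makes sense that the output didn’t change.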

if (++len != lenLimit && pb[len] == cur[len])
          while (++len != lenLimit)
            if (pb[len] != cur[len])
              break;

This I rewrote as a single while loop, but, as I said in phase two, it had no (or a negative) effect on the executable. The objdump output revealed that the original code and my edit produced identical instructions with only a slight change in layout.
I also rewrote this section of the code entirely to check multiple bytes at a time by XORing chunks of data from pb and cur, but even before fine-tuning the code to set len to the proper value after the first different chunk of data was found, I realized that the overhead of doing all this made the program significantly slower, so I abandoned that course of action. (For the text file, the loop only ran a few times with every function call, so checking very large chunks of data would be counter-productive).
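
The chunked comparison looked conceptually like this (a simplified, from-memory sketch; the real attempt also had to recover the exact value of len after a mismatching chunk was found):

/* Compare 8 bytes of pb and cur at a time; a non-zero XOR means the
   chunks differ somewhere inside, and a byte-wise scan takes over. */
while (len + 8 <= lenLimit)
{
  UInt64 a, b;
  memcpy(&a, pb + len, sizeof(a));  /* memcpy sidesteps alignment issues */
  memcpy(&b, cur + len, sizeof(b));
  if (a ^ b)
    break;  /* mismatch somewhere in this chunk; find the exact byte below */
  len += 8;
}
/* ...followed by the original byte-by-byte loop to pin down len */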

After that, I tried to see if I could find something to optimize in LzmaEnc_CodeOneBlock, which is in lzma/C/LzmaEnc.c. I couldn’t, at least not beyond that failed experiment with memmove that I wrote about in phase two.

So, LZMA I couldn’t make better. I then went back to ZPAQ. Here’s the profile information for lrzip when it’s invoked with the -z flag after having been compiled with -DNOJIT to turn off the JIT code:

[gprof call graph for lrzip -z built with -DNOJIT: update0 and predict0 dominate the execution time]

Both update0 and predict0 are structured as a for loop around a switch statement that dispatches on an element of an array; that element comes from the block header array of a ZPAQL object, which represents ZPAQL instructions. I couldn’t find any way to optimize the functions without breaking the logic of the algorithm. The only thing I could do was replace some multiplication operations with bitshifts, but that didn’t affect the runtime (I think the compiler already does that where it can).
The execute function, which consists of a switch statement of over 200 cases, takes up only 3% of the execution time, and try as I might I couldn’t make it run measurably faster, so that didn’t work out either.

My attempts to edit the source code having been unsuccessful, I tried to see if any compiler options would make the program go faster, and, indeed, -Ofast (which enables -ffast-math) created an executable whose LZMA compression consistently outperformed the default -O2 executable by about 1% without introducing any errors into the compressed file. -ffast-math, though, is probably not something one would want in a compression tool, and there might be potential for mathematical errors somewhere in the code. Besides, the 1% difference, though I tested it quite a few times, might just have been a coincidence.

In the end, all I could do was add a small preprocessor condition in the libzpaq/libzpaq.h file that checks for an x86 architecture and disables the JIT code if none is found. I created a patch file, but I am not sure it’s worth submitting, since 1) it is such an insignificant change and 2) it only checks for GCC and Visual Studio macros, so if the program is compiled on an x86 processor with a different compiler, the JIT code will be deactivated anyway, which would have a significant negative impact on performance. Here is the patch file for, at least, posterity:

From 3976cdc2640adbe593a09bba010130fcf74ef809 Mon Sep 17 00:00:00 2001
From: Andrei Topala <atopala@myseneca.ca>
Date: Thu, 21 Apr 2016 17:56:34 -0400
Subject: [PATCH] Added a preprocessor check for an x86 architecture in
 libzpaq.h as a requisite for enabling JIT code.

---
 libzpaq/libzpaq.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/libzpaq/libzpaq.h b/libzpaq/libzpaq.h
index 93387da..e2b5058 100644
--- a/libzpaq/libzpaq.h
+++ b/libzpaq/libzpaq.h
@@ -25,6 +25,11 @@ comprises the reference decoder for the ZPAQ level 2 standard.
 #ifndef LIBZPAQ_H
 #define LIBZPAQ_H

+// If we're not on an x86 processor, disable JIT
+#if !defined(i386) && !defined(_M_IX86) && !defined(__x86_64__) && !defined(_M_X64)
+#define NOJIT
+#endif
+
 #ifndef DEBUG
 #define NDEBUG 1
 #endif
-- 
2.5.5

That concludes this chapter in our journey through the wide waters of software portability and optimization. SPO600 was certainly the most challenging and most interesting course I’ve taken at Seneca, and I’m sure I learned a great deal. I do wish I’d had more time to blog about it, though. That is my only regret, and also the 10,000 failures.

Inline Assembler Lab

Let’s look at the use of inline assembler in a C program.

Assembly-language code can be embedded into a C (or other high-level language) source file in order to optimize a certain part of the program by using processor-specific instructions that are faster than what the compiler would generate from functionally-equivalent C code. Moreover, inline assembly can be used for hardware-specific tasks that would be impossible in high-level code.

An example of inline assembly code can be found in K9Copy, a DVD backup tool. In the file src/dvdread/md5.h, there is the following piece of code:

#if defined __GNUC__ && defined __i386__
static md5_uint32
rol(md5_uint32 x, int n)
{
  __asm__("roll %%cl,%0"
          :"=r" (x)
          :"0" (x),"c" (n));
  return x;
}
#else
# define rol(x,n) ( ((x) << (n)) | ((x) >> (32-(n))) )
#endif

If the program is compiled with a GNU compiler and on an i386 processor, the function rol(md5_uint32 x, int n) is defined. It contains the assembly instruction roll (rotate left long), which rotates the 32-bit integer x to the left by n places, bringing bits that fall off the end back to the beginning. If the program is compiled with a different compiler or on a different processor, the function is instead replaced with a macro function that achieves the same result through bit shifts; it would use multiple instructions whereas the version using inline assembly uses only one.
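
To make the rotate-left semantics concrete, here is a small standalone example of my own (not from K9Copy) using the macro version:

#include <stdio.h>
#include <stdint.h>

/* The shift-based fallback from md5.h */
#define rol(x,n) ( ((x) << (n)) | ((x) >> (32-(n))) )

int main() {
  uint32_t x = 0x80000001;  /* highest and lowest bits set */
  printf("%#010x\n", rol(x, 4));  /* prints 0x00000018: both bits moved left by 4, the top one wrapping around to bit 3 */
  return 0;
}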

K9Copy’s approach, I think, is a good use of inline assembly: it’s there to be used if it can make the executable faster, and if it can’t be used then there is an alternative. Unless there are machines that have the __i386__ macro defined but no rol instruction available (which I don’t think would be the case?), there are no problems with portability. The concept is sound. However, this particular source file was written in 1999. Modern compilers would almost certainly recognize the second function (that is, the macro using bit shifts) and replace it with a rol instruction if possible. Using inline assembly for optimization is not so straightforward a job as one might think, or as it would have been 20 years ago. Most attempts at optimization with inline assembly would probably be surpassed by the compiler’s optimizations; at worst, inline assembly might even impede the compiler and result in an executable that performs worse than it would have had the compiler been left to its own devices. There are, I’m sure, exceptions, and in some situations using inline assembly in an optimizing capacity could be the best course of action, but it probably shouldn’t be the first resort.

Inline assembly code should, of course, still be used for certain platform-specific operations or for tasks that cannot be done with high-level languages–those things, at least for the moment, lie beyond the powers of the compiler.

Project: Improving lrzip; Phase Two

In my last post, I mentioned that lrzip, when given the -z option, complains of an illegal instruction and crashes on AArch64. I did some debugging with GDB, hoping to find the cause of this, but the situation seemed to be that the program just sort of lost itself during execution:

Program received signal SIGILL, Illegal instruction.
0x000003ff7b000000 in ?? ()

Even though the package had been compiled with -g (to produce debugging information) and we’d downloaded all the separate debuginfos, the program still lost its way, and it couldn’t tell us where it was.

The reason for this turned out to be very simple. I took a look at the files in the libzpaq folder and found, in libzpaq.h, the following comment:

“By default, LIBZPAQ uses JIT (just in time) acceleration. This only
works on x86-32 and x86-64 processors that support the SSE2 instruction
set. To disable JIT, compile with -DNOJIT. To enable run time checks,
compile with -DDEBUG. Both options will decrease speed.”

JIT code is created during execution time and then executed. libzpaq.cpp contained several functions that filled various arrays with hex numbers representing machine instructions. Those arrays were then written to memory and executed as code. The problem is that those machine instructions are x86 instructions. So, on our AArch64 machines, the program was trying to execute data it had no business executing, and that’s why it crashed.

libzpaq.cpp has a check to make sure that the platform it’s being compiled on is x86, but it isn’t a very good check:

// x86? (not foolproof)
  const int S=sizeof(char*);      // 4 = x86, 8 = x86-64
  U32 t=0x12345678;
  if (*(char*)&t!=0x78 || (S!=4 && S!=8))
    error("JIT supported only for x86-32 and x86-64");

That snippet of code checks that the system uses little-endian format (which x86 does) and that a pointer is either 4 or 8 bytes long. AArch64 also uses little-endian format by default (it is bi-endian with respect to data access) and has pointers of the same size, so it passes a test that it shouldn’t.

It seems that this is a known problem. An issue was submitted to lrzip’s GitHub page a little over a month ago, but no solution was proposed beyond the one mentioned in the libzpaq.h file (i.e., passing the -DNOJIT flag to the C++ compiler).

I first thought to translate the encoded x86 instructions into the equivalent AArch64 instructions . . . but there are a lot of instructions, and I don’t think I could get it done in time (and I am sure I would run into trouble concerning the handling of things like x86’s variable-length vs AArch64’s 32-bit instructions, etc.)

I wanted to see how big of a difference the JIT code makes, so I tried running the ZPAQ algorithm on x86_64 with and without the -DNOJIT flag.

Here is the result of running lrzip -z normally:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  1.355MB/s. 
Total time: 00:06:08.42

To compile the program with the -DNOJIT flag, we just need to add it to the C++ flags when we run the configure script:

[andrei@localhost lrzip-master]$ ./configure CXXFLAGS="-g -O2 -DNOJIT"

The -g and -O2 flags are set by automake (which lrzip uses to create its configure script and makefiles) if CXXFLAGS isn’t defined by the user, so if we are defining CXXFLAGS we need to set them as well, for consistency’s sake.

Now, here’s the same operation without the JIT code:

[andrei@localhost lrzip-0.621]$ ./lrzip -z text 
Output filename is: text.lrz
text - Compression Ratio: 4.810. Average Compression Speed:  0.919MB/s. 
Total time: 00:09:04.74

Without the JIT code, the program takes about 3 extra minutes, or nearly 50% more time (equivalently, the average compression speed drops by about a third).

Chris suggested that such a large discrepancy between the JIT code and the compiled C code might indicate that the C code isn’t optimized quite as well as it could be. I looked at the C code and, while there are some small pieces I could try to optimize, I don’t really understand the algorithm, and the code is difficult to read. I’ll keep trying, but I don’t have anything to show for my efforts here.

Anyway, at least I edited libzpaq.cpp. I added the following preprocessor directives near the top of the file:

// GNU
#ifdef __arm__
#define NOJIT 1
#endif
#ifdef __aarch64__
#define NOJIT 1
#endif

NOJIT, which is supposed to be set with the -DNOJIT flag, is checked in every function that uses JIT code; if it is defined, the program uses the regular C code instead. So, with this change, if the preprocessor detects that we’re on an ARM machine, it just sets NOJIT and the other conditionals take care of the rest. It’s an inelegant solution (and I suppose it would make more sense to check instead that the program is running on an x86 architecture and enable NOJIT otherwise), but it works on aarchie and betty, and the ZPAQ algorithm defaults to using the C code. I’ve only tested this on gcc, though; other compilers have different (or no?) macros to check for ARM architectures, so this is not the best solution. I’ll try to refine it, but I don’t know if it’ll be worth submitting to the community, since it is such a simple thing and could be achieved just as easily by compiling the code with the -DNOJIT flag.
Modifying the configure.ac script to check the architecture and add -DNOJIT to CXXFLAGS accordingly also works, but I think it’s better just to have a safeguard in the libzpaq.cpp source file itself, because adding an extra C++ flag by default on non-x86 machines is not really expected behavior.
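
For reference, the configure.ac version looked roughly like this (an untested-from-memory sketch; AC_CANONICAL_HOST and the $host_cpu variable are standard autoconf, but the exact patterns may need adjusting):

AC_CANONICAL_HOST
case "$host_cpu" in
  i?86|x86_64) ;;                      # x86: leave the JIT code enabled
  *) CXXFLAGS="$CXXFLAGS -DNOJIT" ;;   # anything else: fall back to the C code
esac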

I took a break from the ZPAQ algorithm and tried to find something to optimize in the regular LZMA code (which is the path taken by default when lrzip is called without specifying an algorithm), but this, too, proved fruitless (although the code is much easier to read). Two functions take up 90% of the execution time, and I couldn’t improve either of them.

Two pieces of code in particular (the first from LzmaEnc.c and the second from LzFind.c) I tried to wrest some improvement in performance from.

p->reps[3] = p->reps[2];
p->reps[2] = p->reps[1];
p->reps[1] = p->reps[0];
p->reps[0] = pos;

Here I tried to use memmove to shift the array (it’s an UInt32 array) forward all at once, but it didn’t have any effect on performance.
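
For reference, the memmove variant was along these lines (reconstructed):

/* Shift the three older reps up in one call instead of three assignments;
   reps is an array of UInt32 inside the encoder state. */
memmove(&p->reps[1], &p->reps[0], 3 * sizeof(UInt32));
p->reps[0] = pos;

Presumably, for a four-element shift, the call overhead (or the equivalent inlined code) simply cancels out whatever is saved.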

if (++len != lenLimit && pb[len] == cur[len])
          while (++len != lenLimit)
            if (pb[len] != cur[len])
              break;

This I tried to condense to a single while loop (or at least one while loop and a single if statement) in order to reduce the number of branches and operations, but that actually made the code slower. It might be the case here that the compiler optimizes this particular logic layout best.
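
The condensed version was roughly this (reconstructed):

/* Same logic in one loop: advance len until we hit the limit
   or the bytes diverge. */
while (++len != lenLimit && pb[len] == cur[len])
  ;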

So, that’s where things stand at the moment. It still looks like the ZPAQ C code might be the most promising venture, if only because I can’t seem to optimize the LZMA code at all. I’ll keep trying both options, though. I also haven’t looked into other compiler options yet, so that’s also still a possibility.

Project – Improving lrzip; Phase One

So now frost slowly thaws and the winds of blooming spring carry to us tidings of final projects. Our project for the SPO600 course centers around the improvement of a selected checksum or compression open-source package. I chose lrzip, and I hope that I will be able to alter the code in such a way that we can gain a non-negligible improvement in execution time. That, at least, is what I will be aiming for.

So, what even is lrzip?

lrzip – Long Range ZIP or LZMA RZIP

lrzip is a compression program optimized for the compression of large files on systems with large amounts of RAM.

lrzip operates in two basic steps, and on a single file.

First, it uses an extended version of rzip to perform a long-distance redundancy reduction (i.e., it encodes duplicate or redundant chunks of data over a large distance). This is done with an LZ77-based algorithm: a sliding-window technique keeps data in memory in order to find matches, and larger amounts of RAM allow the use of a larger window, which leads to better compression. Where rzip is limited to a maximum window size of 900MB, lrzip’s window scales with available RAM. Duplicate data is replaced with a reference denoting the distance to the original data and the size of the original chunk of data.
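
As a toy example (my own, not from the lrzip docs): in the stream “the cat sat on the mat; the cat sat on the hat”, the second “the cat sat on the ” can be encoded as a back-reference along the lines of (distance: 24, length: 19), provided the first occurrence is still inside the window.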

A visual representation of the LZ77 algorithm:
[Diagram: the LZ77 sliding-window algorithm]

Second, after the initial reduction, lrzip compresses the file further by running one of LZMA (the default), ZPAQ, LZO, GZIP, or BZIP2 on the reduced data file.

So, I built the package on my x86_64 computer (which required the installation of a few dependencies, but other than that it went okay) and on our AArch64 aarchie server (which, again, needed some dependencies (though some were already installed)). The program ran fine on both systems.

Before we begin tinkering with the code, let’s run some benchmarks on aarchie. We’ll use three files. The first is a text file containing 500MB of text downloaded from Project Gutenberg. The second is a 500MB file composed entirely of random data. The third is a 500MB file composed entirely of zero data.

[atopala@aarchie lrzip-0.621]$ time ./lrzip text
Output filename is: text.lrz
text - Compression Ratio: 3.918. Average Compression Speed:  1.217MB/s.
Total time: 00:06:50.91

real	6m50.924s
user	49m2.940s
sys		0m9.290s
[atopala@aarchie lrzip-0.621]$ time ./lrzip random 
Output filename is: random.lrz
random - Compression Ratio: 1.000. Average Compression Speed:  6.329MB/s.
Total time: 00:01:18.78

real	1m18.796s
user	1m34.520s
sys		0m2.960s
[atopala@aarchie lrzip-0.621]$ time ./lrzip zero
Output filename is: zero.lrz
zero - Compression Ratio: 7003.299. Average Compression Speed:  6.024MB/s.
Total time: 00:01:22.92

real	1m22.933s
user	5m17.620s
sys		0m2.490s

The text file took much longer than the others, and it used 23.9% of the memory on aarchie. Compressing the random file required only about 7% of the memory. Compressing the zero file, which has far more duplicate data, used much more memory, peaking at about 34.5%. The text file also showed a much higher ratio of user time to real time than the other files did, so it was taking greater advantage of parallel processing; the zero file, too, was compressed using multiple threads. The vast majority of the time was spent in user mode; time spent in the kernel is negligible.

So, the goal is to beat these benchmarks. Maybe a way to recognize incompressible data could speed up the processing of the random file; or maybe a method to detect extremely large runs of identical data quickly could help with the zero file. The important file, though, is the text file, and ideally we would improve the runtime of its compression. Setting the lzma level to 9 (the default is 7) brings the compression ratio up to 4.029 (a minor but not insignificant improvement) but takes twice as long; maybe this could be sped up.

So, apart from improving the algorithm (which, I think, is very unlikely to pan out, since lrzip has been receiving regular updates as recently as last year, so the algorithm should be pretty close to perfect barring potential oversights) and changing build options (which might not prove fruitful, since the autogen script sets the default optimization level to 2, and further optimizations might have a negative impact; but this is probably a good area to work in, since I don’t think lrzip has been built with AArch64 in mind), these are potential areas of improvement:

-lrzip uses x86 and x86_64 assembly language files that optimize CRC computation. Maybe we could rewrite the files for AArch64. As of now, “if you don’t have the nasm assembler or have a ppc or other non-x86 system, the standard C CRC routines will be compiled and linked in.” The TODO file suggests that the x86_64 version of the assembly code doesn’t work, so that might need fixing. However, the TODO file also suggests that this is of a low priority.
-the program crashes with the -z flag (ZPAQ compression) on aarchie, claiming “Illegal instruction” after starting the first thread to compress a piece of data. This occurs on betty as well. It doesn’t happen on x86_64. I am not yet sure of the cause, or if it is something that happens on AArch64 systems in general, but if it is a bug on AArch64 then I think this should become my main focus.

I’ll begin with the ZPAQ bug (or what I think might be a bug, at least), and if that doesn’t pan out then I’ll try to edit some of the build options and code to optimize it for AArch64 (since I figure that there may be other people working on lrzip, but most might not have access to an AArch64 machine). If that doesn’t work out either, I’ll try to optimize the C code, and if I can’t do that I’ll work on the (low-priority) assembly language files.

GCC Compiler Options: -funsafe-math-optimizations and -ftracer

GCC contains several flags that can be set in order to guide the optimization of a file during compilation.

Let’s look at two of them:

-funsafe-math-optimizations

The gcc manual says that this option “allows optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards.”

Essentially, what this means is that the compiler will take certain shortcuts in calculating the results of floating-point operations, which may result in rounding errors or potentially the program’s malfunctioning (hence the unsafe part of the name).

-funsafe-math-optimizations enables the compiler options -fno-signed-zeros (which ignores the sign of zeroes), -fno-trapping-math (which assumes that the code won’t generate user-visible traps like division by zero, etc), -fassociative-math (which allows re-association of operands) and -freciprocal-math (which allows for the replacing of division with multiplication by reciprocals).

To test this, we’ll use a simple program that adds some fractions together lots of times and prints the result:

#include <stdio.h>
#include <stdlib.h>

int main() {
        double sum = 0.0;
        double d = 0.0;

        int i;

        for (i=1; i<=10000000; i++) {
                d = (float)i/3 + (float)i/7;
                sum += d;
        }

        printf("%.4f\n", sum);

        return 0;
}

We’ll compile it with the -funsafe-math-optimizations flag and without:

[atopala@aarchie presentation]$ gcc -O0 test.c -o test
[atopala@aarchie presentation]$ gcc -O0 -funsafe-math-optimizations test.c -o test-math

I used the time command to compare the execution times of each compiled binary. Here is a screenshot of the results:

[Screenshot: time results for test and test-math]

The binary that had been compiled with the -funsafe-math-optimization flag was much faster, but the result was off by quite a bit.

Let’s use objdump -d to examine the disassembly of each binary file.

[atopala@aarchie presentation]$ objdump -d test
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       580006c0        ldr     x0, 4006b0 <main+0xe0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000680        ldr     x0, 4006b0 <main+0xe0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000023        b       40067c <main+0xac>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180005c0        ldr     w0, 4006b8 <main+0xe8>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e201821        fdiv    s1, s1, s0
  400610:       1e260021        fmov    w1, s1
  400614:       b94027a0        ldr     w0, [x29,#36]
  400618:       1e220001        scvtf   s1, w0
  40061c:       1e260022        fmov    w2, s1
  400620:       180004e0        ldr     w0, 4006bc <main+0xec>
  400624:       1e270040        fmov    s0, w2
  400628:       1e270001        fmov    s1, w0
  40062c:       1e211800        fdiv    s0, s0, s1
  400630:       1e260000        fmov    w0, s0
  400634:       1e270020        fmov    s0, w1
  400638:       1e270001        fmov    s1, w0
  40063c:       1e212800        fadd    s0, s0, s1
  400640:       1e260000        fmov    w0, s0
  400644:       1e270000        fmov    s0, w0
  400648:       1e22c000        fcvt    d0, s0
  40064c:       9e660000        fmov    x0, d0
  400650:       f9000fa0        str     x0, [x29,#24]
  400654:       f94017a1        ldr     x1, [x29,#40]
  400658:       f9400fa0        ldr     x0, [x29,#24]
  40065c:       9e670021        fmov    d1, x1
  400660:       9e670000        fmov    d0, x0
  400664:       1e602821        fadd    d1, d1, d0
  400668:       9e660020        fmov    x0, d1
  40066c:       f90017a0        str     x0, [x29,#40]
  400670:       b94027a0        ldr     w0, [x29,#36]
  400674:       11000400        add     w0, w0, #0x1
  400678:       b90027a0        str     w0, [x29,#36]
  40067c:       b94027a1        ldr     w1, [x29,#36]
  400680:       5292d000        mov     w0, #0x9680                     // #38528
  400684:       72a01300        movk    w0, #0x98, lsl #16
  400688:       6b00003f        cmp     w1, w0
  40068c:       54fffb4d        b.le    4005f4 <main+0x24>
  400690:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400694:       911d8000        add     x0, x0, #0x760
  400698:       fd4017a0        ldr     d0, [x29,#40]
  40069c:       97ffff71        bl      400460 <printf@plt>
  4006a0:       52800000        mov     w0, #0x0                        // #0
  4006a4:       a8c37bfd        ldp     x29, x30, [sp],#48
  4006a8:       d65f03c0        ret
  4006ac:       d503201f        nop
        ...
  4006b8:       40400000        .word   0x40400000
  4006bc:       40e00000        .word   0x40e00000
[atopala@aarchie presentation]$ objdump -d test-math
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       58000540        ldr     x0, 400680 <main+0xb0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000500        ldr     x0, 400680 <main+0xb0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000017        b       40064c <main+0x7c>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180003e0        ldr     w0, 40067c <main+0xac>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e200821        fmul    s1, s1, s0
  400610:       1e260020        fmov    w0, s1
  400614:       1e270001        fmov    s1, w0
  400618:       1e22c021        fcvt    d1, s1
  40061c:       9e660020        fmov    x0, d1
  400620:       f9000fa0        str     x0, [x29,#24]
  400624:       f94017a1        ldr     x1, [x29,#40]
  400628:       f9400fa0        ldr     x0, [x29,#24]
  40062c:       9e670020        fmov    d0, x1
  400630:       9e670001        fmov    d1, x0
  400634:       1e612800        fadd    d0, d0, d1
  400638:       9e660000        fmov    x0, d0
  40063c:       f90017a0        str     x0, [x29,#40]
  400640:       b94027a0        ldr     w0, [x29,#36]
  400644:       11000400        add     w0, w0, #0x1
  400648:       b90027a0        str     w0, [x29,#36]
  40064c:       b94027a1        ldr     w1, [x29,#36]
  400650:       5292d000        mov     w0, #0x9680                     // #38528
  400654:       72a01300        movk    w0, #0x98, lsl #16
  400658:       6b00003f        cmp     w1, w0
  40065c:       54fffccd        b.le    4005f4 <main+0x24>
  400660:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400664:       911ca000        add     x0, x0, #0x728
  400668:       fd4017a0        ldr     d0, [x29,#40]
  40066c:       97ffff7d        bl      400460 <printf@plt>
  400670:       52800000        mov     w0, #0x0                        // #0
  400674:       a8c37bfd        ldp     x29, x30, [sp],#48
  400678:       d65f03c0        ret
  40067c:       3ef3cf3d        .word   0x3ef3cf3d
        ...

The only thing that’s changed is that the original equation, which contained two division operations, has been replaced with an equation containing one multiplication operation: with reassociation and reciprocal substitution allowed, i/3 + i/7 becomes i*(1/3 + 1/7) = i*(10/21), and the .word 0x3ef3cf3d at the end of test-math’s listing is the single-precision encoding of 10/21 ≈ 0.476190.
Due to the nature of floating-point arithmetic, that constant is only an approximation of the algebraic value; there is a tiny difference between the result of the first equation and the result of the second, and, as evidenced by our program, this tiny difference can add up.

The gcc manual says of -funsafe-math-optimizations that “this option is not turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.”

We’ve seen with our program that the code produced with this compiler option can be significantly faster. If precision is not an issue and the data being processed is either small enough or distributed in such a way that large variations in the result don’t happen (which was not the case in our program, where we just continually added to a number 10 million times), it might be okay to use -funsafe-math-optimizations for that extra speed boost–but, again, remember that it is unsafe, and might give unexpected results.

-ftracer

-ftracer, according to the gcc manual, “performs tail duplication to enlarge superblock size. This transformation simplifies the control flow of the function allowing other optimizations to do better job.”
Superblocks are used during trace scheduling, which is an optimization technique that predicts frequent execution paths in a program and tries to make it so that the program takes advantage of parallel processing by queuing up blocks of instructions that don’t depend on each other (ie, one block can begin to be executed without waiting for a branch or result from another). A superblock is a set of instructions in which control can enter only at the top–you can’t begin executing instructions in the middle of a superblock. The larger the superblock, the less time the processor has to wait for the completion of other instructions before proceeding.
Tail duplication is the process of copying common parts of code in order to create larger superblocks. For example, consider the following program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
        srand(time(NULL));
        int x = rand() % 100;
        int y = rand() % 100;

        if (x == 10) {
                x++;
        }
        else {
                y++;
        }
        x += 10;
        y += 10;
        printf("%d, %d\n", x, y);

        return 0;
}

What tail duplication should do is copy the printf and x+=10 and y+=10 statements that are outside the conditional statement so that they can be placed inside both conditional branches. This creates larger blocks of code that can, ideally, be executed in parallel.
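
As a rough source-level sketch (my illustration, not actual compiler output), the duplicated-tail version of the branch would look like this:

if (x == 10) {
        x++;
        x += 10;  /* duplicated tail */
        y += 10;
        printf("%d, %d\n", x, y);
}
else {
        y++;
        x += 10;  /* duplicated tail */
        y += 10;
        printf("%d, %d\n", x, y);
}

Each branch now forms one larger straight-line block, at the cost of some duplicated code.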

As the gcc manual says, this is especially useful when performing additional optimizations because it allows the compiler to move code blocks around more freely.

Compiling the program with only the -ftracer option doesn’t change it–the binaries produced by gcc test.c and gcc -ftracer test.c are identical. Compiling with -O2 and -ftracer, however, does. In the objdump listing, the file that had been compiled with -ftracer has an extra call to printf in its main section. Moreover, the program has extra instructions inside what would be the section of code that executes when x==10. This seems to imply that tail duplication has taken place.

I did try to build longer, more complicated programs to see what -ftracer does in their compilation, but the results were inconclusive, to say the least. -ftracer seemed not to kick in except at the higher levels of optimization, and those mangled the machine code beyond my ability to analyze it well enough to say anything important about it 😦 But the theory, I think, makes sense, and I imagine that -ftracer could produce faster code (certainly not smaller, on account of the tail duplication) when dealing with much larger and more complicated source files.

Update 18/03/16: So I presented my findings to the class. The professor said that -ftracer is one of the optimization flags that actually make the program worse on their own, but do so in such a way that the intermediate representation is more amenable to other optimizations; so, by itself, it is not very useful, but combined with further optimizations it may make a significant difference.
It was also suggested that tail merging, which occurs by default at certain optimization levels, might have been thwarting my attempt to obtain a useful result from -ftracer. This, I think, is why it is so difficult to analyze optimized machine code: you’ve not really a clear idea of which optimizations have been applied, and where, and why (also when and how). Anyway, I think (hope) most people understood if not the exact intricacies of the flag’s effects then at least the basic idea behind it.

Here is a link to the presentation slides:
Presentation Slides

Further Reading

GCC documentation – optimizations
Semantics of Floating Point Math in GCC
Floating Point Representation
Superblock Scheduling and Tail Duplication
The Superblock: An Effective Technique for VLIW and Superscalar Computation
VLIW Processors and Trace Scheduling: Super Block Scheduling

Vectorization Lab

Edit: This should be all okay now. (Sorry for posting this blog post incomplete–I had to watch Leo win his Oscar).

Here we’ll look at single instruction/multiple data (SIMD) vectorization.

A vector is a series of equal-length data elements on which operations can be performed. SIMD instructions, as the name suggests, are used to perform an operation in parallel on multiple operands. Vectorization, then, is the process of converting parts of a program to a vector implementation so that SIMD instructions can be used in order to minimize processing time. A compiler can perform auto-vectorization during a source program’s compilation in order to optimize the resultant machine code so that it uses SIMD instructions, which can, depending on the size of the data being processed, appreciably reduce the running time of the program.

Consider, for instance, the following piece of code:

for (i=0; i<1000; i++) {
        arrSum[i] = arr1[i] + arr2[i];         
}

That loop has potential for vectorization. Instead of performing the addition on each pair of elements one by one, the compiler can use a vector implementation to process multiple pairs of elements at once so that the loop runs fewer times.
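
By way of illustration, here is roughly what the compiler will do for us, written by hand with NEON intrinsics (my own sketch, assuming AArch64; the auto-vectorized build below achieves the same thing with no source changes):

#include <arm_neon.h>

void add_arrays(int32_t *arrSum, const int32_t *arr1, const int32_t *arr2) {
        int i;
        for (i = 0; i < 1000; i += 4) {
                int32x4_t a = vld1q_s32(&arr1[i]);      // load 4 ints from arr1
                int32x4_t b = vld1q_s32(&arr2[i]);      // load 4 ints from arr2
                vst1q_s32(&arrSum[i], vaddq_s32(a, b)); // add and store 4 sums at once
        }
}

The loop body runs 250 times instead of 1000.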

To invoke auto-vectorization, we need to give gcc the -O3 option, which tells it to go all out with optimization.

gcc -g -O3 lab6.c -o lab6

Here is the source code itself:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
	int arr1[1000];
	int arr2[1000];
	int arrSum[1000];
	long int sum = 0;	// initialize; the += loop below would otherwise read garbage

	int i;
	srand(time(NULL));

	for (i=0; i<1000; i++) {
		// The maximum value of rand() is the same as the capacity of int, so we halve the random number to avoid overflowing during addition
		arr1[i] = rand() / 2;
		arr2[i] = rand() / 2;
	}

	for (i=0; i<1000; i++) {
		arrSum[i] = arr1[i] + arr2[i];
	}

	for (i=0; i<1000; i++) {
		sum += arrSum[i];
	}
	
	printf("%ld\n", sum);

	return 0;
}

Now let’s look at the machine code disassembly. I’ve extracted the main section of the code and annotated it to explain what is happening as I best could:

0000000000400550 <main>:
// Here we reserve space on the stack for local variables
  400550:	d1400bff 	sub	sp, sp, #0x2, lsl #12
  400554:	d13b83ff 	sub	sp, sp, #0xee0
// Here we are storing a pair of double words to a location on the stack; these two registers are to be used as the frame and link registers
  400558:	a9bd7bfd 	stp	x29, x30, [sp,#-48]!
// Here we move the current stack pointer to the frame register
  40055c:	910003fd 	mov	x29, sp
  400560:	d2800000 	mov	x0, #0x0                   	// #0
// Again storing a pair of double words
  400564:	a90153f3 	stp	x19, x20, [sp,#16]
// Store now a single register to a location on the stack
  400568:	f90013f5 	str	x21, [sp,#32]
// Here the program branches to the time and srand functions to initialize the random seed
  40056c:	97ffffdd 	bl	4004e0 <time@plt>
  400570:	97fffff0 	bl	400530 <srand@plt>
// Here the first loop begins, in which every element of the two arrays is set to a random number
  400574:	d281fa01 	mov	x1, #0xfd0                 	// #4048
  400578:	d2800013 	mov	x19, #0x0                   	// #0
  40057c:	9100c3b5 	add	x21, x29, #0x30
  400580:	8b1d0034 	add	x20, x1, x29
  400584:	97ffffdf 	bl	400500 <rand@plt>
  400588:	0b407c00 	add	w0, w0, w0, lsr #31
  40058c:	13017c00 	asr	w0, w0, #1
  400590:	b8356a60 	str	w0, [x19,x21]
  400594:	97ffffdb 	bl	400500 <rand@plt>
  400598:	0b407c00 	add	w0, w0, w0, lsr #31
  40059c:	13017c00 	asr	w0, w0, #1
  4005a0:	b8346a60 	str	w0, [x19,x20]
// In the first loop, the counter is incremented by 4 and compared to 4000, so the loop runs 1000 times
  4005a4:	91001273 	add	x19, x19, #0x4
  4005a8:	f13e827f 	cmp	x19, #0xfa0
  4005ac:	54fffec1 	b.ne	400584 <main+0x34>
  4005b0:	d2800000 	mov	x0, #0x0                   	// #0
// This is the entry point for the second loop, after the counter is set to 0
// We add 48 to r29 (which contains the address of the stack pointer where our arrays begin) and store that in a register to be used for arguments and return values
  4005b4:	9100c3a3 	add	x3, x29, #0x30
// Then we add the loop counter to the register and store that in another register
  4005b8:	8b000062 	add	x2, x3, x0
// Now we add 4048 to the frame register and store it in r3
// We are effectively jumping over the first 4000 bytes (ie the first 1000-element int array) in the stack
  4005bc:	913f43a3 	add	x3, x29, #0xfd0
// And again we add the loop counter to r3 and store it this time in r1
  4005c0:	8b000061 	add	x1, x3, x0
// r1 and r2, I think, are used to move around the memory space to read from and write to the three arrays we are using; essentially we're using r0, which is the loop counter, as the offset from the two starting locations of the two arrays we are using
// Here we load 4 single words from the memory addressed by r2
  4005c4:	4c407841 	ld1	{v1.4s}, [x2]
// Now we're moving 8048 to r2, which we will use to jump ahead and find the third array
  4005c8:	d283ee02 	mov	x2, #0x1f70                	// #8048
// Now we load 4 single words from the memory addressed by r1
  4005cc:	4c407820 	ld1	{v0.4s}, [x1]
// Then add the stack register's value to r2 (effectively setting r2 to the memory address of the third array)...
  4005d0:	8b1d0042 	add	x2, x2, x29
// ...and the loop counter to r2 and store it in r1 (so, we store in r1 the address of the nth element in the third array, where n is the loop counter)
  4005d4:	8b000041 	add	x1, x2, x0
// This is what it's all for: vector addition
  4005d8:	4ea08420 	add	v0.4s, v1.4s, v0.4s
// In the second loop, the counter is incremented by 16 and compared to 4000, so the loop runs only 250 times (and since we are processing 4 numbers each iteration, that's 1000 numbers in total)
  4005dc:	91004000 	add	x0, x0, #0x10
// Now we store the 4 sums to the memory addressed by r1 (that is, to the four consecutive elements starting from the nth in the third array, where n is the loop counter)
  4005e0:	4c007820 	st1	{v0.4s}, [x1]
// This is where the loop counter is compared to 4000
  4005e4:	f13e801f 	cmp	x0, #0xfa0
  4005e8:	54fffe61 	b.ne	4005b4 <main+0x64>
// We will move ahead another 4000 bytes, so we pass over the third array
  4005ec:	d285e201 	mov	x1, #0x2f10                	// #12048
// r2, at this point, should, I think, hold the memory address of the start of the third array (ie the array whose elements we are now summing)
// We move that value to r0
  4005f0:	aa0203e0 	mov	x0, x2
// We add the frame register's value to r1, which will now point to the empty memory space after the three arrays
  4005f4:	8b1d0021 	add	x1, x1, x29
// Move immediate: each of the four single words here are set to 0
  4005f8:	4f000401 	movi	v1.4s, #0x0
// r0 points to the third array, so we load four single words of it; then r0 is incremented by 16 bytes (which is the total number of bytes we're working with in this instruction (4 single words x 32 bits per word = 128 bits = 16 bytes))
// This is also the point to which the loop will return on successive iterations
  4005fc:	4cdf7800 	ld1	{v0.4s}, [x0], #16
// Signed integer lengthen
// Two of the single words in vector 0 are made double words and put into vector 2
  400600:	0f20a402 	sxtl	v2.2d, v0.2s
// Then the two double words in vector 2 are added to the two double words in vector 1
  400604:	4ee18441 	add	v1.2d, v2.2d, v1.2d
// Now the second part of vector 0 is made into double words
  400608:	4f20a400 	sxtl2	v0.2d, v0.4s
// Here we check to see if r0 points to the same memory address as r1 (that is, if we've gone through all the elements of the third array and reached the empty space beyond)
  40060c:	eb01001f 	cmp	x0, x1
// And now vector 0 (which contains what were the latter two single words of vector 0 and are now two double words) is added to vector 1
// At the end of the loop, vector 1 contains two numbers which, when added together, will equal the total sum of all the numbers in the third array
  400610:	4ee18401 	add	v1.2d, v0.2d, v1.2d
  400614:	54ffff41 	b.ne	4005fc <main+0xac>
// Scalar reduce: add the two 64-bit elements of vector 1 together into the 64-bit register d1 (which overlaps v1), giving the total sum
  400618:	5ef1b821 	addp	d1, v1.2d
  40061c:	90000000 	adrp	x0, 400000 <_init-0x4a0>
// Move the first and second 64-bit elements from vector 1 to two separate registers
// This might be so that they can be used as arguments for printf?
  400620:	4e083c21 	mov	x1, v1.d[0]
  400624:	4e183c22 	mov	x2, v1.d[1]
  400628:	91214000 	add	x0, x0, #0x850
  40062c:	97ffffc5 	bl	400540 <printf@plt>
  400630:	a94153f3 	ldp	x19, x20, [sp,#16]
  400634:	f94013f5 	ldr	x21, [sp,#32]
  400638:	a8c37bfd 	ldp	x29, x30, [sp],#48
  40063c:	52800000 	mov	w0, #0x0                   	// #0
  400640:	913b83ff 	add	sp, sp, #0xee0
  400644:	91400bff 	add	sp, sp, #0x2, lsl #12
  400648:	d65f03c0 	ret
  40064c:	00000000 	.inst	0x00000000 ; undefined

The important instructions for vector addition are the following:

LD1 {Vt.<T>}, [base]
ADD Vd.<T>, Vn.<T>, Vm.<T>
ST1 {Vt.<T>}, [base]

These instructions load into a vector register multiple 1-element structures from the memory addressed by the base register; add the corresponding elements of two vectors and store the sums in a third vector; and store multiple 1-element structures from a vector back to the memory addressed by the base register.

Then, to add together all the numbers in an array to a single long int, further instructions are used:

SXTL Vd.<Td>, Vn.<Ts>
SXTL2 Vd.<Td>, Vn.<Ts>
ADDP Dd, Vn.2D

These take the signed integers in the first half of the source vector, lengthen them (e.g., converting two single words into two double words), and store them in the destination vector; do the same to the second half of the source vector; and add together the two double-word elements in the source vector, storing the result in the destination register.

If our Algorithm Selection Lab were to use SIMD instructions, it would probably use the MUL instruction, which can multiply the integers in two vectors. However, the second operand we need there is a floating-point number (that is, the volume scaling factor); vector multiplication can’t be performed on vectors with elements of different types, so we would need to convert the floating-point number to an integer in order to use it in a vector multiplication. This, I think, would be done with the FCVTxS instruction, which converts a floating-point vector to a signed integer vector of the same size. Then we could use the SXTL instructions to lengthen the half-word (16-bit) integer vector so that the elements are the same size as the integer vector containing the word (32-bit) integers converted from the floating point. Then we could finally use the MUL instruction to multiply them together. (I am not sure how the 32-bit integer converted from the floating point of value ranging from 0.000 to 1.000 would look, so I don’t know if there is the potential for overflowing past the 32-bit limit when we multiply, but since the other operand was initially only 16 bits it should be okay, I think).
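
One practical wrinkle with that idea: converting a factor between 0.000 and 1.000 straight to an integer would just yield 0 or 1, so the factor would have to be scaled up into fixed point first. A scalar sketch of that approach (my own, untested on the lab code):

#include <stdint.h>

/* Scale int16_t samples by a 0.0-1.0 factor using Q15 fixed point;
   the loop body is a pure integer multiply and shift, which the
   vectorizer can turn into vector MULs. */
void scale(int16_t *samples, int n, float factor) {
        int32_t q = (int32_t)(factor * 32768.0f);  /* Q15: 0.5 -> 16384 */
        int i;
        for (i = 0; i < n; i++)
                samples[i] = (int16_t)(((int32_t)samples[i] * q) >> 15);
}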

Analyzing the disassembly listing was pretty difficult, mainly because I am still at that stage where the assembler instructions don’t really mean anything to me in and of themselves, so, as with a foreign language of which you have only a tenuous grasp, I have to translate each instruction in my head, and sometimes reach for a dictionary (the ARMv8 Instruction Set Overview seems esoteric and scary at first glance, but it is very helpful). Reading a disassembly listing is like being lost in a strange land and getting directions from someone who speaks only that language, and this person insists on giving very, very detailed directions.
Another challenge was keeping track of what is stored in each register. Since most of the registers are general-purpose, and all you have to go on is the register’s number and the width of what’s inside, it’s easy to get them all mixed up. Kind of makes you appreciate the ability to give variables really long and descriptive names in higher-level programming languages.

Anyway, auto-vectorization is neat and it’s easy to see how SIMD could be very useful when processing operations.

Algorithm Selection Lab

Let’s examine the computational cost of performing lots of multiplication operations versus the computational cost of precomputing every possible result and getting the answers from a lookup table as needed. We’ll use each technique to scale a sequence of sound samples, and we’ll compare the completion times to each other on an x86_64 machine and on two AArch64 servers.

Some mathematical operations are more complex than others, and this translates into longer execution times for programs that use more costly operations. Multiplication, for instance, is more expensive than addition (on Intel CPUs, according to the reference manual, an IMUL operation is at least three times slower than an ADD operation). There is, then, potential for optimization in modifying the algorithms we use so that they depend upon less costly operations.

The algorithm we’ll try to optimize is a very simple one that can be used to adjust the volume of sound samples.

First, a little background. On digital machines, audio is usually represented through a process called pulse-code modulation, in which an analog audio signal is sampled at uniform intervals to produce a series of snapshots of the sound wave. These samples are then encoded as binary data, which can be saved as audio files (wav, mp3, flac, etc.).

[Diagram: a sound wave encoded into 4-bit PCM digital data]

These audio files are, at the most basic level, a collection of numbers representing the amplitude of the sound wave at a given time (it is a bit more complicated than that, and most audio files use complex algorithms for compression and such, but for our purposes now that’s all we need to know). If we were to multiply every one of those numbers by a scaling factor, we would effectively be changing the sound wave’s amplitude, and thus we would be changing its volume. That’s what we’ll be doing here.

We’ll start by using the sox utility to convert a WAV music file (which contains header information that specifies the sample rate, bit rate, encoding type, etc) into a raw audio file:

sox 03\ Over\ The\ Hills\ And\ Far\ Away.wav -t raw sound.raw

This produces a raw 16-bit signed integer PCM file, which is essentially just the sequence of 16-bit integers representing the samples. We can’t play the file as we could the WAV file because audio player software needs to know how to play it (that is, how many bits represent one sample, how many samples should be played per second, etc.). Our raw file, however, is encoded in the same way that the original WAV file is, and we can find this information by using the soxi command:

[andrei@localhost test]$ soxi 03\ Over\ The\ Hills\ And\ Far\ Away.wav 

Input File : '03 Over The Hills And Far Away.wav'
Channels : 2
Sample Rate : 44100
Precision : 16-bit
Duration : 00:04:50.06 = 12791808 samples = 21754.8 CDDA sectors
File Size : 51.2M
Bit Rate : 1.41M
Sample Encoding: 16-bit Signed Integer PCM

Now we can use the play command (which also comes with sox) with these arguments to play the raw sound file:

[andrei@localhost test]$ play -b 16 -r 44100 -c 2 -e signed-integer sound.raw

You can’t hear it so you’ll have to trust me when I say that it sounds exactly the same as the original file.

Now let’s take that sound.raw file and try to reduce its volume by half. I wrote the following program:


#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <stdlib.h>

#define MAX 65536
#define HALF 32768 
#define OUTPUT "output.raw"

int16_t adjustVolume(int16_t sample, float factor) {
 return (int16_t)(sample * factor);
}

int16_t adjustVolume2(int16_t sample, float factor) {
 static int16_t* lookupTable = NULL;
 static float oldFactor = 0;
 if (factor != oldFactor) {
  clock_t t = clock();
  oldFactor = factor;
  if (lookupTable) {
   free(lookupTable);
  }
  lookupTable = (int16_t*)malloc(MAX*sizeof(int16_t));
  int i;
  for (i=0; i<MAX; i++) {
    // Offset i by half the maximum value of an unsigned int16_t to fill the table with all the possible signed int16_t values
    lookupTable[i] = (int16_t)((i-HALF) * factor);
  }
  t = clock() - t;
  printf("Took %d cycles (%f seconds) to create the lookup table.\n", t, (float)t/CLOCKS_PER_SEC);
  }
 return lookupTable[sample + HALF];
}

int main(int argc, char *argv[]) {
 int i;
 int j;
 clock_t t;

 // Pointers to input and output files
 FILE *fp;
 FILE *fp2;

 // Array to contain all the samples
 int16_t *samples;

 // The size of the input file
 long size;
 // Number of sound samples in the input file
 int numSamples;

 float volumeFactor;

 fp = fopen(argv[1], "r");
 fp2 = fopen(OUTPUT, "w");

 // Find the size (in bytes) of the input file
 fseek(fp, 0L, SEEK_END);
 size = ftell(fp);
 fseek(fp, 0L, SEEK_SET);

 // Divide the file size into chunks of int16_t
 numSamples = size / sizeof(int16_t);

 // Read all the samples from the file into memory
 samples = (int16_t*)malloc(numSamples*sizeof(int16_t));
 fread(samples, sizeof(int16_t), numSamples, fp);

 volumeFactor = 0.5;

 printf("Number of samples from sound file: %d\n", numSamples);
 t = clock();
 for (i=0; i<numSamples; i++) {
  samples[i] = adjustVolume(samples[i], volumeFactor);
  // samples[i] = adjustVolume2(samples[i], volumeFactor);
 }
 t = clock() - t;

 printf("%d cycles (%f seconds)\n", t, (float)t/CLOCKS_PER_SEC);

 // Write the adjusted samples to the output file
 fwrite(samples, sizeof(int16_t), numSamples, fp2);

 free(samples);

 fclose(fp);
 fclose(fp2);

 return 0;
}

There are two functions for adjusting the volume. Only the first one is called during the program’s execution. I wrote a second program in which only the second function is called. We will see how quickly each program executes on different architectures.

The first function just takes in a 16-bit signed integer (which would be a single sound sample taken from the file) and a floating point value (which is the scaling factor by which the sample should be multiplied) and returns the product.

The second function, when called for the first time, allocates memory for 65536 16-bit integers and fills it with an array (which will be our lookup table) containing every possible product of a 16-bit signed integer (from -32768 to 32767) and the given scaling factor parameter (I should point out here that the program has no error checking, and that the floating point parameter should be between 0.0 and 1.0 or big trouble may happen). Then, on subsequent calls, the function, instead of performing the multiplication, will simply look up the value in the lookup table.

Each program will open the sound.raw file (the filename of which is to be supplied as a command-line parameter) and separate it into its component 16-bit integers, which will be stored in an array. Then, the program will iterate over the entire array and call the adjustVolume function on each 16-bit int, and finally it will write the adjusted set of integers to a newly-created output.raw file.

I am using the clock() function from time.h to measure the number of processor cycles and the time the program takes to adjust the volume of each sample.

So, we’ll be using here three programs:

first.c, whose source is posted above and which does the multiplication every time the adjust volume function is called;

second.c (which is just first.c with line 74 commented out and line 75 uncommented), which creates a lookup table for a new scaling factor and pulls the result from it when the adjust volume function is called;

and none.c, which just returns the sample without doing anything to it (this should give us an idea of the time it would take just to go through each sample without performing any operation).

We’ll compile each source file with these options:

gcc -g -O0 first.c -o first
gcc -g -O0 second.c -o second
gcc -g -O0 none.c -o none

The -O0 flag is the important one. It tells the compiler to perform only minimal optimizations when converting the source into machine code. This should ensure that the compiler doesn’t work any of its serious optimizing magic that might compromise our attempts to evaluate the algorithms’ efficiency. (Minor optimizations are still done even with the -O0 option, but they shouldn’t alter our program’s logic or run time in any significant way. Or at least I hope not.)

Let’s run each program on each system (my x86_64 computer and the aarchie and betty AArch64 servers) and compare results:

x86_64:

[andrei@localhost Lab5]$ ./first sound.raw
Number of samples from sound file: 25583616
125165 cycles (0.125165 seconds)
[andrei@localhost Lab5]$ ./second sound.raw
Creating lookup table...
Took 240 cycles (0.000240 seconds) to create the lookup table.
Number of samples from sound file: 25583616
146665 cycles (0.146665 seconds)
[andrei@localhost Lab5]$ ./none sound.raw
Number of samples from sound file: 25583616
112892 cycles (0.112892 seconds)

AArch64 (aarchie):

[atopala@aarchie Lab5]$ ./first sound.raw
Number of samples from sound file: 25583616
6070000 cycles (6.070000 seconds)
[atopala@aarchie Lab5]$ ./second sound.raw
Creating lookup table...
Took 20000 cycles (0.020000 seconds) to create the lookup table.
Number of samples from sound file: 25583616
2750000 cycles (2.750000 seconds)
[atopala@aarchie Lab5]$ ./none sound.raw
Number of samples from sound file: 25583616
1560000 cycles (1.560000 seconds)

AArch64 (betty):

[atopala@betty Lab5]$ ./first sound.raw
Number of samples from sound file: 25583616
520000 cycles (0.520000 seconds)
[atopala@betty Lab5]$ ./second sound.raw
Creating lookup table...
Took 0 cycles (0.000000 seconds) to create the lookup table.
Number of samples from sound file: 25583616
290000 cycles (0.290000 seconds)
[atopala@betty Lab5]$ ./none sound.raw
Number of samples from sound file: 25583616
220000 cycles (0.220000 seconds)

All programs produced an output.raw file that is just the sound.raw audio at half volume.

What we first notice, I think, is that aarchie and betty give us very round numbers. The number of cycles the clock() function returns seems to be rounded (or counted?) to the nearest 10,000. Since CLOCKS_PER_SEC is one million, a granularity of 10,000 ticks would correspond to a 10ms timer interval. I am not sure if this is a characteristic of AArch64 processors in general, or just the processors of aarchie and betty … or it may be something to do with the gcc compiler on aarchie and betty.
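One way to check would be to spin until the value returned by clock() changes and print the difference, which reveals the tick granularity directly. A quick sketch (I did not run this at the time, so treat it as a suggestion rather than a result):

#include <stdio.h>
#include <time.h>

int main(void) {
 clock_t a, b;
 a = clock();
 while ((b = clock()) == a)
  ; /* busy-wait until the reported processor time advances */
 printf("clock() granularity: %ld ticks\n", (long)(b - a));
 return 0;
}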

Anyway, I ran the programs on each system multiple times, and the results were relatively constant (on x86_64, the results varied by a few milliseconds, but a program never jumped ahead of or fell behind the others; on AArch64, the results were almost always the same, only very rarely varying by 10,000 cycles), so these, I think, are good results to analyze.

On x86_64, the lookup table method was actually slower than the multiplication, which I did not expect. I did some research and found that multiplication, even though it’s slower than addition, is still one of the faster processor operations. Part of the lookup cost is calculating the array index: since the range of our array is 0 to 65535 and the range of our 16-bit signed sample is -32768 to 32767, we have to add 32768 to the sample every time to find its corresponding index. The rest is the memory access itself; array access, it seems, is not much faster than multiplication on modern processors (apparently this was not the case 20 years ago), so fetching the product from memory can end up slower than simply multiplying the numbers in the processor directly.

On AArch64, the lookup table method was significantly faster. I am not sure whether this is an architectural difference between x86_64 and AArch64 or whether aarchie and betty just have slower processors and faster memory. Regardless, the takeaway here is that the best approach for a problem depends on the situation and on the architecture.

(It also startled me to see just how much faster betty is than aarchie … Just to be sure the discrepancy wasn’t due to something in my program, I used the time command to compare how long each system took to copy a 50MB file (the very same sound.raw file we used in the programs), and betty was still several times faster than aarchie … At this rate veronica will be a supercomputer …)

In conclusion, the lookup table approach may be better than the regular approach depending on the architecture and on the system. Furthermore, a lookup table is more useful when it replaces the repeated execution of more expensive operations; if we were doing divisions or other operations much more costly than multiplication, the lookup table approach would probably be faster on most systems. Also, if the data in the lookup table is laid out so that memory is accessed sequentially, it might be faster still. Finally, a lookup table is more useful if it is small enough to fit in the cache, which takes less time to access than main memory; our table of 65536 16-bit values is 128 KiB, which is already larger than a typical 32 KiB L1 data cache.

Compiled C Lab

So it was that we wrote a simple Hello World program. And it was good. Then we compiled it:

gcc -g -O0 -fno-builtin hello.c -o hello

And it did compile. Now, the -g enables debugging information, the -O0 tells the compiler not to optimize (higher numbers would have optimized the program: -O1 does basic optimizations, -O2 all the safe optimizations, and -O3 aggressive optimization), and the -fno-builtin tells it not to use builtin function optimizations. So we got an unoptimized binary ELF file with debugging information. Let’s examine it!

We were told to use objdump with the -f, -s, -d, or --source options in order to examine the file. We did. It was a good file.

Now, we were to use objdump to examine the changes that would occur for the following:

  1. Adding the compiler option -static:
    The resulting binary file was much larger than the original, because the -static option causes the code of the libraries the program needs to be copied into the binary itself instead of being linked dynamically at run time.
  2. Removing the compiler option -fno-builtin:
    Most noticeable was that calls to the printf function in the ELF file were instead calls to the puts function. This is a builtin function optimization based on the fact that the puts function is faster than printf: when printf is supposed just to output a string without any format specifiers, the compiler calls puts instead, which skips past the whole hassle of formatting the output string and just puts it there. (There is a short example after this list.)
  3. Removing the compiler option -g:
    Without debugging information, the resulting binary file was significantly smaller. The ELF did not have the .debug_ sections that the first file had.
  4. Adding additional arguments to the printf() function:
    The first few arguments are stored in registers while the rest are stored on the stack.
  5. Moving the printf() call to a separate function called output():
    We saw that the <main> section was smaller and had only the call to output. The <output> section was identical to the original <main> section; that is, it stored the “Hello World” string in memory and called the printf() function.
  6. Removing -O0 and adding -O3:
    The binary file was even larger; we are optimizing for speed, then, not size. The disassembled code was quite different, as the compiler performed several optimizations to make the code execute faster. For example, the unoptimized ELF uses mov $0x0,%eax to clear the register %eax, while the optimized file does xor %eax,%eax.
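To illustrate item 2, here is a minimal sketch (an assumed reconstruction, not necessarily the exact lab file) of the kind of program where the substitution happens:

#include <stdio.h>

int main(void) {
 /* No format specifiers and a trailing newline: compiled without
    -fno-builtin, gcc typically replaces this printf call with a call
    to puts, which shows up as puts@plt in the objdump -d output. */
 printf("Hello World!\n");
 return 0;
}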

Assembler Lab

There are three servers that we will use for the course. The first two, aarchie and betty (soon also jughead?), are ARMv8 AArch64 systems. The third, xerxes (the Great?), is an x86_64 server. In groups, we examined the result of compiling C versions and assembly language versions of a “Hello World” program on each of the two architectures.

We used objdump -d hello to examine the instructions of the binary files created from the C source. Both versions of the program, the one compiled on x86_64 and the one compiled on aarch64, were fairly similar. There was a large amount of code, and the <main> sections, with which we were concerned, were about the same. The mnemonics were different, and so were some of the instructions, but the gist of it was the same: the message was stored at a location in memory, followed by a call or branch to <printf@plt>.

For the assembler source, we compiled the files on both architectures, as before, and examined them with objdump -d hello. Instead of a whole bunch of code, like we had with the compiled C programs, we saw only a very small amount. aarchie’s file had the format elf64-littleaarch64, while xerxes’ had elf64-x86-64. Then there was a disassembly of section .text; each file had its own instructions.

Now, having examined the structure of the assembler code, we were given the task of creating a simple loop that would output the word “Loop” and each number from 1 to 9 (i.e. “Loop: 1,” “Loop: 2,” etc.). Printing the word was simple, and printing the number was simple, but where our group ran into trouble was printing the word and the number together. We had the string in one register and the ASCII number in another, and we were trying to combine the two. Our teacher, however, told us that we could simply write the ASCII number to the location in memory where the string was stored, plus the proper offset. Once set on the right track, we finished the task without trouble.

Looping it up to 30 was also relatively simple. To get the digits, we copied the iterator to a new register and divided it by 10: the quotient is the tens digit and the remainder is the ones digit. To suppress the tens digit for the first nine numbers, we compared it to 0 and, if it was 0, jumped over the part of the code that would have printed it. The sketch below shows this logic.
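Since pasting our full assembler here would be a long detour, here is the same digit logic sketched in C (the buffer layout and offsets are assumptions; our actual string may have differed):

#include <stdio.h>

int main(void) {
 char buf[] = "Loop:   \n"; /* positions 6 and 7 hold the digit cells */
 int i;
 for (i = 1; i <= 30; i++) {
  int tens = i / 10;                /* in assembler: copy i to a scratch register, divide by 10 */
  int ones = i % 10;                /* the remainder is the ones digit */
  buf[6] = tens ? '0' + tens : ' '; /* suppress the tens digit when it is zero */
  buf[7] = '0' + ones;              /* write the ASCII code directly into the string */
  fputs(buf, stdout);
 }
 return 0;
}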

Writing in assembler is, I feel, more difficult than writing in higher-level programming languages because of the need to manipulate and keep track of the registers, and because we are dealing with specific locations in memory. Something as simple as printing a counting loop, which would be trivial in most high-level programming languages, takes a decent amount of work and requires attention.

The precision and the level of control the programmer has, however, seems like a serious advantage, and I am sure there are techniques for optimization that cannot be done in a high-level language, and that require the use of assembler code.

 

Code Building Lab

Now we’ll try to build two packages that have different licenses.

Let’s take a look at the process of building a simple software package from the Free Software Foundation’s GNU Project. Going through the blurbs for all the official GNU packages, I come across something entitled “Gcal,” a program that can calculate and print calendars. Our current operating system, Fedora 23, does come with the standard cal program available on most Unix machines, but gcal has several extended features, including the ability to calculate astronomical data such as the phases of the moon, which is information that can be used to plan the harvest season and to perform magick rituals—a must for any computer programmer. Let’s go ahead and download the gcal package.

In the downloads directory, we now have a file labeled gcal-4.tar.gz. The .tar extension indicates that it is a tarfile (“tar” being short for tape archive), that is, a group of files bundled together with the tar command; and the .gz extension indicates that it has been compressed with the gzip command. To build the package, first we must unpack the archive using the tar command. Let’s run the following:

tar -xzf gcal-4.tar.gz

The x option tells the tar command that we need to extract something, the z option tells it that we need to filter the archive through gzip, and the f option tells it that we are using a file, specifically the gcal-4.tar.gz file. Running the command creates and populates a new folder by the name of gcal-4. (This name, by the way, follows the naming convention common to most GNU packages: package_name-version.)
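For clarity, the same command written with GNU tar’s long options would be:

tar --extract --gzip --file gcal-4.tar.gz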

Inside the folder, there is an INSTALL file which tells us that the ‘configure’ shell script will try to find correct values for system-dependent variables that are to be used during compilation, and will create a ‘Makefile’ which we may then use to compile the package. Let’s go ahead and run the script.

./configure

Alas, we are accosted by an error:

configure: error: no acceptable C compiler found in $PATH

It seems our newly-installed operating system is missing a C compiler. It’s simple enough to install one:

sudo dnf install gcc

GCC is the GNU Compiler Collection; it should give us what we need. Let’s run the configure script again:

./configure

A long list of statements cascades across the terminal, each announcing that it has checked something. A few seconds pass and we are told that all the checks are done. We find now in the same folder a Makefile. We’ll invoke the make command, which will use the makefile to convert the source code of the package into executable binaries:

make

Another list of statements informs us of the steps taken to build the package, and at the end of it we find that there is a new gcal file inside the src directory. Let’s try to run it.

./gcal

[screenshot: gcal printing the calendar]

The program performs as advertised!

Unfortunately, the documentation for gcal is a little overwhelming at the moment, so I won’t explore all of its functions in this post, but here’s a list of this year’s astronomical events (using GMT; this can be configured):

[andrei@localhost src]$ ./gcal -n --astronomical-holidays
Eternal holiday list: The year 2016 is A leap year
New Moon 01:30 (Ast) - Sun, Jan 10th 2016 = -19 days
Waxing Half Moon 23:26 (Ast) - Sat, Jan 16th 2016 = -13 days
Full Moon 01:46 (Ast) - Sun, Jan 24th 2016 = -5 days
Waning Half Moon 03:28 (Ast) - Mon, Feb 1st 2016 = +3 days
New Moon 14:39 (Ast) - Mon, Feb 8th 2016 = +10 days
Waxing Half Moon 07:46 (Ast) - Mon, Feb 15th 2016 = +17 days
Full Moon 18:20 (Ast) - Mon, Feb 22nd 2016 = +24 days
Waning Half Moon 23:11 (Ast) - Tue, Mar 1st 2016 = +32 days
New Moon 01:54 (Ast) - Wed, Mar 9th 2016 = +40 days
Solar Eclipse/Total 01:57 (Ast) - Wed, Mar 9th 2016 = +40 days
Waxing Half Moon 17:03 (Ast) - Tue, Mar 15th 2016 = +46 days
Equinox Day 04:30 (Ast) - Sun, Mar 20th 2016 = +51 days
Full Moon 12:01 (Ast) - Wed, Mar 23rd 2016 = +54 days
Lunar Eclipse/Penumbral 11:47 (Ast) - Wed, Mar 23rd 2016 = +54 days

And then, inside the data/dates directory, we find some more important information:

[screenshot: the contents of the data/dates directory]

Let’s try to build something a little more complicated now. We’ll try to build NetHack, an ASCII dungeon-crawling game (which, coincidentally, just last month released its first major update in twelve years). NetHack is a popular Roguelike (meaning like Rogue) game that we can run in our terminal.

We download the file nethack-360-src.tgz from the official site. The .tgz extension is just a short form of the .tar.gz extension we dealt with earlier. We extract the tarball using the same command.

tar xzf nethack-360-src.tgz

Inside the folder in which the archive has been extracted, there is a README file, which directs us to the sys/unix/Install.unx file, which then directs us to the newer sys/unix/NewInstall.unx. There, we learn that the sys/unix/hints directory contains a file which will help us configure the package. From the sys/unix directory, we run the command

sh setup.sh hints/linux

and everything seems to be okay. Nothing is printed to the terminal, which is usually a good sign. Next, we’re to go to the top directory and run the command

make all

Note that we’re doing this without having installed any of the package’s dependencies. We’ll see where it gets us and we’ll deal with trouble as it comes.

Several files seem to compile properly, but then we’re met with the following error:

../win/tty/termcap.c:852:20: fatal error: curses.h: No such file or directory
compilation terminated.

The curses.h file can be found in the ncurses-devel package, which is a library containing developer’s tools for ncurses, which, according to the GNU website, “is software for controlling writing to the console screen under Unix, Linux and other operating systems.” Let’s install ncurses-devel:

sudo dnf install ncurses-devel

Now, if we run sudo find / -name curses.h (sudo is needed in order to traverse directories forbidden to us), we see that we now have multiple curses.h files:

/usr/include/ncursesw/curses.h
/usr/include/curses.h
/usr/include/ncurses/curses.h

Let’s try to compile NetHack again. The NewInstall.unx file told us to remove all the generated files before retrying the build, so from the top-level directory (i.e. the one into which the tarball had been extracted) we run this command:

make spotless

And then:

make all

Another error:

yacc -d dgn_comp.y
make[1]: yacc: Command not found
Makefile:327: recipe for target 'dgn_yacc.c' failed
make[1]: *** [dgn_yacc.c] Error 127
make[1]: Leaving directory '/home/andrei/Downloads/nethack-3.6.0/util'
Makefile:191: recipe for target 'dungeon' failed
make: *** [dungeon] Error 2

Yacc is a parser generator which, according to the Install.unx file, is being used here to produce dungeon and level compilers. A GNU tool called Bison is compatible with Yacc, and can do the same thing. Let’s install Bison:

sudo dnf install bison

We’ll also need the program flex, which is a lexical analyzer that can tokenize input data to be used by Bison. (There is a lot more to yacc and lex (and Bison and flex), but for the purposes of building the NetHack package, this is all we need to know.)

sudo dnf install flex

We must also edit the sys/unix/Makefile.utl file:

# yacc/lex programs to use to generate *_comp.h, *_lex.c, and *_yacc.c.
# if, instead of yacc/lex you have bison/flex, comment/uncomment the following.
YACC = yacc
LEX = lex
# YACC = bison -y
# YACC = byacc
# LEX = flex

Let’s comment out the yacc and lex lines and uncomment the bison -y and flex lines.
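After the edit, the relevant lines should read:

# YACC = yacc
# LEX = lex
YACC = bison -y
# YACC = byacc
LEX = flex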

Now, since we’ve edited Makefile.utl, we need to run the setup script again. In sys/unix:

sh setup.sh hints/linux

And then, in the top directory:

make all

This time, there are no errors; we are told that everything is done. Indeed, if we look inside the src directory, we’ll find a new executable file entitled nethack. Let’s run it.

./nethack
/home/andrei/nh/install/games/lib/nethackdir: No such file or directory
Cannot chdir to /home/andrei/nh/install/games/lib/nethackdir.

It seems it wants to be installed. Let’s run the following command to see what would happen if we installed it:

make -n install

This prints all the steps that make install would go through, but it doesn’t actually execute them. It appears that the installation would be wholly contained inside the ~/nh/install/games/ directory. I spent a few minutes creating that directory and copying the necessary files into it by hand, and I was able to run the program, but there’s really no need to do that—nor do I think that anyone would be interested in the (very straightforward) process. Let’s just go ahead and install the package:

make install

Now, in the ~/nh/install/games directory there is an executable file called nethack. Let’s go there and run it:

./nethack

[screenshot: NetHack starting up]

It works!

[screenshot: NetHack gameplay]

It is a hard game.