Algorithm Selection Lab

Let’s examine the computational cost of performing lots of multiplication operations versus the computational cost of precomputing every possible result and getting the answers from a lookup table as needed. We’ll use each technique to scale a sequence of sound samples, and we’ll compare the completion times to each other on an x86_64 machine and on two AArch64 servers.

Some mathematical operations are more complex than others, and this translates into longer execution times for programs that use more costly operations. Multiplication, for instance, is more expensive than addition (on Intel CPUs, according to the reference manual, an IMUL operation is at least three times slower than an ADD operation). There is, then, potential for optimization in modifying the algorithms we use so that they depend upon less costly operations.

The algorithm we’ll try to optimize is a very simple one that can be used to adjust the volume of sound samples.

First, a little background. On digital machines, audio is usually represented through a process called pulse-code modulation, in which an analog audio signal is sampled at uniform intervals to produce a series of snapshots of the sound wave. These samples are then encoded as binary data, which can be saved as audio files (wav, mp3, flac, etc.).

500px-4-bit-linear-pcm-svg
A sound wave encoded into 4-bit PCM digital data.

These audio files are, at the most basic level, a collection of numbers representing the amplitude of the sound wave at a given time (it is a bit more complicated than that, and most audio files use complex algorithms for compression and such, but for our purposes now that’s all we need to know). If we were to multiply every one of those numbers by a scaling factor, we would effectively be changing the sound wave’s amplitude, and thus we would be changing its volume. That’s what we’ll be doing here.

We’ll start by using the sox utility to convert a WAV music file (which contains header information that specifies the sample rate, bit rate, encoding type, etc) into a raw audio file:

sox 03\ Over\ The\ Hills\ And\ Far\ Away.wav -t raw sound.raw

This produces a raw 16-bit signed integer PCM file, which is essentially just the sequence of 16-bit integers representing the samples. We can’t play the file as we could the WAV file because audio player software needs to know how to play it (that is, how many bits represent one sample, how many samples should be played per second, etc.). Our raw file, however, is encoded in the same way that the original WAV file is, and we can find this information by using the soxi command:

[andrei@localhost test]$ soxi 03\ Over\ The\ Hills\ And\ Far\ Away.wav 

Input File : '03 Over The Hills And Far Away.wav'
Channels : 2
Sample Rate : 44100
Precision : 16-bit
Duration : 00:04:50.06 = 12791808 samples = 21754.8 CDDA sectors
File Size : 51.2M
Bit Rate : 1.41M
Sample Encoding: 16-bit Signed Integer PCM

Now we can use the play command (which also comes with sox) with these arguments to play the raw sound file:

[andrei@localhost test]$ play -b 16 -r 44100 -c 2 -e signed-integer sound.raw

You can’t hear it so you’ll have to trust me when I say that it sounds exactly the same as the original file.

Now let’s take that sound.raw file and try to reduce its volume by half. I wrote the following program:


#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <stdlib.h>

#define MAX 65536
#define HALF 32768 
#define OUTPUT "output.raw"

int16_t adjustVolume(int16_t sample, float factor) {
 return (int16_t)(sample * factor);
}

int16_t adjustVolume2(int16_t sample, float factor) {
 static int16_t* lookupTable = NULL;
 static float oldFactor = 0;
 if (factor != oldFactor) {
  clock_t t = clock();
  oldFactor = factor;
  if (lookupTable) {
   free(lookupTable);
  }
  lookupTable = (int16_t*)malloc(MAX*sizeof(int16_t));
  int i;
  for (i=0; i<MAX; i++) {
    // Offset i by half the maximum value of an unsigned int16_t to fill the table with all the possible signed int16_t values
    lookupTable[i] = (int16_t)((i-HALF) * factor);
  }
  t = clock() - t;
  printf("Took %d cycles (%f seconds) to create the lookup table.\n", t, (float)t/CLOCKS_PER_SEC);
  }
 return lookupTable[sample + HALF];
}

int main(int argc, char *argv[]) {
 int i;
 int j;
 clock_t t;

 // Pointers to input and output files
 FILE *fp;
 FILE *fp2;

 // Array to contain all the samples
 int16_t *samples;

 // The size of the imput file
 long size;
 // Number of sound samples in the input file
 int numSamples;

 float volumeFactor;

 fp = fopen(argv[1], "r");
 fp2 = fopen(OUTPUT, "w");

 // Find the size (in bytes) of the input file
 fseek(fp, 0L, SEEK_END);
 size = ftell(fp);
 fseek(fp, 0L, SEEK_SET);

 // Divide the file size into chunks of int16_t
 numSamples = size / sizeof(int16_t);

 // Read all the samples from the file into memory
 samples = (int16_t*)malloc(numSamples*sizeof(int16_t));
 fread(samples, sizeof(int16_t), numSamples, fp);

 volumeFactor = 0.5;

 printf("Number of samples from sound file: %d\n", numSamples);
 t = clock();
 for (i=0; i<numSamples; i++) {
  samples[i] = adjustVolume(samples[i], volumeFactor);
  // samples[i] = adjustVolume2(samples[i], volumeFactor);
 }
 t = clock() - t;

 printf("%d cycles (%f seconds)\n", t, (float)t/CLOCKS_PER_SEC);

 // Write the adjusted samples to the output file
 fwrite(samples, sizeof(int16_t), numSamples, fp2);

 free(samples);

 fclose(fp);
 fclose(fp2);

 return 0;
}

There are two functions for adjusting the volume. Only the first one is called during the program’s execution. I wrote a second program in which only the second function is called. We will see how quickly each program executes on different architectures.

The first function just takes in a 16-bit signed integer (which would be a single sound sample taken from the file) and a floating point value (which is the scaling factor by which the sample should be multiplied) and returns the product.

The second function, when called for the first time, allocates memory for 65536 16-bit integers and fills it with an array (which will be our lookup table) containing every possible product of a 16-bit signed integer (from -32768 to 32767) and the given scaling factor parameter (I should point out here that the program has no error checking, and that the floating point parameter should be between 0.0 and 1.0 or big trouble may happen). Then, on subsequent calls, the function, instead of performing the multiplication, will simply look up the value in the lookup table.

Each program will open the sound.raw file (the filename of which is to be supplied as a command-line parameter) and separate it into its component 16-bit integers, which will be stored in an array. Then, the program will iterate over the entire array and call the adjustVolume function on each 16-bit int, and finally it will write the adjusted set of integers to a newly-created output.raw file.

I am using the clock() function from time.h to measure the number of processor cycles and the time the program takes to adjust the volume of each sample.

So, we’ll be using here three programs:

first.c, whose source is posted above and which does the multiplication every time the adjust volume function is called;

second.c (which is just first.c with line 74 commented out and line 75 uncommented), which creates a lookup table for a new scaling factor and pulls the result from it when the adjust volume function is called;

and none.c, which just returns the sample without doing anything to it (this should give us an idea of the time it would take just to go through each sample without performing any operation).

We’ll compile each source file with these options:

gcc -g -O0 first.c -o first
gcc -g -O0 second.c -o second
gcc -g -O0 none.c -o none

The -O0 flag is the important one. It tells the compiler to perform only minimal optimizations when converting the source into machine code. This should ensure that the compiler doesn’t work any of its serious optimizing magic that may compromise our attempts to evaluate the algorithms’ efficiency (minor optimizations are still done even with the -O0 option, but they shouldn’t alter our program’s logic and run time in any significant way (or at least I hope not)).

Let’s run each program on each system (my x86_64 computer and the aarchie and betty AArch64 servers) and compare results:

x86_64:

[andrei@localhost Lab5]$ ./first sound.raw
Number of samples from sound file: 25583616
125165 cycles (0.125165 seconds)
[andrei@localhost Lab5]$ ./second sound.raw
Creating lookup table...
Took 240 cycles (0.000240 seconds) to create the lookup table.
Number of samples from sound file: 25583616
146665 cycles (0.146665 seconds)
[andrei@localhost Lab5]$ ./none sound.raw
Number of samples from sound file: 25583616
112892 cycles (0.112892 seconds)

AArch64 (aarchie):

[atopala@aarchie Lab5]$ ./first sound.raw
Number of samples from sound file: 25583616
6070000 cycles (6.070000 seconds)
[atopala@aarchie Lab5]$ ./second sound.raw
Creating lookup table...
Took 20000 cycles (0.020000 seconds) to create the lookup table.
Number of samples from sound file: 25583616
2750000 cycles (2.750000 seconds)
[atopala@aarchie Lab5]$ ./none sound.raw
Number of samples from sound file: 25583616
1560000 cycles (1.560000 seconds)

AArch64 (betty):

[atopala@betty Lab5]$ ./first sound.raw
Number of samples from sound file: 25583616
520000 cycles (0.520000 seconds)
[atopala@betty Lab5]$ ./second sound.raw
Creating lookup table...
Took 0 cycles (0.000000 seconds) to create the lookup table.
Number of samples from sound file: 25583616
290000 cycles (0.290000 seconds)
[atopala@betty Lab5]$ ./none sound.raw
Number of samples from sound file: 25583616
220000 cycles (0.220000 seconds)

All programs produced an output.raw file that is just the sound.raw audio at half volume.

What we first notice, I think, is that aarchie and betty give us very round numbers. I think that the number of cycles the clock() function returns seems to be rounded (or counted?) to the nearest 10,000th. I am not sure if this is a characteristic of AArch64 processors in general, or just the processors of aarchie and betty … or it may be something to do with the gcc compiler on aarchie and betty.

Anyway, I ran the programs on each system multiple times, and the results were relatively constant (on x86_64, the results varied by a few milliseconds, but a program never jumped ahead of or fell behind the others; on AArch64, the results were almost always the same, only very rarely varying by 10,000 cycles), so these, I think, are good results to analyze.

On x86_64, the lookup table method was actually slower than the multiplication, which I did not expect. I did some research and found that multiplication, even though it’s slower than addition, is still one of the faster processor operations. Maybe having to calculate the array index (since the range of our array is 0 to 65536 and the range of our 16-bit signed integer sample is -32768 to 32767, we had to add 32768 to the sample every time to find its corresponding array index) and then also access its data (array access, it seems, is not much faster than multiplication on modern processors nowadays; apparently this was not the case 20 years ago) took enough time that the function fell behind the multiplication method; or maybe having to access memory to get to the array is slower than simply multiplying the numbers on the processor directly.

On AArch64, the lookup table method was significantly faster. I am not sure whether this is an architectural difference between x86_64 and AArch64 or whether aarchie and betty just have slower processors and faster memory. Regardless, the takeaway here is that the best approach for a problem depends on the situation and on the architecture.

(It also startled me to see just how much faster betty is than aarchie … Just to be sure the discrepancy wasn’t due to something in my program, I used the time command to compare how long each system took to copy a 50MB file (the very same sound.raw file we used in the programs), and betty was still several times faster than aarchie … At this rate veronica will be a supercomputer …)

In conclusion, the lookup table approach may be better than the regular approach depending on the architecture and on the system. Furthermore, a lookup table will be more useful if it is replacing the repeated execution of more expensive operations (if we were doing divisions or other such operations, which are much more costly than multiplication, the lookup table approach would probably be faster on most systems). Also, if the data in the lookup table is distributed in such a way that the memory will be accessed linearly, it might also be faster. A lookup table may be more useful if it is smaller, so that it can fit in the cache, which takes less time to access than memory.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s