GCC Compiler Options: -funsafe-math-optimizations and -ftracer

GCC provides several flags that can be set to guide the optimization of a file during compilation.

Let’s look at two of them:

-funsafe-math-optimizations

The gcc manual says that this option “allows optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards.”

Essentially, this means that the compiler will take certain shortcuts when calculating the results of floating-point operations, which may introduce rounding errors or even cause the program to malfunction (hence the “unsafe” in the name).

-funsafe-math-optimizations enables the compiler options -fno-signed-zeros (which ignores the sign of zeroes), -fno-trapping-math (which assumes that the code won’t generate user-visible traps such as division by zero), -fassociative-math (which allows re-association of operands in floating-point expressions) and -freciprocal-math (which allows division to be replaced with multiplication by the reciprocal).

To test this, we’ll use a simple program that adds some fractions together lots of times and prints the result:

#include <stdio.h>
#include <stdlib.h>

int main() {
        double sum = 0.0;
        double d = 0.0;

        int i;

        for (i=1; i<=10000000; i++) {
                d = (float)i/3 + (float)i/7;
                sum += d;
        }

        printf("%.4f\n", sum);

        return 0;
}

We’ll compile it with the -funsafe-math-optimizations flag and without it:

[atopala@aarchie presentation]$ gcc -O0 test.c -o test
[atopala@aarchie presentation]$ gcc -O0 -funsafe-math-optimizations test.c -o test-math

I used the time command to compare the execution times of each compiled binary. Here is a screenshot of the results:

[Screenshot: time output for the test and test-math binaries]

The binary compiled with the -funsafe-math-optimizations flag was much faster, but its result was off by quite a bit.

Let’s use objdump -d to examine the disassembly of each binary file.

[atopala@aarchie presentation]$ objdump -d test
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       580006c0        ldr     x0, 4006b0 <main+0xe0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000680        ldr     x0, 4006b0 <main+0xe0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000023        b       40067c <main+0xac>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180005c0        ldr     w0, 4006b8 <main+0xe8>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e201821        fdiv    s1, s1, s0
  400610:       1e260021        fmov    w1, s1
  400614:       b94027a0        ldr     w0, [x29,#36]
  400618:       1e220001        scvtf   s1, w0
  40061c:       1e260022        fmov    w2, s1
  400620:       180004e0        ldr     w0, 4006bc <main+0xec>
  400624:       1e270040        fmov    s0, w2
  400628:       1e270001        fmov    s1, w0
  40062c:       1e211800        fdiv    s0, s0, s1
  400630:       1e260000        fmov    w0, s0
  400634:       1e270020        fmov    s0, w1
  400638:       1e270001        fmov    s1, w0
  40063c:       1e212800        fadd    s0, s0, s1
  400640:       1e260000        fmov    w0, s0
  400644:       1e270000        fmov    s0, w0
  400648:       1e22c000        fcvt    d0, s0
  40064c:       9e660000        fmov    x0, d0
  400650:       f9000fa0        str     x0, [x29,#24]
  400654:       f94017a1        ldr     x1, [x29,#40]
  400658:       f9400fa0        ldr     x0, [x29,#24]
  40065c:       9e670021        fmov    d1, x1
  400660:       9e670000        fmov    d0, x0
  400664:       1e602821        fadd    d1, d1, d0
  400668:       9e660020        fmov    x0, d1
  40066c:       f90017a0        str     x0, [x29,#40]
  400670:       b94027a0        ldr     w0, [x29,#36]
  400674:       11000400        add     w0, w0, #0x1
  400678:       b90027a0        str     w0, [x29,#36]
  40067c:       b94027a1        ldr     w1, [x29,#36]
  400680:       5292d000        mov     w0, #0x9680                     // #38528
  400684:       72a01300        movk    w0, #0x98, lsl #16
  400688:       6b00003f        cmp     w1, w0
  40068c:       54fffb4d        b.le    4005f4 <main+0x24>
  400690:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400694:       911d8000        add     x0, x0, #0x760
  400698:       fd4017a0        ldr     d0, [x29,#40]
  40069c:       97ffff71        bl      400460 <printf@plt>
  4006a0:       52800000        mov     w0, #0x0                        // #0
  4006a4:       a8c37bfd        ldp     x29, x30, [sp],#48
  4006a8:       d65f03c0        ret
  4006ac:       d503201f        nop
        ...
  4006b8:       40400000        .word   0x40400000
  4006bc:       40e00000        .word   0x40e00000
[atopala@aarchie presentation]$ objdump -d test-math
00000000004005d0 <main>:
  4005d0:       a9bd7bfd        stp     x29, x30, [sp,#-48]!
  4005d4:       910003fd        mov     x29, sp
  4005d8:       58000540        ldr     x0, 400680 <main+0xb0>
  4005dc:       f90017a0        str     x0, [x29,#40]
  4005e0:       58000500        ldr     x0, 400680 <main+0xb0>
  4005e4:       f9000fa0        str     x0, [x29,#24]
  4005e8:       52800020        mov     w0, #0x1                        // #1
  4005ec:       b90027a0        str     w0, [x29,#36]
  4005f0:       14000017        b       40064c <main+0x7c>
  4005f4:       b94027a0        ldr     w0, [x29,#36]
  4005f8:       1e220000        scvtf   s0, w0
  4005fc:       1e260001        fmov    w1, s0
  400600:       180003e0        ldr     w0, 40067c <main+0xac>
  400604:       1e270021        fmov    s1, w1
  400608:       1e270000        fmov    s0, w0
  40060c:       1e200821        fmul    s1, s1, s0
  400610:       1e260020        fmov    w0, s1
  400614:       1e270001        fmov    s1, w0
  400618:       1e22c021        fcvt    d1, s1
  40061c:       9e660020        fmov    x0, d1
  400620:       f9000fa0        str     x0, [x29,#24]
  400624:       f94017a1        ldr     x1, [x29,#40]
  400628:       f9400fa0        ldr     x0, [x29,#24]
  40062c:       9e670020        fmov    d0, x1
  400630:       9e670001        fmov    d1, x0
  400634:       1e612800        fadd    d0, d0, d1
  400638:       9e660000        fmov    x0, d0
  40063c:       f90017a0        str     x0, [x29,#40]
  400640:       b94027a0        ldr     w0, [x29,#36]
  400644:       11000400        add     w0, w0, #0x1
  400648:       b90027a0        str     w0, [x29,#36]
  40064c:       b94027a1        ldr     w1, [x29,#36]
  400650:       5292d000        mov     w0, #0x9680                     // #38528
  400654:       72a01300        movk    w0, #0x98, lsl #16
  400658:       6b00003f        cmp     w1, w0
  40065c:       54fffccd        b.le    4005f4 <main+0x24>
  400660:       90000000        adrp    x0, 400000 <_init-0x3f0>
  400664:       911ca000        add     x0, x0, #0x728
  400668:       fd4017a0        ldr     d0, [x29,#40]
  40066c:       97ffff7d        bl      400460 <printf@plt>
  400670:       52800000        mov     w0, #0x0                        // #0
  400674:       a8c37bfd        ldp     x29, x30, [sp],#48
  400678:       d65f03c0        ret
  40067c:       3ef3cf3d        .word   0x3ef3cf3d
        ...

The only thing that has changed is the arithmetic: the original expression, which contained two division operations, has been replaced with a single multiplication. The constant 0x3ef3cf3d at the end of the optimized listing is the single-precision encoding of 1/3 + 1/7 = 10/21, so each iteration now computes i * (10/21) instead of i/3 + i/7.
Due to the nature of floating-point arithmetic, that constant is only an approximation of the algebraic value; there is a tiny difference between the result of the original expression and the result of the optimized one, and, as evidenced by our program, over 10 million iterations this tiny difference can add up.

The gcc manual says of -funsafe-math-optimizations that “this option is not turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.”

We’ve seen with our program that the code produced with this compiler option can be significantly faster. If precision is not an issue and the data being processed is either small enough or distributed in such a way that large variations in the result don’t happen (as was the case in our program, where we just continually added to a number 10 million times), it might be okay to use -funsafe-math-optimizations for that extra speed boost. But, again, remember that it is unsafe, and might give unexpected results.

-ftracer

-ftracer, according to the gcc manual, “performs tail duplication to enlarge superblock size. This transformation simplifies the control flow of the function allowing other optimizations to do better job.”
Superblocks are used during trace scheduling, an optimization technique that predicts frequently executed paths in a program and tries to arrange the code so that the processor can exploit instruction-level parallelism, queuing up blocks of instructions that don’t depend on each other (i.e., one block can begin executing without waiting for a branch or result from another). A superblock is a set of instructions in which control can enter only at the top: you can’t begin executing instructions in the middle of a superblock. The larger the superblock, the less time the processor has to wait for the completion of other instructions before proceeding.
Tail duplication is the process of copying common parts of code in order to create larger superblocks. For example, consider the following program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
        srand(time(NULL));
        int x = rand() % 100;
        int y = rand() % 100;

        if (x == 10) {
                x++;
        }
        else {
                y++;
        }
        x += 10;
        y += 10;
        printf("%d, %d\n", x, y);

        return 0;
}

What tail duplication should do is copy the x += 10, y += 10, and printf statements that follow the conditional so that they can be placed inside both conditional branches. This creates larger blocks of code that can, ideally, be executed in parallel.

As the gcc manual says, this is especially useful when performing additional optimizations because it allows the compiler to move code blocks around more freely.

Compiling the program with only the -ftracer option doesn’t change it: the binaries produced by gcc test.c and gcc -ftracer test.c are identical. Compiling with -O2 and -ftracer, however, does. In the objdump listing, the file that had been compiled with -ftracer has an extra call to printf in its main section. Moreover, the program has extra instructions inside what would be the branch taken when x == 10. This suggests that tail duplication has taken place.

I did try to build longer, more complicated programs to see what -ftracer does during their compilation, but the results were inconclusive, to say the least. -ftracer seemed not to kick in except at higher optimization levels, and those mangled the machine code beyond my ability to analyze it well enough to say anything important about it 😦 But the theory, I think, makes sense, and I imagine that -ftracer could produce faster code (certainly not smaller, on account of the tail duplication) when dealing with much larger and more complicated source files.

Update 18/03/16: So I presented my findings to the class. The professor said that -ftracer is one of the optimization flags that actually make the program worse, but do so in such a way that the intermediate representation is more compliant with other optimizations; so, by itself, it is not very useful, but with further optimizations it may make a significant difference.
It was also suggested that tail merging, which occurs by default at certain optimization levels, might have been thwarting my attempts to obtain a useful result from -ftracer. This, I think, is why it is so difficult to analyze optimized machine code: you don’t really have a clear idea of which optimizations have been applied, and where, and why (also when and how). Anyway, I think (hope) most people understood, if not the exact intricacies of the flag’s effects, then at least the basic idea behind it.

Here is a link to the presentation slides:
Presentation Slides

Further Reading

GCC documentation – optimizations
Semantics of Floating Point Math in GCC
Floating Point Representation
Superblock Scheduling and Tail Duplication
The Superblock: An Effective Technique for VLIW and Superscalar Computation
VLIW Processors and Trace Scheduling: Super Block Scheduling
