Edit: This should be all okay now. (Sorry for posting this blog post incomplete–I had to watch Leo win his Oscar).
Here we’ll look at single instruction/multiple data (SIMD) vectorization.
A vector is a a series of equal-length data elements on which operations can be performed. SIMD instructions, as the name suggest, are used to perform an operation in parallel on multiple operands. Vectorization, then, is the process of converting parts of a program to a vector implementation so that SIMD instructions can be used in order to minimize processing time. A compiler can perform auto-vectorization during a source program’s compilation in order to optimize the resultant machine code so that it uses SIMD instructions, which can, depending on the size of data being processed, affect appreciably the running time of the program.
Consider, for instance, the following piece of code:
for (i=0; i<1000; i++) { arrSum[i] = arr1[i] + arr2[i]; }
That loop has potential for vectorization. Instead of performing the addition on each pair of elements one by one, the compiler can use a vector implementation to process multiple pairs of elements at once so that the loop runs fewer times.
To invoke auto-vectorization, we need to give gcc the -O3 option, which tells it to go all out with optimization.
gcc -g -O3 lab6.c -o lab6
Here is the source code itself:
#include <stdio.h> #include <stdlib.h> #include <time.h> int main() { int arr1[1000]; int arr2[1000]; int arrSum[1000]; long int sum; int i; srand(time(NULL)); for (i=0; i<1000; i++) { // The maximum value of rand() is the same as the capacity of int, so we halve the random number to avoid overflowing during addition arr1[i] = rand() / 2; arr2[i] = rand() / 2; } for (i=0; i<1000; i++) { arrSum[i] = arr1[i] + arr2[i]; } for (i=0; i<1000; i++) { sum += arrSum[i]; } printf("%ld\n", sum); return 0; }
Now let’s look at the machine code disassembly. I’ve extracted the main section of the code and annotated it to explain what is happening as I best could:
0000000000400550 <main>; // Here we reserve space on the stack for local variables 400550: d1400bff sub sp, sp, #0x2, lsl #12 400554: d13b83ff sub sp, sp, #0xee0 // Here we are storing a pair of double words to a location on the stack; these two registers are to be used as the frame and link registers 400558: a9bd7bfd stp x29, x30, [sp,#-48]! // Here we move the current stack pointer to the frame register 40055c: 910003fd mov x29, sp 400560: d2800000 mov x0, #0x0 // #0 // Again storing a pair of double words 400564: a90153f3 stp x19, x20, [sp,#16] // Store now a single register to a location on the stack 400568: f90013f5 str x21, [sp,#32] // Here the program branches to the time and srand functions to initialize the random seed 40056c: 97ffffdd bl 4004e0 <time@plt> 400570: 97fffff0 bl 400530 <srand@plt> // Here the first loop begins, in which every element of the two arrays is set to a random number 400574: d281fa01 mov x1, #0xfd0 // #4048 400578: d2800013 mov x19, #0x0 // #0 40057c: 9100c3b5 add x21, x29, #0x30 400580: 8b1d0034 add x20, x1, x29 400584: 97ffffdf bl 400500 <rand@plt> 400588: 0b407c00 add w0, w0, w0, lsr #31 40058c: 13017c00 asr w0, w0, #1 400590: b8356a60 str w0, [x19,x21] 400594: 97ffffdb bl 400500 <rand@plt> 400598: 0b407c00 add w0, w0, w0, lsr #31 40059c: 13017c00 asr w0, w0, #1 4005a0: b8346a60 str w0, [x19,x20] // In the first loop, the counter is incremented by 4 and compared to 4000, so the loop runs 1000 times 4005a4: 91001273 add x19, x19, #0x4 4005a8: f13e827f cmp x19, #0xfa0 4005ac: 54fffec1 b.ne 400584 <main+0x34> 4005b0: d2800000 mov x0, #0x0 // #0 // This is the entry point for the second loop, after the counter is set to 0 // We add 48 to r29 (which contains the address of the stack pointer where our arrays begin) and store that in a register to be used for arguments and return values 4005b4: 9100c3a3 add x3, x29, #0x30 // Then we add the loop counter to the register and store that in another register 4005b8: 8b000062 add x2, x3, x0 // Now we add 4048 to the frame register and store it in r3 // We are effectively jumping over the first 4000 bytes (ie the first 1000-element int array) in the stack 4005bc: 913f43a3 add x3, x29, #0xfd0 // And again we add the loop counter to r3 and store it this time in r1 4005c0: 8b000061 add x1, x3, x0 // r1 and r2, I think, are used to move around the memory space to read from and write to the three arrays we are using; essentially we're using r0, which is the loop counter, as the offset from the two starting locations of the two arrays we are using // Here we load 4 single words from the memory addressed by r2 4005c4: 4c407841 ld1 {v1.4s}, [x2] // Now we're moving 8048 to r2, which we will use to jump ahead and find the third array 4005c8: d283ee02 mov x2, #0x1f70 // #8048 // Now we load 4 single words from the memory addressed by r1 4005cc: 4c407820 ld1 {v0.4s}, [x1] // Then add the stack register's value to r2 (effectively setting r2 to the memory address of the third array)... 4005d0: 8b1d0042 add x2, x2, x29 // ...and the loop counter to r2 and store it in r1 (so, we store in r1 the address of the nth element in the third array, where n is the loop counter) 4005d4: 8b000041 add x1, x2, x0 // This is what it's all for: vector addition 4005d8: 4ea08420 add v0.4s, v1.4s, v0.4s // In the second loop, the counter is incremented by 16 and compared to 4000, so the loop runs only 250 times (and since we are processing 4 numbers each iteration, that's 1000 numbers in total) 4005dc: 91004000 add x0, x0, #0x10 // Now we store the 4 sums to the memory addressed by r1 (that is, to the four consecutive elements starting from the nth in the third array, where n is the loop counter) 4005e0: 4c007820 st1 {v0.4s}, [x1] // This is where the loop counter is compared to 4000 4005e4: f13e801f cmp x0, #0xfa0 4005e8: 54fffe61 b.ne 4005b4 <main+0x64> // We will move ahead another 4000 bytes, so we pass over the third array 4005ec: d285e201 mov x1, #0x2f10 // #12048 // r2, at this point, should, I think, hold the memory address of the start of the third array (ie the array whose elements we are now summing) // We move that value to r0 4005f0: aa0203e0 mov x0, x2 // We add the frame register's value to r1, which will now point to the empty memory space after the three arrays 4005f4: 8b1d0021 add x1, x1, x29 // Move immediate: each of the four single words here are set to 0 4005f8: 4f000401 movi v1.4s, #0x0 // r0 points to the third array, so we load four single words of it; then r0 is incremented by 16 bytes (which is the total number of bytes we're working with in this instruction (4 single words x 32 bits per word = 128 bits = 16 bytes)) // This is also the point to which the loop will return on successive iterations 4005fc: 4cdf7800 ld1 {v0.4s}, [x0], #16 // Signed integer lengthen // Two of the single words in vector 0 are made double words and put into vector 2 400600: 0f20a402 sxtl v2.2d, v0.2s // Then the two double words in vector 2 are added to the two double words in vector 1 400604: 4ee18441 add v1.2d, v2.2d, v1.2d // Now the second part of vector 0 is made into double words 400608: 4f20a400 sxtl2 v0.2d, v0.4s // Here we check to see if r0 points to the same memory address as r1 (that is, if we've gone through all the elements of the third array and reached the empty space beyond) 40060c: eb01001f cmp x0, x1 // And now vector 0 (which contains what were the latter two single words of vector 0 and are now two double words) is added to vector 1 // At the end of the loop, vector 1 contains two numbers which, when added together, will equal the total sum of all the numbers in the third array 400610: 4ee18401 add v1.2d, v0.2d, v1.2d 400614: 54ffff41 b.ne 4005fc <main+0xac> // Scalar reduce; we add the two numbers in vector 1 together into a double-precision register, which is the same register and contains the sum of the two doublewords it contained before 400618: 5ef1b821 addp d1, v1.2d 40061c: 90000000 adrp x0, 400000 <_init-0x4a0> // Move the first and second 64-bit elements from vector 1 to two separate registers // This might be so that they can be used as arguments for printf? 400620: 4e083c21 mov x1, v1.d[0] 400624: 4e183c22 mov x2, v1.d[1] 400628: 91214000 add x0, x0, #0x850 40062c: 97ffffc5 bl 400540 <printf@plt> 400630: a94153f3 ldp x19, x20, [sp,#16] 400634: f94013f5 ldr x21, [sp,#32] 400638: a8c37bfd ldp x29, x30, [sp],#48 40063c: 52800000 mov w0, #0x0 // #0 400640: 913b83ff add sp, sp, #0xee0 400644: 91400bff add sp, sp, #0x2, lsl #12 400648: d65f03c0 ret 40064c: 00000000 .inst 0x00000000 ; undefined
The important instructions for vector addition are the following:
LD1 {Vt.<T>}, [base] ADD Vd.<T>, Vn.<T>, Vm.<T> ST1 {Vt.<T>}, [base]
These instructions load into the vector register multiple 1-element structures from the memory addressed by the base register; add the corresponding elements of the two vectors and store the sum in another vector; and store multiple 1-element elements from the vector to the memory addressed by the base register.
Then, to add together all the numbers in an array to a single long int, further instructions are used:
SXTL Vd.<Td>, Vn.<Ts> SXTL2 Vd.<Td>, Vn.<Ts> ADDP Dd, Vn.2D
These take the signed integers in the first half of the source vector, lengthen them (eg converts two single words into two double words), and store them in the destination vector; do the same to the second half of the source vector; and add together the two double-word elements in the source vector and store them in the destination register.
If our Algorithm Selection Lab were to use SIMD instructions, it would probably use the MUL instruction, which can multiply the integers in two vectors. However, the second operand we need there is a floating-point number (that is, the volume scaling factor); vector multiplication can’t be performed on vectors with elements of different types, so we would need to convert the floating-point number to an integer in order to use it in a vector multiplication. This, I think, would be done with the FCVTxS instruction, which converts a floating-point vector to a signed integer vector of the same size. Then we could use the SXTL instructions to lengthen the half-word (16-bit) integer vector so that the elements are the same size as the integer vector containing the word (32-bit) integers converted from the floating point. Then we could finally use the MUL instruction to multiply them together. (I am not sure how the 32-bit integer converted from the floating point of value ranging from 0.000 to 1.000 would look, so I don’t know if there is the potential for overflowing past the 32-bit limit when we multiply, but since the other operand was initially only 16 bits it should be okay, I think).
Analyzing the disassembly listing was pretty difficult, mainly because I am still at that stage where the assembler instructions don’t really mean anything to me in and of themselves, so, as with a foreign language of which you have only a tenuous grasp, I have to translate each instruction in my head, and sometimes reach for a dictionary (the ARMv8 Instruction Set Overview seems esoteric and scary at first glance, but it is very helpful). Reading a disassembly listing is like being lost in a strange land and getting directions from someone who speaks only that language, and this person insists on giving very, very detailed directions.
Another challenge was keeping track of what is stored in each register. Since most of the registers are general-purpose, and all you have to go on is the register’s number and the width of what’s inside, it’s easy to get them all mixed up. Kind of makes you appreciate the ability to give variables really long and descriptive names in higher-level programming languages.
Anyway, auto-vectorization is neat and it’s easy to see how SIMD could be very useful when processing operations.