# Vectorization Lab

Edit: This should be all okay now. (Sorry for posting this blog post incomplete–I had to watch Leo win his Oscar).

Here we’ll look at single instruction/multiple data (SIMD) vectorization.

A vector is a a series of equal-length data elements on which operations can be performed. SIMD instructions, as the name suggest, are used to perform an operation in parallel on multiple operands. Vectorization, then, is the process of converting parts of a program to a vector implementation so that SIMD instructions can be used in order to minimize processing time. A compiler can perform auto-vectorization during a source program’s compilation in order to optimize the resultant machine code so that it uses SIMD instructions, which can, depending on the size of data being processed, affect appreciably the running time of the program.

Consider, for instance, the following piece of code:

```for (i=0; i<1000; i++) {
arrSum[i] = arr1[i] + arr2[i];
}
```

That loop has potential for vectorization. Instead of performing the addition on each pair of elements one by one, the compiler can use a vector implementation to process multiple pairs of elements at once so that the loop runs fewer times.

To invoke auto-vectorization, we need to give gcc the -O3 option, which tells it to go all out with optimization.

```gcc -g -O3 lab6.c -o lab6
```

Here is the source code itself:

```#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
int arr1;
int arr2;
int arrSum;
long int sum;

int i;
srand(time(NULL));

for (i=0; i<1000; i++) {
// The maximum value of rand() is the same as the capacity of int, so we halve the random number to avoid overflowing during addition
arr1[i] = rand() / 2;
arr2[i] = rand() / 2;
}

for (i=0; i<1000; i++) {
arrSum[i] = arr1[i] + arr2[i];
}

for (i=0; i<1000; i++) {
sum += arrSum[i];
}

printf("%ld\n", sum);

return 0;
}
```

Now let’s look at the machine code disassembly. I’ve extracted the main section of the code and annotated it to explain what is happening as I best could:

```0000000000400550 <main>;
// Here we reserve space on the stack for local variables
400550:	d1400bff 	sub	sp, sp, #0x2, lsl #12
400554:	d13b83ff 	sub	sp, sp, #0xee0
// Here we are storing a pair of double words to a location on the stack; these two registers are to be used as the frame and link registers
400558:	a9bd7bfd 	stp	x29, x30, [sp,#-48]!
// Here we move the current stack pointer to the frame register
40055c:	910003fd 	mov	x29, sp
400560:	d2800000 	mov	x0, #0x0                   	// #0
// Again storing a pair of double words
400564:	a90153f3 	stp	x19, x20, [sp,#16]
// Store now a single register to a location on the stack
400568:	f90013f5 	str	x21, [sp,#32]
// Here the program branches to the time and srand functions to initialize the random seed
40056c:	97ffffdd 	bl	4004e0 <time@plt>
400570:	97fffff0 	bl	400530 <srand@plt>
// Here the first loop begins, in which every element of the two arrays is set to a random number
400574:	d281fa01 	mov	x1, #0xfd0                 	// #4048
400578:	d2800013 	mov	x19, #0x0                   	// #0
40057c:	9100c3b5 	add	x21, x29, #0x30
400580:	8b1d0034 	add	x20, x1, x29
400584:	97ffffdf 	bl	400500 <rand@plt>
400588:	0b407c00 	add	w0, w0, w0, lsr #31
40058c:	13017c00 	asr	w0, w0, #1
400590:	b8356a60 	str	w0, [x19,x21]
400594:	97ffffdb 	bl	400500 <rand@plt>
400598:	0b407c00 	add	w0, w0, w0, lsr #31
40059c:	13017c00 	asr	w0, w0, #1
4005a0:	b8346a60 	str	w0, [x19,x20]
// In the first loop, the counter is incremented by 4 and compared to 4000, so the loop runs 1000 times
4005a4:	91001273 	add	x19, x19, #0x4
4005a8:	f13e827f 	cmp	x19, #0xfa0
4005ac:	54fffec1 	b.ne	400584 <main+0x34>
4005b0:	d2800000 	mov	x0, #0x0                   	// #0
// This is the entry point for the second loop, after the counter is set to 0
// We add 48 to r29 (which contains the address of the stack pointer where our arrays begin) and store that in a register to be used for arguments and return values
4005b4:	9100c3a3 	add	x3, x29, #0x30
// Then we add the loop counter to the register and store that in another register
4005b8:	8b000062 	add	x2, x3, x0
// Now we add 4048 to the frame register and store it in r3
// We are effectively jumping over the first 4000 bytes (ie the first 1000-element int array) in the stack
4005bc:	913f43a3 	add	x3, x29, #0xfd0
// And again we add the loop counter to r3 and store it this time in r1
4005c0:	8b000061 	add	x1, x3, x0
// r1 and r2, I think, are used to move around the memory space to read from and write to the three arrays we are using; essentially we're using r0, which is the loop counter, as the offset from the two starting locations of the two arrays we are using
// Here we load 4 single words from the memory addressed by r2
4005c4:	4c407841 	ld1	{v1.4s}, [x2]
// Now we're moving 8048 to r2, which we will use to jump ahead and find the third array
4005c8:	d283ee02 	mov	x2, #0x1f70                	// #8048
// Now we load 4 single words from the memory addressed by r1
4005cc:	4c407820 	ld1	{v0.4s}, [x1]
// Then add the stack register's value to r2 (effectively setting r2 to the memory address of the third array)...
4005d0:	8b1d0042 	add	x2, x2, x29
// ...and the loop counter to r2 and store it in r1 (so, we store in r1 the address of the nth element in the third array, where n is the loop counter)
4005d4:	8b000041 	add	x1, x2, x0
// This is what it's all for: vector addition
4005d8:	4ea08420 	add	v0.4s, v1.4s, v0.4s
// In the second loop, the counter is incremented by 16 and compared to 4000, so the loop runs only 250 times (and since we are processing 4 numbers each iteration, that's 1000 numbers in total)
4005dc:	91004000 	add	x0, x0, #0x10
// Now we store the 4 sums to the memory addressed by r1 (that is, to the four consecutive elements starting from the nth in the third array, where n is the loop counter)
4005e0:	4c007820 	st1	{v0.4s}, [x1]
// This is where the loop counter is compared to 4000
4005e4:	f13e801f 	cmp	x0, #0xfa0
4005e8:	54fffe61 	b.ne	4005b4 <main+0x64>
// We will move ahead another 4000 bytes, so we pass over the third array
4005ec:	d285e201 	mov	x1, #0x2f10                	// #12048
// r2, at this point, should, I think, hold the memory address of the start of the third array (ie the array whose elements we are now summing)
// We move that value to r0
4005f0:	aa0203e0 	mov	x0, x2
// We add the frame register's value to r1, which will now point to the empty memory space after the three arrays
4005f4:	8b1d0021 	add	x1, x1, x29
// Move immediate: each of the four single words here are set to 0
4005f8:	4f000401 	movi	v1.4s, #0x0
// r0 points to the third array, so we load four single words of it; then r0 is incremented by 16 bytes (which is the total number of bytes we're working with in this instruction (4 single words x 32 bits per word = 128 bits = 16 bytes))
// This is also the point to which the loop will return on successive iterations
4005fc:	4cdf7800 	ld1	{v0.4s}, [x0], #16
// Signed integer lengthen
// Two of the single words in vector 0 are made double words and put into vector 2
400600:	0f20a402 	sxtl	v2.2d, v0.2s
// Then the two double words in vector 2 are added to the two double words in vector 1
400604:	4ee18441 	add	v1.2d, v2.2d, v1.2d
// Now the second part of vector 0 is made into double words
400608:	4f20a400 	sxtl2	v0.2d, v0.4s
// Here we check to see if r0 points to the same memory address as r1 (that is, if we've gone through all the elements of the third array and reached the empty space beyond)
40060c:	eb01001f 	cmp	x0, x1
// And now vector 0 (which contains what were the latter two single words of vector 0 and are now two double words) is added to vector 1
// At the end of the loop, vector 1 contains two numbers which, when added together, will equal the total sum of all the numbers in the third array
400610:	4ee18401 	add	v1.2d, v0.2d, v1.2d
400614:	54ffff41 	b.ne	4005fc <main+0xac>
// Scalar reduce; we add the two numbers in vector 1 together into a double-precision register, which is the same register and contains the sum of the two doublewords it contained before
400618:	5ef1b821 	addp	d1, v1.2d
40061c:	90000000 	adrp	x0, 400000 <_init-0x4a0>
// Move the first and second 64-bit elements from vector 1 to two separate registers
// This might be so that they can be used as arguments for printf?
400620:	4e083c21 	mov	x1, v1.d
400624:	4e183c22 	mov	x2, v1.d
400628:	91214000 	add	x0, x0, #0x850
40062c:	97ffffc5 	bl	400540 <printf@plt>
400630:	a94153f3 	ldp	x19, x20, [sp,#16]
400634:	f94013f5 	ldr	x21, [sp,#32]
400638:	a8c37bfd 	ldp	x29, x30, [sp],#48
40063c:	52800000 	mov	w0, #0x0                   	// #0
400640:	913b83ff 	add	sp, sp, #0xee0
400644:	91400bff 	add	sp, sp, #0x2, lsl #12
400648:	d65f03c0 	ret
40064c:	00000000 	.inst	0x00000000 ; undefined
```

The important instructions for vector addition are the following:

```LD1 {Vt.<T>}, [base]
ADD Vd.<T>, Vn.<T>, Vm.<T>
ST1 {Vt.<T>}, [base]
```

These instructions load into the vector register multiple 1-element structures from the memory addressed by the base register; add the corresponding elements of the two vectors and store the sum in another vector; and store multiple 1-element elements from the vector to the memory addressed by the base register.

Then, to add together all the numbers in an array to a single long int, further instructions are used:

```SXTL Vd.<Td>, Vn.<Ts>
SXTL2 Vd.<Td>, Vn.<Ts>
ADDP Dd, Vn.2D
```

These take the signed integers in the first half of the source vector, lengthen them (eg converts two single words into two double words), and store them in the destination vector; do the same to the second half of the source vector; and add together the two double-word elements in the source vector and store them in the destination register.

If our Algorithm Selection Lab were to use SIMD instructions, it would probably use the MUL instruction, which can multiply the integers in two vectors. However, the second operand we need there is a floating-point number (that is, the volume scaling factor); vector multiplication can’t be performed on vectors with elements of different types, so we would need to convert the floating-point number to an integer in order to use it in a vector multiplication. This, I think, would be done with the FCVTxS instruction, which converts a floating-point vector to a signed integer vector of the same size. Then we could use the SXTL instructions to lengthen the half-word (16-bit) integer vector so that the elements are the same size as the integer vector containing the word (32-bit) integers converted from the floating point. Then we could finally use the MUL instruction to multiply them together. (I am not sure how the 32-bit integer converted from the floating point of value ranging from 0.000 to 1.000 would look, so I don’t know if there is the potential for overflowing past the 32-bit limit when we multiply, but since the other operand was initially only 16 bits it should be okay, I think).

Analyzing the disassembly listing was pretty difficult, mainly because I am still at that stage where the assembler instructions don’t really mean anything to me in and of themselves, so, as with a foreign language of which you have only a tenuous grasp, I have to translate each instruction in my head, and sometimes reach for a dictionary (the ARMv8 Instruction Set Overview seems esoteric and scary at first glance, but it is very helpful). Reading a disassembly listing is like being lost in a strange land and getting directions from someone who speaks only that language, and this person insists on giving very, very detailed directions.
Another challenge was keeping track of what is stored in each register. Since most of the registers are general-purpose, and all you have to go on is the register’s number and the width of what’s inside, it’s easy to get them all mixed up. Kind of makes you appreciate the ability to give variables really long and descriptive names in higher-level programming languages.

Anyway, auto-vectorization is neat and it’s easy to see how SIMD could be very useful when processing operations.