Monday, November 2, 2015

Aliasing and autovectorization

In the previous article, “Complications”, I mentioned an example of a serial algorithm, incrementAll, that the compiler supposedly converts into code that operates in parallel.

We will see that exactly this happens with two of the best compiler suites of all time, Clang and GCC.

Modern architectures have vector capabilities. For example, Intel’s x86-64 has the SSE family of extensions, which with AVX-512 has reached 32 registers of 512 bits (16 single-precision floating point elements each). Most x86-64 processors in use today have 16 registers of 256 bits. I should say that Intel’s SSE is the weirdest SIMD instruction set I know of, and that it has severe design errors; at the opportune moment I hope to comment on this in full. But the most popular architecture in the world today has the vector instruction set that it has, and because it performs well despite its design errors, we ought to learn about it.

Modern compilers also know how to automatically transform serial code into vector operations, even for the notoriously hard SSE. Let us look at a version of incrementAll better suited to illustrate autovectorization:
using BigThing = float;

// note: value is passed by value, not by const reference ==> no aliasing
void incrementAll(BigThing *ptr, BigThing value) {
    ptr = static_cast<BigThing *>(__builtin_assume_aligned(ptr, 32));
        // note: it is not safe to make this assumption in general; done here
        // to avoid the complications of handling misaligned data
    for(auto i = 0; i < 1024; ++i) { ptr[i] += value; }
}

There are two changes: an array, assumed to be aligned to a 32-byte boundary, is passed instead of a vector; and the length is fixed at 1024 (2^10) elements. Still, the source code as written specifies a serial application of += to each element of the array.

Using Matt Godbolt’s stupendous “Compiler Explorer” we can readily see what GCC 5.2 does with that code, with color highlighting and all; repeated here:
incrementAll(float*, float):
 vbroadcastss %xmm0, %ymm1
 leaq 4096(%rdi), %rax
.L2:
 vaddps (%rdi), %ymm1, %ymm0
 addq $32, %rdi
 vmovaps %ymm0, -32(%rdi)
 cmpq %rdi, %rax
 jne .L2
 vzeroupper
 ret

Which I can translate for you:
  • Line 2 copies the least significant float of vector register xmm0 into the 8 positions of ymm1; xmm0 is obviously the argument value, as the x86-64 ABI specifies (the first floating point parameter goes in xmm0)
  • Line 3 calculates ptr + 4096 and puts it in scalar register rax (rdi is the first scalar parameter, ptr)
  • Line 5 adds 8 floats at a time from *ptr into ymm0 (xmm0 and ymm0 are the same register, just viewed as 16 or 32 bytes)
  • Since each iteration consumes 8 floats, line 6 increments the pointer by 4*8 = 32 bytes
  • Line 7 copies the result back
  • Lines 8-9 control the loop
  • Line 10, vzeroupper, clears the upper 128 bits of the YMM registers to avoid the performance penalty of mixing AVX code with legacy SSE code
As you can see, merely by asking GCC to optimize for speed (-O3) and letting it use AVX, the 256-bit revision of SSE (-mavx, -mavx2), it transforms the serial code into code that processes 8 floats in parallel. AVX is not even needed: you may remove those options and you'd still get vectorization on 128-bit registers.

Clang is even more radical: it also unrolls the loop four times, so each of its iterations consumes 32 floats! This is a digression, but I suspect this unrolling is actually a pessimization: the processor has only so many execution units, and I don’t think it is capable of performing four 256-bit vector additions in parallel. Even if it were, the rules of branch prediction say the execution will loop back, which means that as a matter of course the processor itself will pipeline the execution of many iterations simultaneously, in effect executing as many additions in parallel as it is capable of, but without unnecessarily longer assembler code that reduces the effectiveness of the code cache.

If the assumption of alignment to 32 bytes is removed, you can also see the output from ICC, which also vectorizes automatically. So, it is demonstrated that these compilers are capable of automatic vectorization.

Just for kicks, let’s make a very simple change: instead of passing value by value, let us pass it by const reference. Clang simply turns off vectorization.

GCC still vectorizes, but only in the optimistic case in which value is not located within the span of the array:
incrementAll(float*, float const&):
 leaq 4(%rsi), %rdx
 leaq 4096(%rdi), %rax
 cmpq %rdx, %rdi
 jnb .L7
 cmpq %rax, %rsi
 jb .L6
.L7:
 vbroadcastss (%rsi), %ymm1
.L5:
 vaddps (%rdi), %ymm1, %ymm0
 addq $32, %rdi
 vmovaps %ymm0, -32(%rdi)
 cmpq %rax, %rdi
 jne .L5
 vzeroupper
 ret
.L6:
 vmovss (%rdi), %xmm0
 addq $4, %rdi
 vaddss (%rsi), %xmm0, %xmm0
 vmovss %xmm0, -4(%rdi)
 cmpq %rdi, %rax
 jne .L6
 rep ret
  • Line 2 means rdx = &value + 1
  • Line 3: rax = ptr + 1024
  • Lines 4-5 mean if(&value < ptr) { goto L7; } // <- do the vector loop
  • Lines 6-7 mean if(&value < ptr + 1024) { goto L6; } // <- do the scalar loop
  • Lines 9-17 are the vectorized loop
  • Lines 19-25 are the serial loop
Unlike the code I suggested in my previous article, you can see that GCC gives up on vectorization if &value is within the span of the array. According to Eric Brumer, a developer of the Microsoft compiler, in his “Compiler Confidential” presentation, the Microsoft compiler will keep trying to vectorize, performing very complicated case analysis.

Of course, one should never pass by const reference what could be passed by value without incurring the cost of a copy; but this shows what happens if you do: the compiler is forced to deal with aliasing, and no matter what, performance is lost, even if only to perform the now-necessary checks to detect aliasing.

There is one case in which C++ programmers routinely pass by const reference even when the arguments may be as small as a single floating point number: in templates, when the actual type is not known when the code is written. To prevent problems with potentially very expensive copy constructors, we have defaulted for more than 20 years to const references (or pointers).

What is the solution? It is not practical to cover with template specializations all the known cases in which passing by value is superior. The answers I have devised thus far are incomplete, but they lead to code that is both higher performing and easier to understand. I will be writing about them.
