Wednesday, November 4, 2015

Gaining clarity and performance

We will now look into the implementation of several argument passing choices.

Most of argument passing is governed by the ABI, the “Application Binary Interface”: the convention compiler vendors agree on so that code compiled by one vendor can be used with code compiled by another. Since I program primarily for x86-64 on Linux, I will concentrate almost exclusively on that platform; the most recent ABI can be seen here.

We saw in the article “Aliasing and autovectorization” how aliasing disrupts autovectorization. We will now focus on a simpler case, when passing by const reference versus passing by value leads only to the apparently small difference of using a parameter from memory versus using it from a register.

Again we have to reduce real-life complications such as alignment. Previously I used the GCC extension __builtin_assume_aligned, which is also supported by Clang versions 3.5 and higher; however, that extension can only be applied to pointers, and I would like to move away from pointers. I will use GCC’s extension to define SIMD (Single Instruction Multiple Data) vector types, which is also supported by Clang. I don’t recommend using GCC’s SIMD vectors; I think they are actually broken. The situation is better with Clang; this is a subject that will be revisited when discussing the emerging superiority of Clang over GCC. Be that as it may, if the vector size is supported by the platform, SIMD vectors implicitly set the alignment to the vector size. By the way, if you can successfully run SIMD vectors on x86 using GCC’s extension, you can almost rest assured your code will be just as vectorized on any other platform, because x86’s SSE instruction set is probably the hardest to use.

Using the same example as before, this time I forced the basic data type to be a vector of 4 doubles; see this code, which simply adds a value to each element of the vector:
#include <vector>
 
// typedefs v4d as a vector of 32 bytes in which elements are doubles
// that is, a vector of 4 doubles
using v4d = double __attribute__((vector_size(32)));

void addVector(std::vector<v4d> &v, v4d value) {
    for(auto &e: v) { e += value; }
}

void addVectorByCR(std::vector<v4d> &v, const v4d &value) {
    for(auto &e: v) { e += value; }
}
The awesome tool Online GCC Explorer lets us see what GCC 5.2 --among other compilers and versions-- produces with 256-bit vector registers enabled (compiler options -mavx and -mavx2):
addVector(std::vector<double __vector(4), std::allocator<double __vector(4)> >&, double __vector(4)):
    movq    8(%rdi), %rdx
    movq    (%rdi), %rax
    cmpq    %rdx, %rax
    je  .L7
.L5:
    vaddpd  (%rax), %ymm0, %ymm1
    addq    $32, %rax
    vmovapd %ymm1, -32(%rax)
    cmpq    %rax, %rdx
    jne .L5
.L7:
    vzeroupper
    ret
addVectorByCR(std::vector<double __vector(4), std::allocator<double __vector(4)> >&, double __vector(4) const&):
    movq    8(%rdi), %rdx
    movq    (%rdi), %rax
    cmpq    %rdx, %rax
    je  .L15
.L13:
    vmovapd (%rax), %ymm0
    addq    $32, %rax
    vaddpd  (%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L13
    vzeroupper
.L15:
    rep ret
We can see that there is only a deceptively small difference between passing the addendum by value (addVector) and passing it by const reference (addVectorByCR). Let us explain the generated code for both versions at the same time, since they are almost identical, which was the point:
  • The first two lines (2-3, and 16-17) just load the vector’s begin and end into rax and rdx (by the way, this clearly shows that the memory layout of vectors begins with two pointers, the buffer and its end),
  • Lines 4-5 or 18-19 are simply a check for an empty vector,
  • Lines 7-11 or 21-26 are the main loop:
    • Line 7 adds *rax to ymm0 into ymm1; that is, value was passed by value in ymm0.
    • Since the addendum is passed by const reference into addVectorByCR, the first thing is to load *rax into a vector register, ymm0, at line 21,
    • Lines 8 and 22 simply make rax point to the next element in the vector,
    • Line 23 performs the addition from memory through the pointer rsi; this is the only difference,
    • Lines 10-11 and 25-26 just decide whether to loop back.
  • Lines 13 and 27 are just an ABI artifact. Perhaps I will devote an article to “VZEROUPPER” after some preliminaries of SSE have been covered.
From the point of view of performance, I would say GCC 5.2 thinks adding from memory is about as fast as adding from a register, and it probably is, since value will be in the L1 cache for most of the iterations. But what would happen if another thread writes, not even to the actual location of &value, but merely to the same cache line as &value? The cache line for &value will be invalidated, so the iterating thread will stall while bringing the value of the addendum back into the L1 cache. This stall will have a latency equal to the latency of the cache level at which the cache line for &value is shared with the modifier, potentially all the way to main memory, which is about 100 times slower than L1. It will be at least L2 (since in x86 implementations the L1 cache is private to each core), which is around 4 times slower than L1. What if the thing the other thread is writing is something like an accumulator of all the values in a vector? That value will be changing very often, thus the cache line will be invalidated often, and things may run 4, 10, 100 times slower… This problem is referred to as “false sharing”.

This is not a hypothetical scenario; false sharing is illustrated in plenty of good resources available online, including this presentation by Scott Meyers, around minute 8.

So, we have another set of cases in which the apparently innocuous difference between passing by value and by const reference may lead to dramatic performance differences.

For small data types, which hopefully have copy constructors without side effects, it is clear one ought to pass by value as much as possible.

There is some support for discovering both things in templates: the operator sizeof will tell you the size of the type, and a stronger set of conditions on the nature of the copy constructor can be queried through std::is_trivially_copyable. So, with some effort (not illustrated now, but it will be in the future), it is possible to design templates that make an informed decision on whether to accept arguments by value or by const reference.
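As a preview, here is a minimal sketch of such a trait-based decision; the alias name, the example structs, and the 16-byte threshold are my illustrative choices, not anything prescribed by the ABI or the standard:

```cpp
#include <type_traits>

// Chooses T itself when T is small and trivially copyable,
// const T& otherwise. The 16-byte cutoff is illustrative only.
template <typename T>
using argument_type = typename std::conditional<
    std::is_trivially_copyable<T>::value && sizeof(T) <= 16,
    T,
    const T &
>::type;

struct Small { int x, y; };           // 8 bytes, trivially copyable
struct Large { double payload[64]; }; // 512 bytes

static_assert(std::is_same<argument_type<Small>, Small>::value,
              "small trivial types go by value");
static_assert(std::is_same<argument_type<Large>, const Large &>::value,
              "large types go by const reference");

// A function template that accepts its argument the chosen way; note
// that argument_type<T> is a non-deduced context, so T must be given
// explicitly at the call site in this simple form.
template <typename T>
void consume(argument_type<T> value);
```

The static_asserts document the dispatch at compile time; a production version would also have to consider alignment and the ABI's class-by-class passing rules.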

Having the small sizes covered, let us explore argument passing with increasingly larger types. If a type is sufficiently large, the ABI will prescribe that passing by value be implemented not by register but by stack, that is, as a “hidden reference” to the stack. This is illustrated by creating a structure big enough, Big, and using the same algorithm, add.
#include <vector>

// typedefs v4d as a vector of 32 bytes in which elements are doubles
// that is, a vector of 4 doubles
using v4d = double __attribute__((vector_size(32)));

struct Big {
    int otherStuff[32]; // 32 members of 4 bytes each
    v4d a, b, c, d; // 4 members of 32 bytes each

    Big &operator+=(const Big &v) { a += v.a; b += v.b; c += v.c; d += v.d; return *this; }
};

void add(std::vector<Big> &v, Big addendum) {
    for(auto &e: v) { e += addendum; }
}

void addByCR(std::vector<Big> &v, const Big &addendum) {
    for(auto &e: v) { e += addendum; }
}

Big opaqueFunction();

void opaqueAdd(std::vector<Big> &v, Big value);

void informedCall(std::vector<Big> &v) {
    Big addendum = opaqueFunction();
    add(v, addendum);
}

void uninformedCall(std::vector<Big> &v) {
    Big addendum = opaqueFunction();
    opaqueAdd(v, addendum);
}
Which gives this result:
add(std::vector<Big, std::allocator<Big> >&, Big):
    movq    (%rdi), %rax
    movq    8(%rdi), %rdx
    cmpq    %rdx, %rax
    je  .L6
    vmovapd 136(%rsp), %ymm4
    vmovapd 168(%rsp), %ymm3
    vmovapd 200(%rsp), %ymm2
    vmovapd 232(%rsp), %ymm1
.L3:
    vaddpd  128(%rax), %ymm4, %ymm0
    addq    $256, %rax
    vmovapd %ymm0, -128(%rax)
    vaddpd  -96(%rax), %ymm3, %ymm0
    vmovapd %ymm0, -96(%rax)
    vaddpd  -64(%rax), %ymm2, %ymm0
    vmovapd %ymm0, -64(%rax)
    vaddpd  -32(%rax), %ymm1, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L3
    vzeroupper
.L6:
    rep ret
addByCR(std::vector<Big, std::allocator<Big> >&, Big const&):
    movq    (%rdi), %rax
    movq    8(%rdi), %rdx
    cmpq    %rdx, %rax
    je  .L14
.L12:
    vmovapd 128(%rax), %ymm0
    addq    $256, %rax
    vaddpd  128(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -128(%rax)
    vmovapd -96(%rax), %ymm0
    vaddpd  160(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -96(%rax)
    vmovapd -64(%rax), %ymm0
    vaddpd  192(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -64(%rax)
    vmovapd -32(%rax), %ymm0
    vaddpd  224(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L12
    vzeroupper
.L14:
    rep ret
informedCall(std::vector<Big, std::allocator<Big> >&):
    leaq    8(%rsp), %r10
    andq    $-32, %rsp
    pushq   -8(%r10)
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r10
    pushq   %rbx
    movq    %rdi, %rbx
    leaq    -272(%rbp), %rdi
    subq    $256, %rsp
    call    opaqueFunction()
    movq    (%rbx), %rax
    movq    8(%rbx), %rdx
    vmovapd -144(%rbp), %ymm4
    vmovapd -112(%rbp), %ymm3
    cmpq    %rdx, %rax
    vmovapd -80(%rbp), %ymm2
    vmovapd -48(%rbp), %ymm1
    je  .L21
.L19:
    vaddpd  128(%rax), %ymm4, %ymm0
    addq    $256, %rax
    vmovapd %ymm0, -128(%rax)
    vaddpd  -96(%rax), %ymm3, %ymm0
    vmovapd %ymm0, -96(%rax)
    vaddpd  -64(%rax), %ymm2, %ymm0
    vmovapd %ymm0, -64(%rax)
    vaddpd  -32(%rax), %ymm1, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L19
.L21:
    vzeroupper
    addq    $256, %rsp
    popq    %rbx
    popq    %r10
    popq    %rbp
    leaq    -8(%r10), %rsp
    ret
uninformedCall(std::vector<Big, std::allocator<Big> >&):
    leaq    8(%rsp), %r10
    andq    $-32, %rsp
    pushq   -8(%r10)
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r12
    pushq   %r10
    pushq   %rbx
    leaq    -304(%rbp), %rbx
    movq    %rdi, %r12
    subq    $280, %rsp
    movq    %rbx, %rdi
    call    opaqueFunction()
    subq    $256, %rsp
    movq    %rbx, %rsi
    movl    $32, %ecx
    movq    %rsp, %rdi
    rep movsq
    movq    %r12, %rdi
    call    opaqueAdd(std::vector<Big, std::allocator<Big> >&, Big)
    addq    $256, %rsp
    leaq    -24(%rbp), %rsp
    popq    %rbx
    popq    %r10
    popq    %r12
    popq    %rbp
    leaq    -8(%r10), %rsp
    ret
The resulting lines 6 to 9, showing registers ymm4 to ymm1 initialized to a, b, c, d through stack addresses, prove that the argument addendum to add was copied to the stack for the call. addByCR shows what we expect: the additions are performed directly from memory all the time, but there is no copying (see the lines that follow the pattern vaddpd 128(%rsi), %ymm0, %ymm0, with constants 160, 192, 224). Is this good? Before answering, notice that there are two differences between passing by value and by const reference in this case of big arguments:
  • There is at least some copying to the stack versus no copying,
  • and the compiler may decide to use registers locally to do the work (as shown in lines 6 to 9), despite the arguments being in memory, in the case of passing by value but not in the case of passing by reference. That is, with large types, passing by value is actually a hybrid between theoretical passing by value and passing by const reference.
What about the other 32 integers in the member otherStuff? Those are 128 bytes that must be copied to the stack, right? This is what passing by const reference tries to avoid; it has got to be expensive.

We don’t yet see anything about otherStuff because the callee does not care about it, so let us study what the compiler did for two kinds of callers:
  • when the caller knows the implementation of the function, informedCall,
  • and when it does not, uninformedCall.
informedCall calls add and knows its implementation, as well as the implementation of the copy constructor for Big, which lets it perform the inlining we see, where those 128 bytes are not copied; all that is done with memory is to load into registers the members a, b, c, d produced by opaqueFunction, in lines 63, 64, 66 and 68. In the ‘uninformed’ case, lines 105 to 107 show the copy of the whole value, 32 8-byte chunks (this uses the move-string operation, which roughly translates to while(rcx--) { *rdi++ = *rsi++; } with pointed-to sizes of 8; remember the register rsi contains the address of the addendum). But the informed call did not copy everything. Thus, passing by value a type that respects value semantics to a function whose implementation is known by the compiler suffers from neither unnecessary copying nor aliasing! This is especially important because the great majority of template function calls are such that the caller has the implementation available!

This observation can be leveraged in practice very easily:
  • In template implementations residing in headers that users will #include, if the copy constructor is sane (for example, it does not have side effects), you can be almost certain that at most the things that matter will be copied, and only after the compiler determines the copy to be necessary or advantageous. I have gotten confident about passing by value to implementation-known templates because I am very strict about the copy constructors I program --and lazy!, since I very seldom use anything but the default!--
  • If you want to keep your implementation secret, before falling back to const referencing, you can ask:
    • If the function messes around with all or almost all of the argument, then pass by value, since the copy construction prevented by const referencing will be done piecemeal anyway in the body
    • If the function will use only a small subset of the fields, isn’t the function telling you that its interface may be wrong, because it receives too much?
      • If the function is indeed receiving the whole tropical rainforest when it only requires a banana you have a problem.
      • But sometimes your interface can legitimately require the whole tropical rainforest because whether you only use a banana may be an implementation detail of no concern to the user, or something that may change in the future. Let us explore this case in more detail, because there may be a way:
Let’s say you have a big type, WholeKitchen, and your design legitimately requires your secretSauce function to receive it all. Furthermore, you do not want to provide the implementation either. This is the traditional signature one would arrive at:
Something secretSauce(const WholeKitchen &kitchen);
And it may be fine, but could this be acceptable?
Something notAsSecretASauce(Stove st, Water w, TomatoContainer tc, Oil o, Salt s);

inline Something secretSauce(WholeKitchen kitchen) {
    auto &k = kitchen;
    return notAsSecretASauce(k.stove, k.water, k.tomatoes, k.oliveOil, k.salt);
}
That way you can call secretSauce repeatedly, never suffer aliasing, decimate the chances of false sharing, be assured the value won’t change, and keep the option to change the implementation without breaking user code. It works because, as illustrated above, the inlining will extract or copy in the most advantageous way the members actually used in the function that does the work. Observe also that the extra work performed to pass by value may actually improve performance: the split between secretSauce and notAsSecretASauce did not introduce any performance penalty, but it allows the compiler to optimize for the members stove, …, salt. Furthermore, if some type in notAsSecretASauce should be expensive to copy, the same technique could be applied, potentially ending up with only small, fully optimized parameters, simply as a side effect of letting the compiler know more.

Summarizing, for value semantics types and templates, one ought to ask oneself, “do I really want to pass by const reference?”

I discovered these facts a couple of years ago, and got serious about them especially after I read “Want Speed? Pass by Value” by Dave Abrahams --the original is not available anymore, but the Wayback Machine has a copy. It’s not a done deal: the language semantics today sometimes conspire against passing by value, and there are other issues. The thing is that I have gained experience aggressively switching from the traditional passing by const reference to passing by value, and I can say without a doubt that I’ve gotten code that is much easier to understand, because nothing beats passing by value with regards to clarity, especially since it is of great assistance in reducing mutability, and it performs better.

When I have had to keep const references, it is usually because of code I can’t change that relies on side effects: my function arguments need to reflect changes made elsewhere or indirectly.

I have developed an aversion to code that unnecessarily uses references and mutability; the two styles used to be equivalent in practice, but not anymore. I don’t think I am alone in this, so if you want to write not just future-proof but present-proof code, be more strict about your use of references and mutability.

Also, C++11 dramatically improves the support for passing by value; all of this gives us plenty to talk about in the following articles.
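As a taste of those improvements, consider the C++11 “sink argument” idiom: a parameter taken by value can be moved from, so a caller passing a temporary pays only a move, while a caller passing an lvalue pays the one copy it would have needed anyway. A minimal sketch (the Person type is made up for illustration):

```cpp
#include <string>
#include <utility>

// A "sink" parameter: take the string by value, then move it into place.
// Temporaries are moved all the way in; lvalues are copied exactly once,
// into the parameter, and moved from there.
class Person {
    std::string name_;
public:
    explicit Person(std::string name) : name_(std::move(name)) {}
    const std::string &name() const { return name_; }
};

// Usage:
//   Person a("Ada");          // temporary argument: moved, never copied
//   std::string n = "Grace";
//   Person b(n);              // lvalue: one copy into the parameter, then a move
//   Person c(std::move(n));   // caller opts in to the move explicitly
```

Compare this with the pre-C++11 pair of a const-reference constructor plus an internal copy: the by-value signature expresses the same intent with less code and never copies more than necessary.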

It may be easy to misinterpret a point of this article as saying that passing by value works better if things are inlined aggressively. Aggressive inlining might be detrimental to performance; the warning not to be penny-wise on passing by value and dollar-foolish on over-inlining needs mentioning. What I have seen is that passing by value and value semantics reinforce each other: with value semantics the copy constructor tends to be the default bitwise copy, which allows the compiler to reason about the members and do very clever optimizations. As always, inline only what makes sense.
