Most of the argument passing is governed by the ABI, the “Application Binary Interface”, which is the convention compiler vendors agree on so that code compiled by one vendor can be used by code compiled by another. Since I am primarily an x86-64 on Linux programmer, I will concentrate almost exclusively on this platform; the most recent ABI can be seen here.
We saw in the article “Aliasing and autovectorization” how aliasing disrupts autovectorization. We will focus now on a simpler case, when passing by const reference versus passing by value leads only to the apparently small difference of using a parameter from memory versus from a register.
Again we have to reduce the real-life complications such as alignment. Previously I used the GCC extension __builtin_assume_aligned, which is also supported by Clang versions 3.5 and higher; however, that extension can only be applied to pointers, and I would like to move away from pointers. I will use GCC’s extension to define SIMD (Single Instruction Multiple Data) vector types, which is also supported by Clang. I don’t recommend using GCC’s SIMD vectors; I think they are actually broken. The situation is better with Clang. This subject will be revisited when discussing the emerging superiority of Clang over GCC. Be that as it may, if the vector size is supported by the platform, SIMD vectors implicitly set the alignment to the vector size. By the way, if you can successfully run SIMD vectors on x86 using GCC’s extension, you can almost rest assured your code will be just as vectorized on any other platform, because x86’s SSE instruction set is probably the hardest to use.

In the same example I’ve been using, this time I forced the basic data type to be a vector of 4 doubles. See this code, which simply adds a value to each element of the vector:
```cpp
#include <vector>

// typedefs v4d as a vector of 32 bytes in which elements are doubles,
// that is, a vector of 4 doubles
using v4d = double __attribute__((vector_size(32)));

void addVector(std::vector<v4d> &v, v4d value) {
    for(auto &e: v) { e += value; }
}

void addVectorByCR(std::vector<v4d> &v, const v4d &value) {
    for(auto &e: v) { e += value; }
}
```

The awesome tool Online GCC Explorer lets us see what GCC 5.2 --among other compilers and versions-- produces with 256-bit vector registers enabled (compiler options -mavx and -mavx2):

```asm
 1  addVector(std::vector<double __vector(4), std::allocator<double __vector(4)> >&, double __vector(4)):
 2          movq    8(%rdi), %rdx
 3          movq    (%rdi), %rax
 4          cmpq    %rdx, %rax
 5          je      .L7
 6  .L5:
 7          vaddpd  (%rax), %ymm0, %ymm1
 8          addq    $32, %rax
 9          vmovapd %ymm1, -32(%rax)
10          cmpq    %rax, %rdx
11          jne     .L5
12  .L7:
13          vzeroupper
14          ret
15  addVectorByCR(std::vector<double __vector(4), std::allocator<double __vector(4)> >&, double __vector(4) const&):
16          movq    8(%rdi), %rdx
17          movq    (%rdi), %rax
18          cmpq    %rdx, %rax
19          je      .L15
20  .L13:
21          vmovapd (%rax), %ymm0
22          addq    $32, %rax
23          vaddpd  (%rsi), %ymm0, %ymm0
24          vmovapd %ymm0, -32(%rax)
25          cmpq    %rax, %rdx
26          jne     .L13
27          vzeroupper
28  .L15:
29          rep ret
```

We can see that there is only a deceptively small difference between passing the addendum by value (addVector) and passing it by const reference (addVectorByCR). Let us explain the generated code for both versions at the same time, since they are almost identical, which was the point:
- The first two lines (2-3 and 16-17) just load the vector’s begin and end pointers into rdx and rax (by the way, this clearly shows that the memory layout of vectors begins with two pointers, the buffer and its end),
- lines 4-5 and 18-19 are simply a check for an empty vector,
- lines 7-11 and 21-26 are the main loop:
    - line 7 adds *rax to ymm0 into ymm1; that is, value was passed by value in ymm0,
    - since the addendum is passed by const reference into addVectorByCR, the first thing is to load *rax into a vector register, ymm0, at line 21,
    - lines 8 and 22 simply make rax point to the next element in the vector,
    - line 23 performs the addition from memory through the pointer rsi; this is the only difference,
    - lines 10-11 and 25-26 just decide whether to loop back,
- lines 13 and 27 are just an ABI artifact. Perhaps I will devote an article to “VZEROUPPER” after some preliminaries of SSE have been exposed.
In addVectorByCR, value will be in the L1 cache for most of the iterations. But what would happen if another thread writes, not even to the actual location of &value, but just to the same cache line as &value? The cache line for &value will be invalidated; thus, the iterating thread will stall while bringing the value of the addendum back into the L1 cache. This stall will have a latency equal to the latency of the cache level at which the line for &value is shared with the modifier, potentially all the way to main memory, which is about 100 times slower than L1. For sure it will be L2 at least (since L1 caches are per-core on x86, the closest level at which the line can be shared is L2), which is around 4 times slower than L1. What if the thing the other thread is writing to is something like an accumulator of all the values in a vector? That value will be changing very often, thus the cache line will be invalidated often and things may run 4, 10, 100 times slower… This problem is referred to as “false sharing”.

This is not a hypothetical scenario; false sharing is illustrated in plenty of good resources available online, including this presentation by Scott Meyers, around minute 8.
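The scenario just described can be sketched in code. This is my illustration, not from the article: a read-only addendum shares a cache line with a counter that another thread hammers, versus a layout where each member gets its own 64-byte line (64 bytes being the usual x86 line size):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// 'value' is never written, but it sits on the same cache line as
// 'counter', which another thread writes constantly; each write
// invalidates the reader's copy of the whole line.
struct SharedLine {
    double value = 1.0;
    std::atomic<std::uint64_t> counter{0};
};

// Padding each member to its own 64-byte line removes the false sharing.
struct SeparateLines {
    alignas(64) double value = 1.0;
    alignas(64) std::atomic<std::uint64_t> counter{0};
};

// Reads 'value' repeatedly while a writer thread keeps dirtying 'counter';
// with SharedLine the reads stall far more often than with SeparateLines.
template<typename Layout>
double sumWithNoisyNeighbor(std::size_t iterations) {
    Layout shared;
    std::atomic<bool> stop{false};
    std::thread writer([&] {
        while(!stop.load(std::memory_order_relaxed)) {
            shared.counter.fetch_add(1, std::memory_order_relaxed);
        }
    });
    double sum = 0;
    for(std::size_t i = 0; i < iterations; ++i) { sum += shared.value; }
    stop.store(true);
    writer.join();
    return sum;
}
```

Timing the two instantiations on real hardware is what exposes the difference; the results themselves are of course identical.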
So, we have another set of cases in which the apparently innocuous difference between passing by value and by const reference may lead to dramatic performance differences.
For small data types that (hopefully) have copy constructors without side effects, it is clear one ought to pass them by value as much as possible.
There is some support to discover both things in templates: the operator sizeof will tell you the size of the type, and a stronger set of conditions on the nature of the copy constructor can be queried through std::is_trivially_copyable; so, with some effort (not illustrated now, but it will be in the future), it is possible to design templates that make an informed decision on whether to accept arguments by value or by const reference.

Having the small sizes covered, let us explore argument passing with increasingly larger types. If a type is sufficiently large, the ABI will prescribe that passing by value be implemented not by register, but through the stack, that is, as a “hidden reference” to the stack. This is illustrated by creating a structure big enough,
Big, and using the same algorithm, add:

```cpp
#include <vector>

// typedefs v4d as a vector of 32 bytes in which elements are doubles,
// that is, a vector of 4 doubles
using v4d = double __attribute__((vector_size(32)));

struct Big {
    int otherStuff[32]; // 32 members of 4 bytes each
    v4d a, b, c, d; // 4 members of 32 bytes each
    Big &operator+=(const Big &v) {
        a += v.a; b += v.b; c += v.c; d += v.d;
        return *this;
    }
};

void add(std::vector<Big> &v, Big addendum) {
    for(auto &e: v) { e += addendum; }
}

void addByCR(std::vector<Big> &v, const Big &addendum) {
    for(auto &e: v) { e += addendum; }
}

Big opaqueFunction();
void opaqueAdd(std::vector<Big> &v, Big value);

void informedCall(std::vector<Big> &v) {
    Big addendum = opaqueFunction();
    add(v, addendum);
}

void uninformedCall(std::vector<Big> &v) {
    Big addendum = opaqueFunction();
    opaqueAdd(v, addendum);
}
```

Which gives this result:
```asm
add(std::vector<Big, std::allocator<Big> >&, Big):
        movq    (%rdi), %rax
        movq    8(%rdi), %rdx
        cmpq    %rdx, %rax
        je      .L6
        vmovapd 136(%rsp), %ymm4
        vmovapd 168(%rsp), %ymm3
        vmovapd 200(%rsp), %ymm2
        vmovapd 232(%rsp), %ymm1
.L3:
        vaddpd  128(%rax), %ymm4, %ymm0
        addq    $256, %rax
        vmovapd %ymm0, -128(%rax)
        vaddpd  -96(%rax), %ymm3, %ymm0
        vmovapd %ymm0, -96(%rax)
        vaddpd  -64(%rax), %ymm2, %ymm0
        vmovapd %ymm0, -64(%rax)
        vaddpd  -32(%rax), %ymm1, %ymm0
        vmovapd %ymm0, -32(%rax)
        cmpq    %rax, %rdx
        jne     .L3
        vzeroupper
.L6:
        rep ret
addByCR(std::vector<Big, std::allocator<Big> >&, Big const&):
        movq    (%rdi), %rax
        movq    8(%rdi), %rdx
        cmpq    %rdx, %rax
        je      .L14
.L12:
        vmovapd 128(%rax), %ymm0
        addq    $256, %rax
        vaddpd  128(%rsi), %ymm0, %ymm0
        vmovapd %ymm0, -128(%rax)
        vmovapd -96(%rax), %ymm0
        vaddpd  160(%rsi), %ymm0, %ymm0
        vmovapd %ymm0, -96(%rax)
        vmovapd -64(%rax), %ymm0
        vaddpd  192(%rsi), %ymm0, %ymm0
        vmovapd %ymm0, -64(%rax)
        vmovapd -32(%rax), %ymm0
        vaddpd  224(%rsi), %ymm0, %ymm0
        vmovapd %ymm0, -32(%rax)
        cmpq    %rax, %rdx
        jne     .L12
        vzeroupper
.L14:
        rep ret
informedCall(std::vector<Big, std::allocator<Big> >&):
        leaq    8(%rsp), %r10
        andq    $-32, %rsp
        pushq   -8(%r10)
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r10
        pushq   %rbx
        movq    %rdi, %rbx
        leaq    -272(%rbp), %rdi
        subq    $256, %rsp
        call    opaqueFunction()
        movq    (%rbx), %rax
        movq    8(%rbx), %rdx
        vmovapd -144(%rbp), %ymm4
        vmovapd -112(%rbp), %ymm3
        cmpq    %rdx, %rax
        vmovapd -80(%rbp), %ymm2
        vmovapd -48(%rbp), %ymm1
        je      .L21
.L19:
        vaddpd  128(%rax), %ymm4, %ymm0
        addq    $256, %rax
        vmovapd %ymm0, -128(%rax)
        vaddpd  -96(%rax), %ymm3, %ymm0
        vmovapd %ymm0, -96(%rax)
        vaddpd  -64(%rax), %ymm2, %ymm0
        vmovapd %ymm0, -64(%rax)
        vaddpd  -32(%rax), %ymm1, %ymm0
        vmovapd %ymm0, -32(%rax)
        cmpq    %rax, %rdx
        jne     .L19
.L21:
        vzeroupper
        addq    $256, %rsp
        popq    %rbx
        popq    %r10
        popq    %rbp
        leaq    -8(%r10), %rsp
        ret
uninformedCall(std::vector<Big, std::allocator<Big> >&):
        leaq    8(%rsp), %r10
        andq    $-32, %rsp
        pushq   -8(%r10)
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r12
        pushq   %r10
        pushq   %rbx
        leaq    -304(%rbp), %rbx
        movq    %rdi, %r12
        subq    $280, %rsp
        movq    %rbx, %rdi
        call    opaqueFunction()
        subq    $256, %rsp
        movq    %rbx, %rsi
        movl    $32, %ecx
        movq    %rsp, %rdi
        rep movsq
        movq    %r12, %rdi
        call    opaqueAdd(std::vector<Big, std::allocator<Big> >&, Big)
        addq    $256, %rsp
        leaq    -24(%rbp), %rsp
        popq    %rbx
        popq    %r10
        popq    %r12
        popq    %rbp
        leaq    -8(%r10), %rsp
        ret
```

Lines 6 to 9 of the listing, showing registers ymm4 to ymm1 initialized to a, b, c, d through stack addresses, prove that the argument addendum to add was copied to the stack for the call. addByCR shows what we expect: the additions are performed directly from memory all the time, and there is no copying (the lines that follow the pattern vaddpd 128(%rsi), %ymm0, %ymm0, with constants 160, 192, 224). Is this good? Before answering, notice that there are two differences between passing by value and by const reference in this case of big arguments:
- there is at least some copying to the stack versus no copying,
- and the compiler may decide to do the work in registers locally (as shown in lines 6 to 9) despite the fact that the arguments are in memory, in the case of passing by value but not in the case of passing by reference. That is, with large types, passing by value is actually a hybrid between theoretical passing by value and passing by const reference.
But what about otherStuff? Those are 128 bytes that must be copied to the stack, right? This is what passing by const reference tries to avoid; it has got to be expensive. We don’t yet see anything about otherStuff because the callee does not care about it, so let us study what the compiler did for two cases of callers:
- when the caller knows the implementation of the function, informedCall,
- and when it does not, uninformedCall.
informedCall calls add, and it knows its implementation, as well as the implementation of the copy constructor for Big, which lets it perform the inlining we see, where those 128 bytes are not copied: all that is done with memory is to load into registers the members a … d produced by opaqueFunction, in the vmovapd loads right after the call to opaqueFunction(). In the ‘uninformed’ case, the sequence ending in rep movsq shows the copy of the whole value, 32 8-byte pieces (this is the move-string operation, which roughly translates to while(rcx--) { *rdi++ = *rsi++; } with pointed-to sizes of 8; remember the register rsi contains the address of the addendum). But the informed call did not copy everything. Thus, passing by value a type that respects value semantics to a function whose implementation is known by the compiler won’t suffer from unnecessary copying nor aliasing! This is especially important because the great majority of template function calls are such that the caller has the implementation available!

This observation can be leveraged in practice very easily:
- In template implementations residing in headers that users will #include, if the copy constructor is sane (for example, it does not have side effects), you can be almost certain that at most the things that matter will be copied, and only after the compiler determines the copy to be necessary or advantageous. I have become confident about passing by value to implementation-known templates because I am very strict about the copy constructors I program --and lazy!, since I very seldom use anything but the default!--
- If you want to keep your implementation secret, before falling back to const references, you can ask:
    - If the function messes around with all or almost all of the argument, then pass by value, since the copy construction prevented by const referencing will be done piecemeal anyway in the body.
    - If the function will use only a small subset of the fields, isn’t the function telling you that its interface may be wrong, because it receives too much?
        - If the function is indeed receiving the whole tropical rainforest when it only requires a banana, you have a problem.
        - But sometimes your interface can legitimately require the whole tropical rainforest, because whether you only use a banana may be an implementation detail of no concern to the user, or something that may change in the future. Let us explore this case in more detail, because there may be a way:
Say your type is the WholeKitchen, and your design legitimately requires your secretSauce function to receive it all. Furthermore, you don’t want to provide the implementation either. This is the traditional signature one would arrive at:

```cpp
Something secretSauce(const WholeKitchen &kitchen);
```

And it may be fine, but could this be acceptable?

```cpp
Something notAsSecretASauce(Stove st, Water w, TomatoContainer tc, Oil o, Salt sa);

inline Something secretSauce(WholeKitchen kitchen) {
    auto &k = kitchen;
    return notAsSecretASauce(k.stove, k.water, k.tomatoes, k.oliveOil, k.salt);
}
```

That way you can call secretSauce repeatedly, never suffer aliasing, decimate the chances for false sharing, be assured the value won’t change, and keep the option to change the implementation without breaking user code. It works because, as illustrated above, the inlining will extract or copy in the most advantageous way the members actually used in the function that does the work. Observe also that the extra work performed to pass by value may actually improve performance: the split between secretSauce and notAsSecretASauce did not introduce any performance penalty, but allows the compiler to optimize for the members stove, …, salt; furthermore, if some type in notAsSecretASauce should be expensive to copy, the same technique could be applied, potentially ending up with only small parameters fully optimized, simply as a side effect of letting the compiler know more.

Summarizing, for value-semantics types and templates, one ought to ask oneself: “do I really want to pass by const reference?”
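As a complement, here is a minimal sketch of mine (the helper names are hypothetical, not from the article) of how sizeof and std::is_trivially_copyable, mentioned earlier, can drive the by-value versus by-const-reference decision in a template:

```cpp
#include <type_traits>

// Hypothetical helper: pass small, trivially copyable types by value,
// anything else by const reference.
template<typename T>
struct Arg {
    using type = typename std::conditional<
        std::is_trivially_copyable<T>::value && sizeof(T) <= 2 * sizeof(void *),
        T,          // cheap to copy: accept it by value
        const T &   // big or non-trivial: accept it by const reference
    >::type;
};

// A template that makes the informed decision automatically;
// T is deduced from the first argument, Arg<T>::type is then computed.
template<typename T>
void addTo(T &accumulator, typename Arg<T>::type value) {
    accumulator += value;
}
```

The 2 * sizeof(void *) threshold is an arbitrary choice for the sketch; the right cutoff depends on the ABI and on measurement.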
I discovered these facts a couple of years ago, and got serious about them especially after I read “Want Speed? Pass by Value” by Dave Abrahams --the original is not available anymore, but the Wayback Machine has a copy. It’s not that it is a done deal: the language semantics today sometimes conspire against passing by value, and there are other issues. The thing is that I have gained experience aggressively switching from the traditional passing by const reference to passing by value, and I can say without a doubt that I’ve gotten code that is much easier to understand, because nothing beats passing by value with regards to clarity, especially since it is of great assistance in reducing mutability; and it performs better.
When I have had to keep const references, it usually is because of code I can’t change that relies on side effects: my function arguments need to reflect changes made elsewhere or indirectly.
I have developed an aversion to code that unnecessarily uses references and mutability; the two styles used to be equivalent, but not anymore. I don’t think I am alone in this; so, if you want to make not just future-proof but present-proof code, be stricter about your use of references and mutability.
Also, C++11 dramatically improves the support for passing by value; all of these things make for lots of topics for the following articles.
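One of those C++11 improvements is the “sink argument” idiom that Abrahams’ article helped popularize; a small sketch of mine (the class is hypothetical, for illustration only):

```cpp
#include <string>
#include <utility>
#include <vector>

// Sink idiom: take the argument by value and move from it.
// Callers passing temporaries pay a move instead of a copy;
// callers passing lvalues pay exactly one copy, as before.
class NameList {
    std::vector<std::string> names;
public:
    void add(std::string name) {           // by value: copies lvalues, moves rvalues
        names.push_back(std::move(name));  // always a cheap move into storage
    }
    std::size_t size() const { return names.size(); }
};
```

The signature stays simple, mutability stays local to the function, and the caller chooses the cost.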
It may be easy to misinterpret a point of this article as saying that passing by value works better if things are inlined aggressively. Aggressive inlining might be detrimental to performance, so the warning not to be penny-wise on the passing by value and dollar-foolish on over-inlining needs mentioning. What I have seen is that passing by value reinforces mutually with value semantics: in value semantics the copy constructor tends to be the default bitwise copy, which allows the compiler to reason about the members and do very clever optimizations. As always, inline only what makes sense.