Thursday, June 16, 2016

News to me of higher costs of passing by value

TL;DR: smart pointers unavoidably imply one extra indirection as arguments to non-inlined functions

Recently I was teaching a class about smart pointers, where I said that unique_ptr normally has no space overhead. I tried to prove it by showing the assembler for dereferencing one, but the assembler was not what I expected! Later I realized that all smart pointers carry an important cost I did not know about, at least in the Intel/AMD64 world.

Normally, when a value of a class type is passed or returned, and it is small enough, it travels in registers. As an example, take this code, a precursor to an implementation of “yet another shared ownership pointer”: there is a reference-counted box that contains the actual value, and the smart pointer points to that box:

template<typename T> struct yasop_precursor {
    T &operator*() { return box->real_value; }
private:
    struct Box { long ref_count; T real_value; };
    Box *box;
};

long dereference(yasop_precursor<long> yp) { return *yp; }

The generated assembler for the non-inline function dereference is as expected:

dereference(yasop_precursor<long>):
    mov     rax, QWORD PTR [rdi+8]
    ret

That is, simply take the value of the function argument yp, which is a pointer, and load what is stored 8 bytes past it; the offset of 8 skips over ref_count.
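To double check where the 8 comes from, here is a minimal sketch that mirrors the private Box for T = long. BoxOfLong is a name made up for this illustration, not part of the class above. On the LP64 platforms where this assembler was generated, long is 8 bytes, so real_value sits 8 bytes into the box:

```cpp
#include <cstddef>

// Hypothetical mirror of yasop_precursor<long>::Box, made public for inspection.
struct BoxOfLong {
    long ref_count;
    long real_value;
};

// real_value comes right after ref_count, with no padding in between.
constexpr std::size_t real_value_offset = offsetof(BoxOfLong, real_value);
static_assert(real_value_offset == sizeof(long),
              "real_value immediately follows ref_count");
```

The check is written against sizeof(long) rather than a literal 8, so it also compiles on platforms where long is narrower.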

However, if the only change is to add a custom destructor:

template<typename T> struct yasop {
    T &operator*() { return box->real_value; }

    ~yasop();
private:
    struct Box { long ref_count; T real_value; };
    Box *box;
};

long dereference(yasop<long> yp) { return *yp; }

The assembler changes:

dereference(yasop<long>):
    mov     rax, QWORD PTR [rdi]
    mov     rax, QWORD PTR [rax+8]
    ret

Now the smart pointer is passed not in a register but by reference: rdi points to the smart pointer, which in turn points to the reference-counted box. The argument is still a copy of whatever was given to the function dereference, but the function receives a reference to that copy rather than its value, so it needs one extra indirection to use it.
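Written by hand, the two listings correspond roughly to the two functions below. This is only a sketch of what the callee effectively sees: Box here is a public stand-in for the private one, and both function names are invented for the illustration.

```cpp
// Hypothetical stand-in for the private Box, with T = long.
struct Box {
    long ref_count;
    long real_value;
};

// What the callee effectively receives for yasop_precursor<long>:
// the box pointer itself, in a register. One load reaches the value.
long dereference_in_register(Box *box) {
    return box->real_value;
}

// What the callee effectively receives for yasop<long>:
// the address of the caller's copy of the smart pointer, whose only
// member is the box pointer. Two loads reach the value.
long dereference_by_hidden_reference(Box *const *smart_pointer) {
    Box *box = *smart_pointer;   // first load: fetch the box pointer
    return box->real_value;      // second load: fetch the value
}
```

Both functions return the same value; the second simply has to go through memory once more to get there.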

Thinking about it, it almost makes sense. Normally, the ABI for passing by value is such that if the struct or class value fits in registers, it is passed in registers; if it is too big, it is passed in memory instead. This is what we see in the first dereference, the one for yasop_precursor: its values contain only a box pointer, 8 bytes, which of course fits in one register, hence it is passed in a register. However, if the class has a custom destructor, the copy implied by passing by value must have an address so that it can be destroyed, regardless of what the destructor does. Since the copy will live in memory (on the stack) anyway, the ABI designers decided the argument may as well be passed by reference, as stated in the specification, Chapter 3.1.1.1 first item.
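The condition can be approximated at compile time with standard type traits. The sketch below is my own approximation, not the wording of the ABI: PlainHandle, HandleWithDestructor and may_pass_in_registers are hypothetical names, and the check ignores the size limit, so it is a necessary condition rather than a complete test.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the two classes in this post, reduced to the
// single pointer member that matters for argument passing.
struct PlainHandle {
    long *box;                    // no user-provided special members
};

struct HandleWithDestructor {
    long *box;
    ~HandleWithDestructor() {}    // user-provided, even though it does nothing
};

// Approximation of the rule: a type can travel in registers only if
// copying, moving and destroying it are all trivial.
template<typename T>
constexpr bool may_pass_in_registers() {
    return std::is_trivially_copy_constructible<T>::value
        && std::is_trivially_move_constructible<T>::value
        && std::is_trivially_destructible<T>::value;
}

static_assert(may_pass_in_registers<PlainHandle>(),
              "a bare pointer wrapper is trivial to copy and destroy");
static_assert(!may_pass_in_registers<HandleWithDestructor>(),
              "a user-provided destructor makes the type non-trivial");
```

Note that the empty destructor body is enough to flip the answer, which matches the observation that what the destructor does is irrelevant.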

The same argument applies to custom copy constructors. A copy constructor cannot take its parameter by value (otherwise, passing that parameter would itself require calling the copy constructor!), so it must receive a reference, and a reference needs an object that has an address. Therefore, values of types that have custom copy constructors are also downgraded from passing in registers to passing by reference.

There is an important note to bear in mind: the standard allows eliding the copy of a function argument when the argument truly is a temporary, even if the copy constructor or destructor has side effects; that is, the program behaves as if the copy never happened.
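The elision can be observed by counting constructor calls. This is a minimal sketch with made-up names (Counted, sink and the two helper functions); it assumes a compiler that performs the elision, which is guaranteed since C++17 and is the default behavior of the mainstream compilers before that.

```cpp
// Counts copy and move constructions so that elision can be observed.
struct Counted {
    static int copies;
    static int moves;
    Counted() {}
    Counted(const Counted &) { ++copies; }
    Counted(Counted &&) noexcept { ++moves; }
};
int Counted::copies = 0;
int Counted::moves = 0;

// Takes its argument by value, like dereference does.
void sink(Counted) {}

// Passing a named object: the copy cannot be elided.
int copies_passing_named_object() {
    Counted::copies = 0;
    Counted named;
    sink(named);
    return Counted::copies;
}

// Passing a temporary: the parameter is constructed in place,
// so neither the copy nor the move constructor runs.
int constructions_passing_temporary() {
    Counted::copies = 0;
    Counted::moves = 0;
    sink(Counted());
    return Counted::copies + Counted::moves;
}
```

The elision removes the copy, not the indirection: the parameter is still an object in memory that the callee reaches through its address.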

For this reason it is all the more important that functions that receive smart pointers be inlined. Also, passing smart pointers by value is of dubious benefit.
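As an illustration of the alternatives, here is a sketch that uses std::shared_ptr in place of the toy class from this post. The three function names are invented for the example, and which signature is appropriate depends on whether the callee needs to share ownership.

```cpp
#include <memory>

// By value: the caller makes a copy (a reference count increment and a
// later decrement) and the callee reaches that copy through its address.
long read_by_value(std::shared_ptr<long> p) {
    return *p;
}

// By const reference: no copy is made, but the callee still goes through
// the smart pointer to reach the value.
long read_by_reference(const std::shared_ptr<long> &p) {
    return *p;
}

// When the callee does not take part in ownership, it can receive the
// pointee directly and the smart pointer never crosses the call.
long read_pointee(const long &value) {
    return value;
}
```

A caller holding a std::shared_ptr<long> named p would call the last one as read_pointee(*p).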

Note: none of this implies that smart pointers always cost one extra indirection; the cost appears only when they are passed as arguments or returned from functions. The overhead of most smart pointers has been measured to be not worse than 1% of runtime in real-life applications.