Wednesday, September 27, 2017

Moved to GitHub

I have been writing new material on GitHub:

The Zoo at GitHub

Lately I have concentrated more on developing usable libraries than on discussing software engineering topics.

You may check these links:

Integer Undefined Behavior

Tail recursion issues with Clang

How I made what may be the fastest single-core, general-purpose-processor implementation of ranking and comparing Poker hands:

Pokerbotic design documents

Writing documents that mix code with text in Markdown is so much easier that I think I will continue to write most of the material on GitHub.

Sunday, October 9, 2016

constexpr offsetof, a practical way to find the offset of a member in a constexpr

constexpr function to find the offset of a data member

I’ve been working on high-performance code and needed to know at compile time whether the offset of a member from the beginning of its containing object is exactly a multiple of the SIMD vector size.

The standard C library macro offsetof was swept into C++. For something as simple as a concrete structure and member, we can get the offset:

#include <cstddef>

struct ConcreteAggregate {
   double member1;
   int concreteMember;
};

static_assert(8 == offsetof(ConcreteAggregate, concreteMember), "");

However, offsetof is extremely limited; this slight change, moving "member1" to a concrete base class, is already too much for it:

#include <cstddef>

struct Base { double member1; };
struct Derived: Base { int concreteMember; };

static_assert(8 == offsetof(Derived, concreteMember), "");

Fails to compile. offsetof requires the type to be of standard layout, which already makes it useless for C++ metaprogramming.

Besides, the macro requires the name of the member. In templates you don’t know the names of the members; what you can have is their addresses as pointers to members.

There’s a simple trick that would give the answer: if we had an object of the given type, we could compare its address to the address of the member:

template <typename T, typename MT, MT T::*MPtr>
std::size_t offset_of(const T &value) {
    return (char *)&(value.*MPtr) - (char *)&value;
}

std::size_t fun(const Derived &d) {
    return offset_of<Derived, int, &Derived::concreteMember>(d);
}

Will “work”, but alas, not in a constexpr: it turns out casting is forbidden in constant expressions. Besides, that function is not really very useful, since it requires a value of the given type. It should not, because for the purposes of determining the offset of a member, what does it matter where the value is? And a null pointer cannot be dereferenced, not even in a sort of unevaluated context such as a constexpr.

I’ve been trying to find a way to solve this issue in a standard-compliant way, and I am almost certain there is none, because standard C++ allows no conversion that would give us the “char pointers” needed to count the bytes.

However, I think I have devised a portable solution: through a sleight of hand, the code tricks the compiler into using the C++98 rules for constant expressions, which allowed these casts. Because current compilers still compile C++98, in practice this trick should work:

namespace detail {

template<typename T> struct declval_helper { static T value; };

template<typename T, typename Z, Z T::*MPtr>
struct offset_helper {
    using TV = declval_helper<T>;
    char for_sizeof[
        (char *)&(TV::value.*MPtr) -
        (char *)&TV::value
    ];
};

}

template<typename T, typename Z, Z T::*MPtr>
constexpr int offset_of() {
    return sizeof(detail::offset_helper<T, Z, MPtr>::for_sizeof);
}

The three keys in this code are:

  1. Using old-style constant expressions in the declaration of offset_helper::for_sizeof
  2. Creating our own declval equivalent since we want to use the value
  3. Using sizeof in the constexpr function

Friday, September 2, 2016

Even safer bitfields

Safer bitfields

I was reading “Preshing on Programming”'s entry on bitfields; he mentions the lack of runtime checks for overflow on normal bitfields, and sets about dealing with the problem by creating a template and some preprocessing glue to implement the run-time checks.

I thought this effort was valuable and interesting. The bitfield feature of C++ is very hostile to the usual template programming: for example, in a template, how do you get the total number of bits in two adjacent members of a bitfield structure? Bitfield structures in general are not very useful. One of my fundamental tenets about how to get performance and reliability is to express sophisticated invariants in code. The size of a member of a bitfield, for example, is something that should be available to templates, but it is not, which leads to very dangerous ad-hoc code to use them. This is another instance of the missing introspection support in C++.

John Lakos advocates offering your library to users in several versions, versions with the same semantics except for performance, so that they can trade performance for more thorough checking of correct usage. This is very useful for development and debugging. Using the normal templates of C++, it is trivial to introduce a tiny bit of conditional compilation, with very clear semantics, to enable or disable these run-time checks. Since bitfield structs are hostile to templates, you can’t easily turn safety checks on or off if you use them.

Preshing’s implementation of the supporting BitFieldMember is sound for the purpose of runtime safety checks, and it would be straightforward to adapt it to disable the checks for no performance loss. However, there is an important guarantee absent from his preprocessing macros: that the offset of a field is exactly the sum of the sizes of the preceding fields. Another feature I wanted to support is more compile-time introspection (reflection).

Here is the template that supports the rest of the bitfield work I made:

#include <array>

template<typename T, unsigned... S> struct base {
    constexpr static std::array<unsigned, sizeof...(S)> sizes = { S... };
    constexpr static unsigned displacement(unsigned ndx) {
        return ndx ? sizes[ndx - 1] + displacement(ndx - 1) : 0;
    }
    T value;
};

So far, just a template that gives you a choice of the integer type used to hold the bitfield, an array of field sizes, and a constexpr function, displacement, to tally the sizes of the predecessors.

We can now use Preshing’s BitFieldMember template, for example:

union manually_done_bitfield {
    using element_t = long;
    using base_t = base<long, 4, 3, 5>;
    BitFieldMember<element_t, base_t::displacement(0), 4> fourBits;
    BitFieldMember<element_t, base_t::displacement(1), 3> threeBits;
    BitFieldMember<element_t, base_t::displacement(2), 5> fiveBits;
};

The manual parts, setting the bit sizes to match the declarations of the members and keeping the progression of indices, can be automated using Boost preprocessor sequences:

// Generic macros not specific to this article
#define UNTUPLE_2_1(a, b) a
#define UNTUPLE_2_2(a, b) b
#define PP_SEQ_TUPLE_2_2(s, d, element) UNTUPLE_2_2 element
#define PP_SEQ_ENUM_UNTUPLE_2_2(r, data, e) ,UNTUPLE_2_2 e

// Macros actually used from boost seq preprocessing
#include <boost/preprocessor/seq/for_each_i.hpp>
#include <boost/preprocessor/seq/for_each.hpp>

// Macro to declare a member
#define PP_SEQ_I_MAKE_BITFIELD(r, data, i, element) BitFieldMember<element_t, base_t::displacement(i), UNTUPLE_2_2 element> UNTUPLE_2_1 element;

// The actual union defining macro
#define SAFE_BITFIELD(name, type, fields)\
    union name {\
        using element_t = type;\
        using base_t = base<type BOOST_PP_SEQ_FOR_EACH(PP_SEQ_ENUM_UNTUPLE_2_2, ~, fields)>;\
        BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_MAKE_BITFIELD, ~, fields)\
    }

// The example above expressed as a boost seq
#define example ((fourBits, 4))((threeBits, 3))((fiveBits, 5))

SAFE_BITFIELD(automatic, long, example);

Now, beautifully, we can use static_assert to our hearts’ content:

static_assert(7 == automatic::base_t::displacement(2), "");

From here, there are ways to map the fields to their indices, etc.

Thursday, June 16, 2016

News to me of higher costs of passing by value

BOOST preprocessing guide not written yet

TL;DR: smart pointers unavoidably imply one extra indirection as arguments to non-inlined functions

Recently I was teaching a class about smart pointers in which I said unique_ptr normally does not have any space overhead, and I tried to prove it by showing the assembler for dereferencing one, but the assembler was not what I expected! Later I realized there is an important cost associated with all smart pointers that I did not know about, at least in the Intel/AMD64 world.

Normally, when returning or passing a value of a class type, the value will be passed in registers. For example, consider this code, a precursor to an implementation of “yet another shared ownership pointer”: there is a concept of a reference-counted box that contains the actual value, and the smart pointer points to that:

template<typename T> struct yasop_precursor {
    T &operator*() { return box->real_value; }
private:
    struct Box { long ref_count; T real_value; };
    Box *box;
};

long dereference(yasop_precursor<long> yp) { return *yp; }

The generated assembler for the non-inline function dereference is as expected:

dereference(yasop_precursor<long>):
    mov     rax, QWORD PTR [rdi+8]
    ret

That is, simply take the value of the function argument yp, which is a pointer, and offset it by 8 bytes to skip the ref_count.

However, if the only change is to put a custom destructor:

template<typename T> struct yasop {
    T &operator*() { return box->real_value; }

    ~yasop();
private:
    struct Box { long ref_count; T real_value; };
    Box *box;
};

The assembler changes:

dereference(yasop<long>):
    mov     rax, QWORD PTR [rdi]
    mov     rax, QWORD PTR [rax+8]
    ret

Now the smart pointer is not passed in a register but by reference (rdi points to the smart pointer, which in turn points to the reference-counted box). The argument is still a copy of whatever was given to the function dereference, but the function receives not the value of the copy but a reference to it; therefore, it requires one extra indirection to use.

Thinking about it, it almost makes sense: normally, the ABI for passing by value is such that if the struct or class value fits in registers, it will be passed in registers; if it is too big, a copy is made and the address of the copy is passed instead. This is what we see in the first dereference, for yasop_precursor: its values only contain a box pointer, 8 bytes, which of course fits into one register, hence it is passed in a register. However, if the class has a custom destructor, to be able to destroy the copy implied by passing by value, its address has to exist, regardless of what the destructor does. Since there will be a copy of the value (on the stack) anyway, the ABI designers decided the argument may as well be passed by reference, as stated in the specification, chapter 3.1.1.1, first item.

The same argument applies to custom copy constructors: there is no copy constructor that takes a value (otherwise the copy constructor would be called for its own argument!). Since the copy constructor must receive a reference, all values of types with custom copy constructors are also downgraded from passing in registers to passing by reference.

There is an important note to bear in mind: the standard allows eliding the copy of function arguments when they truly are temporaries, even if the copy constructor or destructor has side effects; that is, as if the copy never happened.

For this reason it is all the more important that functions receiving smart pointers be inlined. Also, passing smart pointers by value is of dubious benefit.

Note: what I said above does not imply that smart pointers always cost one extra indirection, only when they are passed as arguments or returned from functions. The overhead of most smart pointers has been measured at no worse than 1% of runtime in real-life applications.

Wednesday, January 6, 2016

Achieving the ultimate date continued, good interpretations

Hello, Google wanderer; let me disappoint you, because once more this article, although it has to do with speed and dates, won't help you find your soul mate, and especially not how to understand her/him...

Ah, BOOST preprocessing, as explained before, will be postponed once more.

Another inexpressible thing


It turns out there is no way to tell a C++ compiler that it can see the same bits as values of two different types. reinterpret_cast and C-style casts that reinterpret the value are simply not allowed in constexprs. But we have unions, right? For example, I thought for this code
union bits64 {
    long integer;
    double floating;

    constexpr bits64(double init) noexcept: floating(init) {}
    constexpr bits64(long init) noexcept: integer(init) {}
};
that integer and floating could be used interchangeably; furthermore, for ages my code has done exactly that, with significant gains in clarity and performance. However, what that code really means is that you can set either member and read the one you set; if you set one and read the other, that’s undefined behavior. IN YOUR FACE! In particular in constexprs, where the compiler can track which member was set last, the so-called active member, it blows up into an error. I am discovering that unions are discriminated. If you need persuasion that this is the case, check this code:
union bits64 {
    long integer;
    double floating;

    constexpr bits64(double init): floating(init) {}
    constexpr bits64(long init): integer(init) {}

    constexpr operator long() const { return integer; }
};

constexpr bits64 fromLong{long(0)};

static_assert(0 == long(fromLong), "");

constexpr bits64 fromDouble = 0.0;
static_assert(0 == long(fromDouble), "");
There will be a compilation error: fromDouble.integer is accessed when the compiler is sure that the last union member set (the active member) is a different one.

I don’t know what there is to gain by such a definition of unions. I mean, things come to mind: we could extend tools such as Valgrind to keep track of the last union member set and check its usage. That’s all good, but it does not require a language rule to force it; and what about the legitimate uses of undiscriminated unions?

There is a good example of wanting an alternative interpretation for the same bits: a date type such as the one described above has trivial conversions to and from integer, if the user does not care about endianness:
union date_record {
    struct {
        uint8_t day, month;
        int16_t year;
    };
    int32_t integer;

    constexpr bool operator<(date_record other) const noexcept {
        return integer < other.integer;
    }
};

bool compare(date_record dr) {
    return dr.day + dr.month * 256 + dr.year * 65536 == dr.integer;
}
compare will always return true in little-endian arithmetic (such as AMD64). This fact can be used to simplify many operations. I implemented an integer comparison which, under the assumption of little-endianness, is unbeatable in terms of economy of programming effort, clarity, and runtime performance; precisely the kind of thing a good language should encourage.

Using an alternative interpretation of the same bits through the union feature is very well-defined "undefined behavior" in all compilers I’ve ever used.

Let us see another example, an increment-by-one-month operation (this implementation is not complete, because adding one month to a January 30 would need to change the day component too; I omit that for simplicity of exposition):
union date_record {
    struct {
        uint8_t day, month;
        int16_t year;
    };
    int32_t integer;

    constexpr bool operator<(date_record other) const noexcept {
        return integer < other.integer;
    }
};

date_record incrementMonthAggregate(date_record dr) {
    if(dr.month < 12) { ++dr.month; }
    else { dr.month = 0; ++dr.year; }
    return dr;
}

date_record incrementMonthInteger(date_record dr) {
    date_record rv;
    rv.integer =
        dr.month < 12 ?
        dr.integer + 256 :
        (dr.integer & 0xFFFF00FF) + 0x10000;
    return rv;
}
Notice that both functions do essentially the same work, so programming one, even for constexprs, would make the other superfluous, except for the apparently arbitrary rule that unions are discriminated. But since that rule is in the language, GCC honors it and I don’t know how to turn it off; so, if full compile-time reasoning is desired, there is no recourse but to implement every function as many times as there are alternative interpretations. I thought that was the end of it, and I had my article nicely tied up.

However, being thorough, I checked the assembler. It turns out the two are sufficiently different: in GCC, 20 assembler lines versus 8; in Clang, 18 versus 12 (notice that Clang’s assembler for the integer operation, despite being slightly longer, is superior to GCC’s because it has no conditional jump). It took me a while to figure out the difference (GCC case):
incrementMonthAggregate(date_record):
        movl    %edi, %edx ;; edx <- dr
        movl    %edi, %esi ;; esi <- dr
        movzbl  %dh, %ecx  ;; ecx <- dr.month
        sarl    $16, %esi  ;; esi = dr.year
        cmpb    $11, %cl   ;; if(dr.month <= 11)
        movl    %esi, %eax ;;
        jbe     .L5        ;; { eax <- dr.year; goto .L5 }
        leal    1(%rsi), %eax ;; eax <- 1 + dr.year
        xorl    %ecx, %ecx ;; ecx <- 0
        movb    %cl, %dh   ;; dr.month = 0
        sall    $16, %eax  ;; eax <- (1 + dr.year) << 16
        movzwl  %dx, %edx  ;; edx <- dr.month * 256 + dr.day
        orl     %edx, %eax ;; eax <- dr
        ret
.L5:
        addl    $1, %ecx   ;; ecx <- 1 + dr.month
        sall    $16, %eax  ;; eax <- dr.year << 16
        movb    %cl, %dh
        movzwl  %dx, %edx  ;; edx <- dr.month * 256 + dr.day
        orl     %edx, %eax ;; eax <- dr
        ret
incrementMonthInteger(date_record):
        movq    %rdi, %rax
        cmpb    $11, %ah
        jbe     .L10
        andl    $-65281, %edi
        leal    65536(%rdi), %eax
        ret
.L10:
        leal    256(%rdi), %eax
        ret
Contrary to my expectations, quite simply all three compilers (Intel’s, GCC, Clang) fail to de-aggregate the day, month, year representation into a consolidated integer, so they do the superfluous work of splitting the value into components and resynthesizing it.

Now that we know there actually is a runtime performance difference, we have to make sure we use the best interpretation for each operation. I wish the compiler would let me use the interpretation I find easiest for implementing an operation and then give me the equivalent in the best interpretation, because I am never in the mood to do the compiler’s job while programming; but I can’t anymore, there is performance to be gained.

In summary, discriminated unions mean that in practice compile-time functions have to respect the active member, leading to programming several versions of exactly the same operations, and to more complications in indicating to the end programmer which version to use at runtime. IMO, this is not acceptable.

Tuesday, January 5, 2016

Achieving the ultimate date

If you stumbled upon this article from a search engine, sorry to disappoint, but this is not about dating, nor about how to find your soul mate. It is about the puerile pursuit of the ultimate implementation of computer program values that hold dates.

The intellectual fertility of implementing the ultimate Date class


In the article lists of function call arguments we introduced an idiom to configure a class.

Part of the motivation for the technique is that lists of function call arguments are not directly expressible in C++, especially the very useful case of partial parameter lists; thus, we wanted to explore, for the special case of boolean parameters, whether it was feasible to pack them into a C++ type. Fortunately, we saw that it was possible without loss of performance, and potentially with a performance gain. As a matter of fact, I now recommend not using boolean parameters at all, but grouping them into their own set of flags. To make the idiom more practical, I introduced supporting BOOST preprocessing macros to take care of the boilerplate.

In my outgoing remarks I mentioned in passing that the technique could be generalized to arbitrary lists of function call arguments, provided their types respected value semantics.
 
This is one aspect that has kept me busy researching. The technique works; however, in implementing it and its supporting BOOST preprocessing macros, I encountered obstacles of different natures:
  1. The compilers for reasons not known to me translate essentially the same code in different ways
  2. The language has a few rules that conspire against highly performing and easy to understand code
  3. I reached hard limitations of BOOST preprocessing
  4. Some compilers may fail to perform optimizations that GCC can.
I thought a class for dates was simple enough for illustration purposes and non-trivial enough that the idioms would be applicable.

Offline, a friend pointed out a deficiency of the boolean-flags idiom presented: it cannot force a complete set of flags to initialize a class. That is, the expression of complete sets of flags; in the example we have been using, constructing a Font only if the FontOptions has all its flags set. This problem has a solution that I suspect is original.

Generalizing the conversion of boolean parameters into a consolidated aggregate type for any value-semantic argument list turned into a through and through research project, because the straightforward solution, which is expressible, still leads to very interesting aspects. In that regard, it seems the conceptualization I arrived at is at the leading edge of the capabilities of the language, compilers and supporting techniques. This makes for very interesting writing material.

In this article, I’ve decided to simply introduce the aspects, to be explained in depth in successive articles. Yet again, I am forced to postpone explaining BOOST preprocessing; actually, I am afraid this time I had to use BOOST preprocessing techniques I don’t understand too well, and beyond that, there are techniques that I don’t look forward to explaining because they are too weird, but I’ll try.

 

Complete flag sets


Let us start with recapitulating the solution for the sets of options to create a font:

#include <boost/preprocessor/seq/for_each_i.hpp>
#include <boost/preprocessor/cat.hpp>

#define PP_SEQ_I_DECLARE_ENUMERATION_VALUE(r, dummy, i, e) e = (1 << i),
#define PP_SEQ_I_FLAG_SET_MEMBER_FUNCTIONS(r, name, i, e)\
    constexpr name e() const noexcept { return value | internal::e; }\
    constexpr name BOOST_PP_CAT(not_, e)() const noexcept { return value & ~internal::e; }\
    constexpr bool BOOST_PP_CAT(is_, e)() const noexcept { return value & internal::e; }

#define BOOLEAN_FLAG_SET(name, flags)\
    class name {\
        struct internal { enum: unsigned char {\
            EMPTY = 0,\
            BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_DECLARE_ENUMERATION_VALUE, ~, flags)\
        }; };\
        unsigned char value = 0;\
        constexpr name(unsigned v) noexcept: value(v) {}\
    public:\
        name() = default;\
        name(const name &) = default;\
        BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_FLAG_SET_MEMBER_FUNCTIONS, name, flags)\
        constexpr bool operator==(name other) const noexcept { return value == other.value; }\
        constexpr bool operator!=(name other) const noexcept { return !(*this == other); }\
    }

We can have a class Font that requires a FontOptions to be constructed, like this:

BOOLEAN_FLAG_SET(FontOptions, (bold)(italic)(underlined)(strike_through));

class Font {
    public:

    Font() = default;
    Font(const Font &) = default;

    inline constexpr Font(FontOptions fo) noexcept;
};
 

Power set covariant return type template


This example suggests that the only way to make a new Font is by first having a FontOptions. However, nothing prevents a Font from being constructed with a FontOptions not completely specified. Preventing that may be an important feature in other examples. Without further ado, I present the solution I came up with, potentially a new idiom that I will call “the power-set covariant return type template”, first in its hand-coded form:

namespace zoo {
    
template<unsigned FlagsSet> class FOI {
protected:
    enum Flags { BOLD = 1, ITALIC = 2, UNDERLINED = 4, STRIKE_THROUGH = 8 };
    unsigned char flagsSet = 0;

    explicit constexpr FOI(unsigned integer) noexcept: flagsSet(integer) {}

    constexpr unsigned flagVal(bool v, unsigned flag) const noexcept {
        return v ? (flagsSet | flag) : (flagsSet & ~flag);
    }

    template<unsigned> friend class FOI;

public:
    FOI() = default;
    FOI(const FOI &) = default;

    constexpr FOI<FlagsSet | BOLD> bold(bool v) const noexcept {
        return FOI<FlagsSet | BOLD>(flagVal(v, BOLD));
    }

    constexpr FOI<FlagsSet | ITALIC> italic(bool v) const noexcept {
        return FOI<FlagsSet | ITALIC>(flagVal(v, ITALIC));
    }

    constexpr FOI<FlagsSet | UNDERLINED> underlined(bool v) const noexcept {
        return FOI<FlagsSet | UNDERLINED>(flagVal(v, UNDERLINED));
    }

    constexpr FOI<FlagsSet | STRIKE_THROUGH> strike_through(bool v) const noexcept {
        return FOI<FlagsSet | STRIKE_THROUGH>(flagVal(v, STRIKE_THROUGH));
    }

    constexpr bool getBold() const noexcept { return flagsSet & BOLD; }
    constexpr bool getItalic() const noexcept { return flagsSet & ITALIC; }
    constexpr bool getUnderlined() const noexcept { return flagsSet & UNDERLINED; }
    constexpr bool getStrikeThrough() const noexcept { return flagsSet & STRIKE_THROUGH; }
};

}

template<unsigned fs> using FontOptions = zoo::FOI<fs>;

struct Font {
    bool b, i, u, s;
    Font(FontOptions<15> o): b(o.getBold()), i(o.getItalic()), u(o.getUnderlined()), s(o.getStrikeThrough()) {}
};

Font makeFont() {
    return  FontOptions<0>().bold(true).underlined(false).italic(true).strike_through(true);
}

Go ahead and try to build a Font from anything less than the result of initializing all the flags; the compiler will reject the code.

The technique above lets you:
  1. Lose no performance (if in doubt, construct a test case and inspect the assembler)
  2. Apply the full battery of compile-time reasoning to the values
  3. Set the flags by name
  4. In any order
  5. Have partial sets of flags and pass them around
  6. Enforce that all the flags are set, simply by requiring the signature with the number that represents all bits set
  7. Gain performance from keeping the set consolidated
In summary, greater clarity and performance at the same time.

The name “power-set covariant return type” stems from the fact that the return types of the setters in effect traverse the power set of flags. Remember, the power set of a set is the set of all of its subsets. That is, the return types typically begin with the empty set (no flag has been set) and progress by adding any flag to the current set; the possible paths cover the whole power set.

Now, let us add macros to not repeat the same boilerplate code:
 
#define PP_SEQ_I_DECLARE_ENUMERATION_VALUE(r, dummy, i, e) e = (1 << i),
#define PP_SEQ_I_POWER_SET_SET_FLAG_FUNCTION(r, name, i, e)\
    constexpr name<set | internal::e> e(bool v) const noexcept { return flagVal(v, internal::e); }\
    constexpr bool BOOST_PP_CAT(get_, e)() const noexcept { return value & internal::e; }

#define POWER_SET_COVARIANT_RETURN_TYPE_FLAG_SET(name, flags)\
    template<unsigned set> class name {\
        struct internal {\
            enum {\
                BOOST_PP_SEQ_FOR_EACH_I(\
                    PP_SEQ_I_DECLARE_ENUMERATION_VALUE,\
                    ~,\
                    flags\
                )\
            };\
        };\
        unsigned char value = 0;\
        constexpr name(unsigned v) noexcept: value(v) {}\
        constexpr unsigned flagVal(bool v, unsigned flag) const noexcept {\
           return v ? (value | flag) : (value & ~flag);\
        }\
        template<unsigned> friend struct name;\
    public:\
        name() = default;\
        name(const name &) = default;\
        BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_POWER_SET_SET_FLAG_FUNCTION, name, flags)\
    }

 

But it is just a generalization!


So far, the language, compiler, and preprocessor are cooperating with the idioms. Let’s now generalize the idiom of transforming the non-expressible concept of a boolean argument list (potentially partial) into a configuration set for arbitrary types.

First, the input can’t be just a list of identifiers, because the members will have arbitrary types (this generalization is not for the uninteresting case of all members having the same type). Since the list will be pairs of name and type, and since in preprocessing it makes no difference whether they are pairs, triplets, or larger tuples, let us make them triplets: the name of the internal member variable, the type, and the public interface accessor.

The example I’ve chosen to illustrate the generalization is that of a date, expressed either as an integer or as an aggregate of day, month, year. You’ll see below that I first exemplify the technique by hand-coding an example, and then a fairly complicated set of macros that generates the same thing. Notice that, to be fair, I also use a union to compare the generated assembler against the case in which an alternate representation makes sense.

Again, the code will be published without further explanation.

Investigating date representations such as these, I stumbled upon interesting phenomena not related to preprocessing. Since I do not want to bore the audience with too many preprocessing articles in a row, the next article will be on that subject matter.

#include <boost/preprocessor/seq/for_each.hpp>
#include <boost/preprocessor/seq/for_each_i.hpp>
#include <boost/preprocessor/tuple/elem.hpp>
#include <boost/preprocessor/comparison/equal.hpp>
#include <boost/preprocessor/control/iif.hpp>
#include <boost/preprocessor/comma_if.hpp>

namespace pp {

enum Month: unsigned char {
    JANUARY, FEBRUARY, JULY
};

#define PP_SEQ_MEMBER_OPERATOR(r, representative, op)\
    constexpr bool operator op(const self_type &arg) const noexcept { return representative(*this) op representative(arg); }

#define PP_COMPARABLE_OPERATIONS(type, representative)\
    using self_type = type;\
    BOOST_PP_SEQ_FOR_EACH(PP_SEQ_MEMBER_OPERATOR, representative, (==)(!=)(<)(<=)(>)(>=))

#define PP_EXPAND(...) __VA_ARGS__
#define MEMBERF(return_name, args) constexpr PP_EXPAND return_name (PP_EXPAND args) const noexcept

struct DateRecord {
    DateRecord() = default;
    DateRecord(const DateRecord &) = default;
    DateRecord &operator=(const DateRecord &) = default;

    constexpr operator int() const noexcept {
        using u8 = unsigned char;
        return d_ + (u8(m_) << 8) + (y_ << 16);
    }

    PP_COMPARABLE_OPERATIONS(DateRecord, int)
protected:
    union {
        struct {
            unsigned char d_;
            Month m_;
            short y_;
        };
        int integer;
    };

    constexpr DateRecord(unsigned char d, Month m, short y) noexcept:
        d_(d), m_(m), y_(y)
    {}
};

struct Date;

struct DI: DateRecord {
    using DateRecord::DateRecord;

    constexpr DI year(short y) const noexcept { return DI(d_, m_, y); }
    constexpr DI month(Month m) const noexcept { return DI(d_, m, y_); }
    constexpr DI day(unsigned char d) const noexcept { return DI(d, m_, y_); }

    constexpr inline Date date() const noexcept;
};

struct Date: DateRecord {
    Date() = default;
    Date(const Date &) = default;

    constexpr Date(DI initializer) noexcept: DateRecord(initializer) {}
};

constexpr Date DI::date() const noexcept {
    return Date(*this);
}

constexpr Date Apollo11Landing = DI().month(JULY).day(20).year(1969);
constexpr auto UnixEpoch = DI().year(1970).month(JANUARY).day(1).date();

static_assert(Apollo11Landing < UnixEpoch, "No!");

}

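As a sanity check, here is (approximately) what PP_COMPARABLE_OPERATIONS expands to, written by hand on a small self-contained struct (Demo and its member are just illustrative names, not part of the library above) so it can be tried without Boost:

```cpp
// Hand-written equivalent of PP_COMPARABLE_OPERATIONS(Demo, int):
// every operator compares the "representative" conversion of both operands
struct Demo {
    int value;

    constexpr operator int() const noexcept { return value; }

    using self_type = Demo;
    constexpr bool operator==(const self_type &arg) const noexcept { return int(*this) == int(arg); }
    constexpr bool operator!=(const self_type &arg) const noexcept { return int(*this) != int(arg); }
    constexpr bool operator<(const self_type &arg) const noexcept { return int(*this) < int(arg); }
    constexpr bool operator<=(const self_type &arg) const noexcept { return int(*this) <= int(arg); }
    constexpr bool operator>(const self_type &arg) const noexcept { return int(*this) > int(arg); }
    constexpr bool operator>=(const self_type &arg) const noexcept { return int(*this) >= int(arg); }
};

static_assert(Demo{1} < Demo{2}, "");
static_assert(Demo{2} >= Demo{2}, "");
```

The member operators are exact matches for Demo operands, so they win overload resolution over the built-in integer comparisons reachable through the conversion operator.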
// The following six macros are "framework" style macros that perform
// useful work in many situations
#define PP_SEQ_METACALL(r, macro, e) macro e
#define PP_SEQ_I_ENUM_METACALL(r, macro, i, e) BOOST_PP_COMMA_IF(i) macro e

#define PP_EXPAND(...) __VA_ARGS__
#define PP_EMPTY()
#define PP_DEFER(...) __VA_ARGS__ PP_EMPTY()
#define PP_DEFERRED_BOOST_PP_SEQ_FOR_EACH_I() BOOST_PP_SEQ_FOR_EACH_I

// Now, the macros specific to the implementation of this idiom
#define PP_IMPL_DECLARE_IDENTIFIER(identifier, type, name) type identifier

/// Used for initializer lists to define constructors
#define PP_IMPL_INITIALIZER(name, type, fullname) name(name)

/// Simply add a semicolon to the declaration
#define PP_IMPL_DECLARE_MEMBER(name, type, fullname) PP_IMPL_DECLARE_IDENTIFIER(name, type, fullname);

/// Implementation detail of PP_SEQ_I_PSEUDO_CONSTRUCTOR: fragment of the setter signature
#define PP_IMPL_PSEUDO_CONSTRUCTOR_FRAGMENT(name, type, fullname) fullname(type arg) const noexcept

/// Converts the given tuple into an argument copy or \c arg depending on
/// whether the index is the same
#define PP_IMPL_SEQ_I_PSEUDO_COPY(r, index, i, e)\
    BOOST_PP_COMMA_IF(i) BOOST_PP_IIF(BOOST_PP_EQUAL(index, i), arg, BOOST_PP_TUPLE_ELEM(3, 0, e))

/// Copies all the members except the member with the given index, which will be \c arg
#define PP_IMPL_PSEUDO_COPY(members, index)\
    PP_DEFER(PP_DEFERRED_BOOST_PP_SEQ_FOR_EACH_I)()(PP_IMPL_SEQ_I_PSEUDO_COPY, index, members)

/// This is the setter that returns a copy modified
#define PP_IMPL_SEQ_I_PSEUDO_CONSTRUCTOR(r, name_members, i, e)\
    constexpr BOOST_PP_TUPLE_ELEM(2, 0, name_members) PP_IMPL_PSEUDO_CONSTRUCTOR_FRAGMENT e {\
        return BOOST_PP_TUPLE_ELEM(2, 0, name_members) (\
            PP_IMPL_PSEUDO_COPY(BOOST_PP_TUPLE_ELEM(2, 1, name_members), i)\
        );\
    }

// Now, macros intended to be used by the end programmer
#define PP_IMPL_DECLARE_MEMBERS(members)\
    BOOST_PP_SEQ_FOR_EACH(PP_SEQ_METACALL, PP_IMPL_DECLARE_MEMBER, members)

#define PP_VALUES_CONSTRUCTOR(name, members)\
    constexpr name(BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_ENUM_METACALL, PP_IMPL_DECLARE_IDENTIFIER, members)) noexcept:\
    BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_ENUM_METACALL, PP_IMPL_INITIALIZER, members) {}

#define PP_PSEUDO_CONSTRUCTORS(name, members)\
    PP_EXPAND(BOOST_PP_SEQ_FOR_EACH_I(PP_IMPL_SEQ_I_PSEUDO_CONSTRUCTOR, (name, members), members))

#define PP_CONFIGURATOR(name, members)\
    class name {\
        protected:\
        PP_IMPL_DECLARE_MEMBERS(members)\
        PP_VALUES_CONSTRUCTOR(name, members)\
        public:\
        name() = default;\
        name(const name &) = default;\
        PP_PSEUDO_CONSTRUCTORS(name, members)\
    }

#define MEMBERS ((d_, unsigned char, day))((m_, pp::Month, month))((y_, short, year))
PP_CONFIGURATOR(dateI, MEMBERS);

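For reference, and for readers without Boost.Preprocessor at hand, PP_CONFIGURATOR(dateI, MEMBERS) generates, modulo formatting, the hand-written class below; the enum is repeated at global scope here only so the sketch stands alone:

```cpp
// Repeated from the pp namespace above so this sketch is self-contained
enum Month : unsigned char { JANUARY, FEBRUARY, JULY };

// What PP_CONFIGURATOR(dateI, MEMBERS) expands to, written by hand
class dateI {
protected:
    unsigned char d_;
    Month m_;
    short y_;
    // PP_VALUES_CONSTRUCTOR: parameters deliberately shadow the members,
    // d_(d_) initializes the member from the parameter
    constexpr dateI(unsigned char d_, Month m_, short y_) noexcept:
        d_(d_), m_(m_), y_(y_) {}
public:
    dateI() = default;
    dateI(const dateI &) = default;
    // PP_PSEUDO_CONSTRUCTORS: each setter returns a copy with one member replaced
    constexpr dateI day(unsigned char arg) const noexcept { return dateI(arg, m_, y_); }
    constexpr dateI month(Month arg) const noexcept { return dateI(d_, arg, y_); }
    constexpr dateI year(short arg) const noexcept { return dateI(d_, m_, arg); }
};
```

Note that because the default constructor is defaulted on its first declaration, dateI() value-initializes, zeroing the members; that is what makes constexpr chains starting from dateI() valid constant expressions.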
struct date: dateI {
    date() = default;
    date(const date &) = default;

    constexpr date(dateI d) noexcept: dateI(d) {}

    MEMBERF((operator int),()) {
        using u8 = unsigned char;
        return d_ + (u8(m_) << 8) + (y_ << 16);
    }

    PP_COMPARABLE_OPERATIONS(date, int)
};

constexpr date Apollo11Landing = dateI().month(pp::JULY).day(20).year(1969);
constexpr date UnixEpoch = dateI().year(1970).month(pp::JANUARY).day(1);

static_assert(Apollo11Landing < UnixEpoch, "No!");

constexpr dateI dt = dateI().month(pp::JULY).day(20).year(1969);

PP_CONFIGURATOR(
    fulltime,
    ((n_, unsigned, nanos))
    ((s_, unsigned short, seconds))
    ((m_, unsigned char, minutes))
    ((h_, unsigned char, hours))
    ((d_, unsigned char, day))
    ((mo, pp::Month, month))
    ((y_, short, year))
);

constexpr fulltime ft = fulltime().nanos(2134).hours(3).minutes(22).month(pp::FEBRUARY).year(2015).seconds(33).day(3);

fulltime pairoft[2];
static_assert(sizeof(pairoft) == 24, "");

Tuesday, December 29, 2015

The path to the innermost nature of things: metaprogramming

Metaprogramming

I once had the privilege of attending a presentation by Stephen Dewhurst on the subject of metaprogramming. It is from that event that I learned metaprogramming was not designed into C++ but discovered. That is, our predecessors, trying to squeeze performance out of the language, developed idioms that have the compiler go beyond merely translating to object code: it also performs useful computation and inserts the results into the object code, so that they do not have to be computed at run time. These techniques grew into a new way to program. C++'s type system is so powerful that it is Turing complete; that is, it can be used to program anything. This does not mean that doing so is practical, and especially not that it is easy.

Expressing relationships

There are times when metaprogramming is helpful not just for moving computation from run time to compilation time, but for computation that improves software quality. Most software elements have many relationships to other software elements, and better programming languages allow expressing and maintaining ever more abstract relationships. This is my main interest in metaprogramming: improving software quality by expressing sophisticated relationships between software elements. In the last two articles I introduced Boost preprocessing with techniques aimed at expressing relationships between software elements that cannot be expressed within the pure language; on reviewing those articles I realized I should have begun with relationships that can be expressed, hence I allow myself to postpone once again the description of how to use Boost preprocessing.
By way of example, let us discuss the data structure of a singly linked list. Leaving aside the operations, the data layout may be:
template<typename T> struct SListNode { T value; SListNode *link; };
template<typename T> struct SList { SListNode<T> *head; unsigned size; };
However, this can be wasteful. On today’s architectures a pointer is typically 8 bytes long; if the element type T is as small as a single byte then, because of very important alignment considerations, the size of the node will be 16 bytes. That is a 16-to-1 ratio of storage to payload…

Type inferences

Can this be implemented better? Most architectures dereference a pointer plus an offset as fast as a plain pointer: *pointer performs as well as *(pointer + offset). This is the case on AMD64, and both Intel's 32-bit architecture and AMD64 have another addressing mode with the same performance, "Scale Index Base": *(pointer + index*element_size + offset) performs as well as *pointer, provided element_size is 1, 2, 4, or 8. Incidentally, on intel32/AMD64 the fastest way to multiply by a constant such as 5 (or 3 or 9) is to "load the effective address" of register + 4*register, for example lea rbx, [rax + 4*rax].
If we are not going to hold more than 4 giga elements, using a full pointer for the link is wasteful, so it may be practical to create a memory arena for our lists, just a large array of nodes. Then, our node could be represented like this:
template<typename T> struct SListNode {
    static SListNode *arena; ///< a class-member common to all lists of this type T
    T value;
    unsigned arenaIndexOfNext;

    SListNode &next() { return arena[arenaIndexOfNext]; }
    unsigned currentIndex() const noexcept { return this - arena; }
};
If the total number of nodes is less than 2^16 (so we won’t have more than 2^16 indices) and the size of the type T is 2 bytes, our current implementation, which uses an unsigned (typically 4 bytes), forces the node to be 8 bytes when it really requires 4. Where I am going with this line of reasoning is that the optimal type for the index depends on the size of the type T and the maximum number of nodes to be supported. Conceivably, we could manually implement the structure for each possible combination:
  1. 1-byte elements, up to 256 nodes: use a byte for the index
  2. up to 2-byte elements and fewer than 2^16 indices: use two bytes for the index
  3. up to 4-byte elements: use a four-byte index
  4. more than 4-byte elements: use a pointer

Generic programming

And this is in the case we want to keep the value and the link together. If we chose separate arenas for the values and for the links represented as indices, we would have the whole Cartesian product of sizes: element type size < 2, < 4, < 8, >= 8 crossed with total number of elements < 2^8, < 2^16, >= 2^16, that is, 12 different cases that are essentially the same code, just substituting the appropriate types and constants. The language has templates, with both type and non-type parameters, to code the mechanism (a singly linked list through arenas and indices) once. In that sense it is generic: the implementation does not care about the details of the parameters.
To use the generic implementation the programmer still has to make a choice: which index type best suits the number of possible indices and the type of the elements; as the preceding list shows, there is a clear way to select it. Since neither the element type nor the maximum number of indices is known when writing the code that selects the index type, that code will in effect reason about elements given to the program at a later time; that is, it works with elements of the program itself, hence metaprogramming.
So, generic programming is when the specific types (or constants) don’t matter; metaprogramming is when the parameters to templates are decided based on other types the programmer gives; of course each shades into the other to some degree. Alexander Stepanov, the person with the largest share of the credit for the invention of Generic Programming, said in an interview that he “[believes] it is necessary to design a programming language from the ground up to enable generic programming in a consistent way”. It is safe to say that the higher one goes into generic abstraction, that is, into proper metaprogramming territory, the harder the inconsistencies become to overcome.
This raises the question: is it worth fighting this accidental complexity on top of the already huge inherent complexity? I would say yes. The inherent complexity, being able to program in a generic way, is the ultimate way to make software: to make something generic, one needs to understand it to its core, so the inherent complexity of generic programming is that of capturing the innermost nature of what is being programmed. With regard to the accidental complexity, I draw from my personal experience: I have had to exert myself fully to get around the inconsistencies. I have used techniques from functional programming, logic programming, artificial intelligence, and mathematics, more mathematics, and even more mathematics. I have reached a much higher understanding of what programming is all about thanks to my interest in achieving genericity, and I have actually succeeded at getting great practical results after it all. Think of all the frustrations that had to be overcome to program the STL so that it could be approved into the 1998 standard, and then look at the monumental accomplishment: to name one example, C++ is, to my knowledge, the only language that has a managed array type (std::vector) with essentially zero performance overhead over unmanaged arrays; and in the 17 years and counting since it became part of the standard, other languages still “don’t get it”; Java and C#, for example, have “generics” which are but pathetic syntactic sugar.
That is, the fighting itself has been an excellent school.
I find the subject so fascinating that it motivates me to make community out of sharing our findings.
Back to the pedestrian didactic example, here is where metaprogramming is helpful:
  1. lets the programmer express the relationships between the element size and the index size (or whether to use pointers),
  2. lets the compiler choose the implementation case that best suits the size of the type and the maximum number of elements,
  3. relieves the programmer from the responsibility of choosing the index (and thus the chance to make a poor choice),
  4. keeps the option for further refinements
I trust the reader to be able to program the generic arena solution given the optimal type for the indices, and to specialize for the case in which it makes more sense to use the standard node structure. For didactic purposes, here is what gives the adequate index type for the arena implementation in which a node is the composition of the element and the index: the template alias meta::Index, which takes a type parameter (what is going to be held in the lists) and the maximum number of nodes for lists of that type. It is defined below:
namespace meta {
    template<unsigned> struct type_size { using type = unsigned long; };
    template<> struct type_size<4> { using type = unsigned; };
    template<> struct type_size<2> { using type = short unsigned; };
    template<> struct type_size<1> { using type = unsigned char; };

    constexpr unsigned powerOf(unsigned base, unsigned exponent) noexcept {
        return exponent ? base*powerOf(base, exponent - 1) : 1;
    }

    constexpr unsigned logCeiling(unsigned base, unsigned v) noexcept {
        return v <= 1 ? 0 : 1 + logCeiling(base, (v + base - 1)/base);
    }

    template<typename T> constexpr T max(T a, T b) { return a < b ? b : a; }

    /// The index occupies the larger of: the element size rounded up to a
    /// power of two (a smaller index would be lost to padding anyway), and
    /// the number of bytes needed to represent MaxIndex, also rounded up
    /// to a power of two
    template<typename ElementType, unsigned MaxIndex> using Index =
        typename type_size<
            powerOf(
                2,
                max(
                    logCeiling(2, unsigned(sizeof(ElementType))),
                    logCeiling(2, logCeiling(256, MaxIndex))
                )
            )
        >::type;
}
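As a sanity check, here is a standalone version of the selector, written so that the four cases listed earlier come out as stated (the byte count from logCeiling(256, …) is itself rounded through logCeiling(2, …) before feeding powerOf), together with static_asserts exercising it:

```cpp
#include <type_traits>

namespace meta {
    template<unsigned> struct type_size { using type = unsigned long; };
    template<> struct type_size<4> { using type = unsigned; };
    template<> struct type_size<2> { using type = short unsigned; };
    template<> struct type_size<1> { using type = unsigned char; };

    constexpr unsigned powerOf(unsigned base, unsigned exponent) noexcept {
        return exponent ? base*powerOf(base, exponent - 1) : 1;
    }

    // ceiling of log_base(v): logCeiling(256, v) is the bytes needed for v values
    constexpr unsigned logCeiling(unsigned base, unsigned v) noexcept {
        return v <= 1 ? 0 : 1 + logCeiling(base, (v + base - 1)/base);
    }

    template<typename T> constexpr T max(T a, T b) { return a < b ? b : a; }

    // index size in bytes: the larger of the element size and the bytes
    // needed for MaxIndex indices, each rounded up to a power of two
    template<typename ElementType, unsigned MaxIndex> using Index =
        typename type_size<
            powerOf(2, max(logCeiling(2, unsigned(sizeof(ElementType))),
                           logCeiling(2, logCeiling(256, MaxIndex))))
        >::type;
}

// 1-byte elements, up to 256 nodes: a byte suffices for the index
static_assert(std::is_same<meta::Index<char, 255>, unsigned char>::value, "");
// 2-byte elements, fewer than 2^16 indices: a two-byte index
static_assert(std::is_same<meta::Index<short, 65535>, unsigned short>::value, "");
// small elements but many indices: a four-byte index
static_assert(std::is_same<meta::Index<char, 1000000>, unsigned>::value, "");
```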
Just a few definitions and the compiler will be able to pick the right index type. Provided you have an implementation of a heap manager, you will have reduced the space overhead of holding elements in lists to its optimum. Since indexing, as opposed to direct dereferencing, carries no penalty, the net effects of generic programming plus metaprogramming are:
  1. genericity lets the two implementations (the normal singly linked list and the arena variant) share one body of code, reducing both the chance of errors from hand-coding each case and the risk of the varieties drifting apart,
  2. the space overhead is minimal,
  3. the increase in space efficiency translates into a net speedup due to more effective caching,
  4. there is no other performance penalty,
  5. at the cost of increased compilation time.
In essence, metaprogramming has been used in this example to express the relationship between the optimal index type, the element type, and the maximum number of nodes to be supported. The benefit of coding this particular relationship is modest; however, the relationships that can be expressed with the mechanisms of metaprogramming are of arbitrary complexity, and they lead to truly great benefits.