Tuesday, December 29, 2015

The path to the innermost nature of things: metaprogramming

Metaprogramming

I once had the privilege of attending a presentation by Stephen Dewhurst on the subject of metaprogramming. It is from this event that I learned metaprogramming was not designed into C++ but discovered: our predecessors, trying to squeeze performance out of the language, developed idioms that would have the compiler go beyond merely translating to object code, to also perform useful computation and insert the results into the object code, so that they would not have to be computed at run time. These techniques grew and became a new way to program. C++’s type system is so powerful that it is “Turing complete”, that is, it can be used to program anything. This does not mean that doing so is practical, much less easy.

Expressing relationships

There are times when metaprogramming is helpful not just to bring to compilation time computation that otherwise would be done at run time, but to perform computation that helps improve software quality. Most software elements have many relationships to other software elements, and better programming languages allow the expression and maintenance of more, and more abstract, of those relationships. It is this interest, improving software quality by expressing sophisticated relationships between software elements, that is my main interest in metaprogramming. In the last two articles I introduced BOOST preprocessing, illustrating techniques whose aim is to express relationships between software elements that cannot be expressed within the pure language; reviewing those articles, I realized I should have begun with relationships that can be expressed, hence I allow myself to postpone once again the description of how to use boost preprocessing.
By way of example, let us discuss the data structure of the singly linked list. Not caring about the operations, the data layout may be:
template<typename T> struct SListNode { T value; SListNode *link; };
template<typename T> struct SList { SListNode<T> *head; unsigned size; };
However, this can be wasteful. In today’s architectures a pointer is typically 8 bytes long; if the element type T is as small as a single byte then, because of very important alignment considerations, the size of the node will be 16 bytes. That’s an overhead of 16 to 1…
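To make the overhead concrete, a quick check (a sketch; the exact sizes are implementation-defined, these hold on a typical 64-bit platform):
static_assert(sizeof(SListNode<char>) == 16,
              "1 byte of payload, 7 of padding, 8 of link pointer");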

Type inferences

Can this be implemented better? Most architectures will have the same performance dereferencing a pointer as dereferencing a pointer plus an offset. That is, *pointer performs as well as *(pointer + offset). This is the case on the AMD64 architecture, and on Intel’s 32-bit and AMD64 there is another addressing mode with the same performance, the “Scale Index Base”: *(pointer + index*element_size + offset) performs as well as *pointer provided element_size is any of 1, 2, 4, or 8. Incidentally, on intel32/AMD64 the fastest way to multiply by a constant such as 5 (or 3 or 9) is to “load the effective address” of register + 4*register, for example, lea (%rax,%rax,4), %rbx.
If we are not going to hold more than 4 giga elements, using a full 8-byte pointer as the link is wasteful, so it may be practical to create a memory arena for our lists, just a large array of nodes. Then, our node could be represented like this:
template<typename T> struct SListNode {
    static SListNode *arena; ///< a class-member common to all lists of this type T
    T value;
    unsigned arenaIndexOfNext;

    SListNode &next() { return arena[arenaIndexOfNext]; }
    unsigned currentIndex() const noexcept { return static_cast<unsigned>(this - arena); }
};
If the total number of nodes will be less than 2^16, so that we won’t have more than 2^16 indices, and the size of the type T is 2 bytes, our current implementation of using an unsigned (typically 4 bytes) forces the node to be 8 bytes when it really requires 4. Where I am going with this line of reasoning is that the optimal type for the index depends on the size of the type T and the maximum number of nodes to be supported. Conceivably, we could manually implement the structure for each possible combination:
  1. 1 byte element, up to 256 nodes: use a byte for index
  2. up to 2 byte elements and less than 2^16 indices: use a two-byte for the index
  3. up to 4 byte elements use a four-byte index
  4. more than 4 byte elements use a pointer

Generic programming

And this is in the case we want to keep the value and the link together; if we choose to have separate arenas for the values and the links represented as indices, we have the whole Cartesian product of sizes: for the element type, size < 2, < 4, < 8, >= 8, and for the total number of elements, < 2^8, < 2^16, > 2^16, or 12 different cases that are essentially the same code, just substituting the appropriate types and constants. The language has templates, with both type parameters and non-type parameters, to code the mechanism (singly linked lists through arenas and indices) once. In that sense the code is generic: the implementation need not care about the details of the parameters.
To use the generic implementation the programmer/user still has to make a choice: which index type best suits the number of possible indices and the type of the elements; as explained in the preceding list, there is a clear way to select the right index. Since both the element type and the maximum number of indices are not known when writing the code that selects the right index type, in effect that code will reason about elements that will be given to the program at a later time; that is, it works with elements of the program itself, hence metaprogramming.
So, generic is when the specific types (or constants) don’t matter; meta is when the parameters to templates are decided based on other types the programmer gives; of course, each is the other to some degree. Alexander Stepanov, the person with the largest share of the credit for the invention of Generic Programming, said in an interview that he “[believes] it is necessary to design a programming language from the ground up to enable generic programming in a consistent way”. It is safe to say that the higher one goes into generic abstraction, that is, into proper metaprogramming territory, the harder the inconsistencies become to overcome.
This raises the question: is it worth at all fighting this accidental complexity on top of the already huge inherent complexity? I would say yes, because the inherent complexity (being able to program in a generic way) is the ultimate way to make software: to make something generic, one needs to understand it to its core, thus the inherent complexity of generic programming is the complexity of capturing the innermost nature of what’s being programmed. With regards to the accidental complexity, I draw from my personal experience: I have had to exert myself fully to get around the inconsistencies; I have used techniques of functional programming, of logic programming, of artificial intelligence, and mathematics, more mathematics and even more mathematics. I have reached a much higher understanding of what programming is all about thanks to my interest in trying to accomplish genericity. Surprisingly, I have actually succeeded at getting great practical results after it all. Think about all the frustrations that needed overcoming to program the STL so that it could be approved into the 1998 standard, but then look at the monumental accomplishment… just to name one example, C++, to my knowledge, is the only language that has a managed array type (std::vector) with essentially zero performance overhead over unmanaged arrays… and in the 17 years and counting since it became part of the standard, all other languages still “don’t get it”; for example, Java and C# have “generics” which are but pathetic syntactic sugar.
That is, the fighting itself has been an excellent school.
I find the subject so fascinating that it motivates me to build community out of sharing our findings.
Back to the pedestrian didactic example, here’s where metaprogramming is helpful:
  1. lets the programmer express the relationships between the element size and the index size (or whether to use pointers),
  2. lets the compiler choose the implementation case that best suits the size of the type and the maximum number of elements,
  3. relieves the programmer from the responsibility of choosing the index (and thus the chance for her/him to make a poor choice),
  4. keeps the option for further refinements
I trust the reader to be able to program the generic arena solution given the optimal type for the indices, and to be able to specialize for the case in which it makes more sense to use the standard node structure. For didactic purposes, here is what will give the adequate index type for the arena implementation in which the nodes are the composition of the element and the index: the template alias meta::Index, which takes a type parameter (what is going to be held in the lists) and the maximum number of nodes for lists of that type; it is defined below:
namespace meta {
    template<unsigned> struct type_size { using type = unsigned long; };
    template<> struct type_size<4> { using type = unsigned; };
    template<> struct type_size<2> { using type = short unsigned; };
    template<> struct type_size<1> { using type = unsigned char; };

    constexpr unsigned powerOf(unsigned base, unsigned exponent) noexcept {
        return exponent ? base*powerOf(base, exponent - 1) : 1;
    }

    constexpr unsigned logCeiling(unsigned base, unsigned v) noexcept {
        return v <= 1 ? 0 : 1 + logCeiling(base, (v + base - 1)/base);
    }

    template<typename T> constexpr T max(T a, T b) { return a < b ? b : a; }

    template<typename ElementType, unsigned MaxIndex> using Index =
        typename type_size<
            powerOf(
                2,
                max(
                    // alignment: the index should be as large as the element
                    logCeiling(2, sizeof(ElementType)),
                    // capacity: bytes needed to represent MaxIndex indices
                    logCeiling(2, logCeiling(256, MaxIndex))
                )
            )
        >::type;
}
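To see the alias at work, a few checks (a sketch; these hold on typical platforms where short is 2 bytes and int is 4):
static_assert(sizeof(meta::Index<char, 256>) == 1, "a byte suffices");
static_assert(sizeof(meta::Index<short unsigned, 60000>) == 2, "two bytes");
static_assert(sizeof(meta::Index<int, 100000>) == 4, "four bytes");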
Just a few definitions and the compiler will be able to pick the right index type. Provided you have an implementation of a heap manager, you’d have reduced the space overhead of containing elements in lists to its optimum. Since indexing, as opposed to direct dereferencing, has no penalty, the net effects of generic programming plus metaprogramming are:
  1. Genericity makes the two implementations (the normal singly linked list and the arena version) share one body of code, reducing the chances for the errors of doing things manually and preventing the different varieties from drifting apart,
  2. The space overhead is minimal,
  3. the increase in space efficiency translates into net speedup due to more effective caching
  4. No other performance penalty
  5. Increased compilation time
In essence, metaprogramming has been used in this example to express the relationship between the optimal index type and the element type plus the maximum number of nodes to be supported. The relationship coded here is trivial; however, the relationships that can be expressed with the mechanisms of metaprogramming are of arbitrary complexity, and they lead to truly great benefits.

Wednesday, December 23, 2015

Lists of function call arguments

Another thing not expressible in C++ is lists of arguments to functions; curiously enough, lists of arguments to templates are expressible, in a limited way.

Whenever I have referred to inexpressible things, what I have meant is that one loses the help of the compiler to keep track of details it should be able to keep track of; one is forced to do it manually, even resorting to very dangerous copying and pasting.

I’ve just read, in issue 130 of the e-zine “Overload”, the QM Bite on boolean parameters. The author points out three issues with boolean parameters; for example’s sake, let’s say a function takes three flags to construct a font: bold, italic, underlined.

Without any more information, can you tell what this fragment of code does?

auto userFont = createFont(12, false, true, false, true);

What about this other fragment?:

auto userFont = createFont(12, FontOptions().italic().underlined());

In this article we will work towards implementing a building block that will let us transform lists of boolean arguments into a call as the one just portrayed.

The author of the QM Bite (Matthew Wilson) objects to boolean parameters. The three issues he mentions are:

  1. Since they all have the same type, it is easy to try to set boldness but make a mistake and set italic instead (you’d have to remember the exact position of each parameter)
  2. It will be difficult to ever change the order of the arguments, change their number, etc.
  3. The arguments obfuscate the code: to see what a true argument means, the reader needs to jump to the declaration of the function and manually count positions.

I would add a fourth complaint: the quantity of arguments must be minimized; the fewer the arguments, the easier it is to understand and manage what a function does. It improves quality.

Since lists of function call arguments are not expressible, try to avoid passing individual parameters; instead, group them together into sets. As always, the members of the set must be highly cohesive. If you do this right, you’d have an abstraction improvement that will accelerate your development.

Continuing with the example by Matthew Wilson, imagine you capture the flags from the user: preserving the list of arguments, there is no recourse but to manually copy the values of each of the flags every time something needs propagating. This makes no sense; it is clear that bold, italic, underlined, strike-through, etc. are a set of configuration options. Implement them exactly like that.

I think Mr. Wilson’s advice is no good, “enumeralization” won’t help you any: you incur the extra work of creating synonyms of the type bool (which is not easily supported in C++ either), and you continue to have the problems of remembering the exact order of the arguments, the same inflexibility as to their quantity, etc. All you have gained is the compiler catching when you accidentally set, for example, bold when italics was intended; a dubious trade for the extra work.
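For reference, this is roughly what “enumeralization” looks like (a sketch; the enum names are invented here):

struct Font;

enum class Bold: bool { NO, YES };
enum class Italic: bool { NO, YES };
enum class Underlined: bool { NO, YES };

Font createFont(unsigned size, Bold b, Italic i, Underlined u);

// The compiler now rejects swapped flags, but the call site still depends on
// remembering the exact order and number of the arguments:
// auto f = createFont(12, Bold::NO, Italic::YES, Underlined::NO);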

Here’s a hopefully better solution. Consider if we could get for free the following code, the implementation of the set of options:

struct FontOptions {
    enum Options: unsigned char {
        UNMODIFIED = 0,
        BOLD = 1,
        ITALIC = 2,
        UNDERLINED = 4,
        STRIKE_THROUGH = 8
    };

    unsigned char value = 0;

    FontOptions() = default;
    FontOptions(const FontOptions &) = default;

    FontOptions unmodified() const { return FontOptions(0); }
    FontOptions bold() const { return value | BOLD; }
    bool is_bold() const { return value & BOLD; }
    FontOptions not_bold() const { return value & ~BOLD; }
    FontOptions italic() const { return value | ITALIC; }
    bool is_italic() const { return value & ITALIC; }
    FontOptions not_italic() const { return value & ~ITALIC; }
    FontOptions underlined() const { return value | UNDERLINED; }
    bool is_underlined() const { return value & UNDERLINED; }
    FontOptions not_underlined() const { return value & ~UNDERLINED; }
    FontOptions strike_through() const { return value | STRIKE_THROUGH; }
    bool is_strike_through() const { return value & STRIKE_THROUGH; }
    FontOptions not_strike_through() const { return value & ~STRIKE_THROUGH; }

    // The usual equality, difference operators, perhaps even the relational
    // not implemented for brevity of exposition

private:
    FontOptions(unsigned v): value(v) {}
};

then creating a font could have this signature: Font createFont(unsigned size, FontOptions options). And it could be called like this:

auto myFont = createFont(12, FontOptions().italic().underlined());

As you can see, instead of suffering from the problems mentioned, this example:

  • shows how easy it is to understand exactly what are the options set,
  • there is no need to remember the order,
  • introducing new options requires no changes to existing call sites, and not even necessarily recompilation,
  • and should you want to change the meaning of any of the options, a simple grep of the desired setter will tell you reliably all the call sites so that you change accordingly.

Very probably inside the Font implementation you’d want to have a set of options. You could use FontOptions exactly as it is inside Font, and communicate internally the whole set of options using FontOptions. It is useful.
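A minimal sketch of that (the Font and createFont bodies here are hypothetical):

struct Font {
    unsigned size;
    FontOptions options; // the whole set travels as one value

    bool needsBoldFace() const { return options.is_bold(); }
};

Font createFont(unsigned size, FontOptions options) { return {size, options}; }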

Observe that the implementation of FontOptions is very repetitive. It is also incomplete: the constructors, setters, etc. are not constexpr noexcept, there is no operator for comparison, etc.; and it is error prone too, at any moment one could have written ‘italic’ meaning ‘bold’. These things are best left to something automated. Unfortunately, we begin with a list of identifiers (bold, …) and, as already explained in the previous article, there is no support in pure C++ for them; we have turned the inexpressible function call argument list into a list of identifiers, thus, as done in the previous article, the next best choice is to use preprocessing:

#include <boost/preprocessor/seq/for_each_i.hpp>
#include <boost/preprocessor/cat.hpp>

#define PP_SEQ_I_DECLARE_ENUMERATION_VALUE(r, dummy, i, e) e = (1 << i),
#define PP_SEQ_I_FLAG_SET_MEMBER_FUNCTIONS(r, name, i, e)\
    constexpr name e() const noexcept { return value | internal::e; }\
    constexpr name BOOST_PP_CAT(not_, e)() const noexcept { return value & ~internal::e; }\
    constexpr bool BOOST_PP_CAT(is_, e)() const noexcept { return value & internal::e; }

#define BOOLEAN_FLAG_SET(name, flags)\
    class name {\
        struct internal { enum: unsigned char {\
            EMPTY = 0,\
            BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_DECLARE_ENUMERATION_VALUE, ~, flags)\
        }; };\
        unsigned char value = 0;\
        constexpr name(unsigned v) noexcept: value(v) {}\
    public:\
        name() = default;\
        name(const name &) = default;\
        BOOST_PP_SEQ_FOR_EACH_I(PP_SEQ_I_FLAG_SET_MEMBER_FUNCTIONS, name, flags)\
        constexpr bool operator==(name other) const noexcept { return value == other.value; }\
        constexpr bool operator!=(name other) const noexcept { return !(*this == other); }\
    }

The use of the macro BOOLEAN_FLAG_SET is very easy, for example:

BOOLEAN_FLAG_SET(FO, (bold)(italic)(underlined)(strike_through));

This declares and defines a class FO that does the same things as FontOptions. I hope you’d agree that the declaration and definition take very little effort and prevent lots of errors. Also, we could improve this so that we also get inserter and extractor operators with minimal effort.

Before explaining the implementation, as a conclusion to this article, I advocate for the following:

  • Because there is no support for lists of arguments, avoid them
  • Strive to minimize the number of arguments in your function calls
  • Grouping parameters into a single aggregate tends to capture useful semantics of the application domain that will speed up development and improve quality
  • Pure C++ offers limited support for expressing the groups of parameters identified; typically, the preprocessor may palliate this deficiency.

The same technique illustrated in this example can be generalized for arbitrary collections of parameters; for example, to build a date:

Date().month(FEBRUARY).year(2015).day(28)

That is, the same pattern can be used, but the setters may take a value argument. Expressing the type of these arguments makes the macro more complicated, hence its exposition will be postponed.
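A sketch of how such a Date could look, with the same copy-and-return style of setters (everything here is hypothetical; the macro-generated version is the part being postponed):

enum Month {
    JANUARY = 1, FEBRUARY, MARCH, APRIL, MAY, JUNE,
    JULY, AUGUST, SEPTEMBER, OCTOBER, NOVEMBER, DECEMBER
};

class Date {
    unsigned year_ = 1970, day_ = 1;
    Month month_ = JANUARY;
public:
    Date() = default;
    Date year(unsigned y) const { Date r(*this); r.year_ = y; return r; }
    Date month(Month m) const { Date r(*this); r.month_ = m; return r; }
    Date day(unsigned d) const { Date r(*this); r.day_ = d; return r; }
};

auto someDate = Date().month(FEBRUARY).year(2015).day(28);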

Also, there is no need to worry about returning by value and potentially doing wasteful copies of returned values: observe that the compiler has full visibility into what is going on, all the intermediate values are temporaries, and the “return value optimization” can be applied recursively from the last call back to the first (in this case from calling day back to Date()); hopefully this will be illustrated with real-world code later. Also notice that this pattern of acting on a value and, rather than changing it, returning a modified copy is a tried and true technique in functional programming languages.

Monday, November 30, 2015

Introspection, preprocessing and boost preprocessing

I was reading our blogspot neighbor Scott Meyers, “The Brick Wall of C++ Source Code Transformation”, where he discusses one of the worst problems in C++: the overreliance on the preprocessor. For a few years now I have been using the preprocessor to palliate the deficiencies of the language’s introspection capabilities. I hope this is another class of legitimate use of the preprocessor, alongside #including headers because there is no concept of modules, and simple management of versions through conditional compilation, as when one asks whether the code is being compiled with G++, #if defined __GNUG__. I am not entirely sure this use is legitimate, but it seems the least bad of the practical options. Let us get to know the animal --monster-- of the preprocessor in more detail, to appreciate its cost this time. In later articles I hope to make good exhibits for the zoo, in which this animal shows its potential.

What is the problem?

The preprocessor is a source code transformation language. The problems are that it is a very poor language, and that inserting a source code transformation layer between the programmer and the compiler makes their lives much harder. Concretely, it makes building tools for the language somewhere between hard and nearly impossible, makes compilation much slower, makes source code much harder to understand, and invites problems.

Headers

Although the language does not support modules, we still need to import their interfaces; that’s why we speak of declarations and definitions. #include is equivalent to copying and pasting the included file into the current file. However, the declarations in the included header are affected by the context carried over, and this opens a can of worms. Let’s say in file header1.h there is this code:
namespace library1 {

// some declarations
#define ITEM product
// more declarations, ITEM is never undefined

}
Later, another header, header2.h, may have this content:
namespace library2 {

// some declarations
struct ITEM {
    void not_inline_function();
    // …
};

// more declarations
}

Then, if a file.cpp has this content:
#include "header1.h"
#include "header2.h"

Every mention of ITEM will in reality refer to product. The program may even compile, since the declaration struct ITEM will be silently converted to struct product in a different namespace, and within that namespace all the mentions of ITEM will be consistently changed to product. It may even link!, if all the uses of ITEM are for data members and inlined functions. However, months later some change happens to the code, somebody uses the member function not_inline_function, and gets stomped with the link error undefined reference to not_inline_function.

Back when people were switching from 32 bits to 64 bits, it was frequent for code to assume a pointer and an integer were of the same size, so there were lots of people who would #define int long and rebuild. Of course, things may appear to work, and the developer may be praised for how fast he did the migration, but things will eventually explode catastrophically because some “who knows what” library function that was not recompiled assumed 32 bits for an integer and got 64.

Perhaps my examples are not good; in any case, the point I am trying to make is that a header and its corresponding binary library must be exactly in sync, otherwise things may seem to work until they explode. Because headers are vulnerable to the context in which they are #included, going through the preprocessor introduces all sorts of nasty possibilities for subtle errors.

Another problem, of a practical nature, is that the compiler is forced to compile the #included headers every time they appear: since the interpretation of a header depends on the exact code that precedes its inclusion, there is no remedy but to do it all over! Note: this is what makes precompiled headers worthless: to guarantee the headers will all have the same meaning, the programmers must put together all the headers they could potentially include into a single ‘universe’ header, which is big, fat, greasy with leaking references to implementation details, couplings to implementation details, and very low cohesiveness; then, in all places where well-thought-out headers would be included, include the universe instead.

Versions

The other legitimate preprocessor case I mentioned, conditional compilation for versioning, suffers from not having a way to guarantee that a prebuilt binary is compatible with the current source code and compilation options, leading in practice to wasteful re-compiling and re-linking; or much worse, to things that apparently work but in reality have catastrophic bugs. Let’s say a.cpp gets compiled assuming struct Foo has a member int member, but because of a subtle interplay of conditional compilation, in b.cpp struct Foo has a member long member. Things may appear to be fine, but the layout of Foo is different: somewhere some code may think it is X bytes long, and some other Y bytes long, and then all bets are off.
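In code, the hazard can be as small as this (a hypothetical header):

// foo.h -- the layout of Foo depends on a compilation option:
#if defined(FOO_BIG_MEMBER)
struct Foo { long member; };
#else
struct Foo { int member; };
#endif

// If a.cpp is built with -DFOO_BIG_MEMBER and b.cpp without, both compile and
// link silently, yet they disagree on sizeof(Foo) and on its layout: an ODR
// violation that no tool is required to diagnose, only runtime corruption.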

These are not new problems; they are actually over forty years old. The culprit is that the language does not offer any way to guarantee that the declarations library users see respect the assumptions made in their compiled/linked binary implementations. It is truly embarrassing, and I’ve not seen any prospects for a solution, not even for C++17, for which some form of modules is being discussed.

Toolability

Because of language deficiencies, we still need to use the preprocessor, and this means that the only tools that can be developed for the language must in some way or other also be compilers. For example, imagine an IDE tool that wants to help with autocompletion of function calls. To be useful, the tool must be able to offer applicable functions that are the result of macro expansions. Thus the tool must be able to preprocess, on top of all the other pure-language requirements. It also must know which macro expansions correspond to which places in the source code. What about function-like macros? When doing autocompletion, is it going to offer the macro or only its expansion?

Introspection

I hope the claim that the preprocessor is a real problem has been substantiated; however, for all of its expressiveness, the language has deficiencies that are covered by the preprocessor, thus it may be legitimate to continue to alleviate those deficiencies through the preprocessor. A consistent set of introspection features is sorely lacking. By introspection I mean what is commonly referred to as “reflection”: ways in which programming constructs can reason about themselves. For example, in Java it is trivial to ask a class for its data and function members; it is possible to know everything there is to know about them. In C++ there is a collection of ad-hoc introspection capabilities, such as whether dynamic_cast of a pointer returns null or not, typeid, the sizeof operator, and the very powerful collection of pattern matching capabilities of templates. For all their might, it is simply not possible to know even the number of data members in a plain structure. It is not possible to know, from within the program itself, the count of enumerations in an enumerated type, nor the identifiers used, nor the mapping of enumerations to integer values, nor their reverse mapping.

The only option to accomplish introspection capabilities is to develop idioms or conventions so that we can plug in some of the capabilities mentioned above.

The use case that led me down the path of using boost preprocessing was the simple need of converting enumerated values to strings. For example:
enum Enumeration {
    VALUE1 = 9,
    VALUE2 = 8,
    VALUE3 = 2
};

How could one implement something like Enumeration::count() to tell you the number of enumerated values, or Enumeration::values(), or Enumeration::identifiers()? This would need to be done manually. Typically, one wants to write something like std::ostream &operator<<(std::ostream &output, Enumeration value);, and this is what we repeatedly waste our time doing:
std::ostream &operator<<(std::ostream &output, Enumeration e) {
    switch(e) {
        case VALUE1: output << "VALUE1"; break;
        case VALUE2: output << "VALUE2"; break;
        case VALUE3: output << "VALUE3"; break;
    }
    return output;
}

It does not look like much work, and it isn’t, but the real problem is that some day somebody will want to add VALUE4 = 13, forget to map it, and then the code won’t work. Also, somebody will just copy and paste, forgetting to change the number:
std::ostream &operator<<(std::ostream &output, Enumeration e) {
    switch(e) {
        case VALUE1: output << "VALUE1"; break;
        case VALUE2: output << "VALUE1"; break;
        case VALUE3: output << "VALUE1"; break;
    }
    return output;
}

The compiler is happy. Your tests will be happy too, but in the critical log entry for your application, whenever VALUE2 appeared you’ll see VALUE1 written and be mystified…

Doctrine

I don’t like guaranteeing relationships between software components manually; it is the same thing as using explicit new and delete instead of smart pointers; it is error prone. I like C++ because it lets me express fairly non-trivial relationships so that the compiler will guarantee them for me. When pure C++ is not sufficient, I try hard to find the least meta-C++ before resorting to doing things manually. This element of my doctrine has served me very well. Even if my choices to represent relationships look very ugly, at least the relationships are expressed, subject to improvement, and handled automatically; over time, things look less ugly, because you learn how to express yourself better, reducing the accidental complexity, and the inherent complexity becomes acceptable once you realize it is necessary.

None of these happen with manual coding of relationships.

In general, I have reached the conclusion that lists of identifiers are inexpressible in pure C++. Thus we have to look beyond it for ways to handle lists of identifiers. Is it possible to express them using the preprocessor?
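As a preview of the next section, boost preprocessing represents a list of identifiers as a “sequence”, (a)(b)(c), that its macros can iterate over; a minimal sketch (STRINGIZE_ONE and STRINGIZE_ALL are names invented here):

#include <boost/preprocessor/seq/for_each.hpp>

#define STRINGIZE_ONE(r, data, e) #e,
#define STRINGIZE_ALL(seq) BOOST_PP_SEQ_FOR_EACH(STRINGIZE_ONE, ~, seq)

const char *identifiers[] = { STRINGIZE_ALL((bold)(italic)(underlined)) };
// expands to: const char *identifiers[] = { "bold", "italic", "underlined", };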

Boost preprocessing

Four years ago I discovered boost preprocessing while getting acquainted with the code of a system new to me: a genius coworker had coded a general solution to declare, define and print enumerations using boost preprocessing; at the same time, before C++11, his solution was capable of expressing enumerations in their own scope and based on user-specified integer types.

My old coworker arrived at choices different from what I will propose, but the idea is essentially the same. Let’s work toward the refined solution I made, illustrating each objective and solution.

Let us begin with the enumeration shown above. To be able to “reason” about that enumeration, at the bare minimum we need something that converts enumerated values to their identifiers as strings. We need little else; these few things can be put together in a structure, the way in which we typically bind together related compile-time constructs. Let us associate the enumerated values to their strings by using a simple std::array:
#include <array>
#include <utility>

struct demo {
    using integral_type = unsigned char;

    enum enumeration {
        VALUE1 = 8,
        VALUE2 = 9,
        VALUE3 = 2
    };

    using pair_type = std::pair<integral_type, const char *>;

    static constexpr unsigned count = 3;

    static constexpr std::array<pair_type, count> values =
    {{
        { VALUE1, "VALUE1" }, { VALUE2, "VALUE2" }, { VALUE3, "VALUE3" }
    }};
};

demo has integral_type, enumeration, pair_type, count and values, which I think serve a very clear role. Pure C++ won’t let us express a relationship between enumeration and values, but there is a way, which we will illustrate momentarily. In any case, with these we can implement any number of interesting introspection capabilities:
  1. Represent enumerated values using the integral_type
  2. All of the value-type services expected: construction, comparison, etc.
  3. It is possible to convert enumerated values to strings and vice versa
  4. Iteration over all the valid values of the enumeration
  5. All of these operations can be implemented to run very fast and potentially as compile-time constructs, that is, as if the language provided these features out of the box.
Let us augment the demo struct with many straightforward features through a template that combines some of the members into useful things. Please observe that indeed we got away with making almost all the things constexpr, noexcept:
#include <unordered_map>

namespace zoo {

template<typename E> struct smart_enumeration: E {
    using typename E::integral_type;
    using typename E::enumeration;

    integral_type code;

    constexpr smart_enumeration() noexcept: code(E::values[0].first) {}

    explicit constexpr
    smart_enumeration(integral_type v) noexcept: code(v) {}
    
    smart_enumeration(const smart_enumeration &) = default;
    
    explicit constexpr
    smart_enumeration(enumeration e) noexcept: code(e) {}

    smart_enumeration &operator=(enumeration e) noexcept
    { code = e; return *this; }

    constexpr operator integral_type() const noexcept { return code; }

    constexpr bool operator==(smart_enumeration e) const noexcept
    { return code == e.code; }
    constexpr bool operator!=(smart_enumeration e) const noexcept
    { return not (*this == e); }

    operator const char *() const {
        static const auto copy = E::values;
            // Note: this copy prevents having to "define" E::values
        static const std::unordered_map<integral_type, const char *>
            mapping(copy.begin(), copy.end());
        auto resultFind = mapping.find(code);
        if(mapping.end() == resultFind) { return nullptr; }
        return resultFind->second;
    }

    bool valid() const { return nullptr != static_cast<const char *>(*this); }
};

}

Note: yeah, none of the constexpr, noexcept, and even const above are superfluous: the compiler does not automatically allow the use of a non-constexpr function in compile-time expressions, nor does it deduce noexcept automatically; and because it was deemed a small mistake of C++11 that constexpr implies const on member functions, in C++14 it doesn’t. It is a shame, since all of these annotations add clutter and are deducible by the compiler.

Note 2: the conversion to string is not implemented as a compile-time function because it uses an unordered_map; with enough effort it is possible to implement a compile-time map, even in C++11, it is just not practical.

Now, the implementation of things like the insertion operator:
#include <ostream>

namespace zoo {

template<
    typename E
> std::ostream &operator<<(std::ostream &output, zoo::smart_enumeration<E> e) {
    const char *ptr = e;
    if(ptr) { output << ptr; }
    return output;
}

}

Let us write a non-inline function that the compiler is forced to translate, to drive enough of the implementations we have, and also a “main” to prove we don’t have linking issues:
#include <iostream>

void drive(zoo::smart_enumeration<demo> d) {
    using se = zoo::smart_enumeration<demo>;
    static_assert(se().code == 8, "");
        // constexpr default constructor
    static_assert(noexcept(se()), "");
        // the default constructor is noexcept
        // Proven the constructor by integral and by enumeration are
        // constexpr and noexcept
    static_assert(noexcept(se(d)), ""); // default copy constructor is noexcept
    static_assert(9 == se(se(demo::VALUE2)).code, ""); // also constexpr

    static_assert(noexcept(se(8) == se(9)), ""); // equality comparison noexcept
    static_assert(noexcept(se(8) != se(9)), ""); // different is also noexcept
    static_assert(!noexcept(static_cast<const char *>(se())), "");
        // however, the conversion to const char * can throw
    std::cout << d << d.valid(); // All compiles
}

int main(int argc, const char *argv[]) { return 0; }

So, if we are able to supply the members that smart_enumeration requires, the same ones we put in demo, then forever we will automatically get all the other stuff, and we can use our proto-introspection to implement lots of other things. For example, the backward map (string to enumeration) implemented in construct_smart_enum(std::string):
#include <string>

namespace zoo {

template<
    typename F, std::size_t N
> std::unordered_map<std::string, F> backward_map(
    std::array<std::pair<F, const char *>, N> argument
) {
    std::array<std::pair<std::string, F>, N> initializer;
    for(unsigned i = argument.size(); i--; ) {
        initializer[i].first = argument[i].second;
        initializer[i].second = argument[i].first;
    }
    return
        std::unordered_map<std::string, F>(
            initializer.begin(), initializer.end()
        );
}

template<typename E> smart_enumeration<E> construct_smart_enum(std::string s) {
    using se = smart_enumeration<E>;
    const static auto copy_for_gcc = E::values;
    const static auto reverted_map = backward_map(copy_for_gcc);
    return se(reverted_map.find(s)->second);
}

}

And the test, that also shows a possible way to use it:
zoo::smart_enumeration<demo> something()
{ return zoo::construct_smart_enum<demo>("VALUE2"); }

There only remains to show the magical incantation that will bind the identifiers to the enumeration values:
PP_SMART_ENUMERATION(
    demo,
    unsigned char,
    ((VALUE1, 8))((VALUE2, 9))((VALUE3, 2))
);

The expansion of that macro call will generate the code we wrote for demo above. This is the prestidigitation I just did in slow motion:
#include <boost/preprocessor/seq/for_each.hpp>
#include <boost/preprocessor/seq/size.hpp>

#define PP_SEQ_META_CALL(r, MACRO, element) MACRO element

#define PP_MAKE_MAP_PAIR(identifier, value) { value, #identifier },
#define PP_MAKE_ENUMERATION_PAIR(identifier, value) identifier = value,

#define PP_SMART_ENUMERATION(name, integer_type, identifiers)\
struct name {\
    using integral_type = integer_type;\
    enum enumeration {\
        BOOST_PP_SEQ_FOR_EACH(PP_SEQ_META_CALL, PP_MAKE_ENUMERATION_PAIR, identifiers)\
    };\
    using pair_type = std::pair<integral_type, const char *>;\
    static constexpr unsigned count = BOOST_PP_SEQ_SIZE(identifiers);\
    static constexpr std::array<pair_type, count> values = {{\
        BOOST_PP_SEQ_FOR_EACH(PP_SEQ_META_CALL, PP_MAKE_MAP_PAIR, identifiers)\
    }};\
}
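The pivotal trick is PP_SEQ_META_CALL; traced on the first element of the sequence, the expansion goes like this (a sketch, not code to compile):

// BOOST_PP_SEQ_FOR_EACH(PP_SEQ_META_CALL, PP_MAKE_ENUMERATION_PAIR, seq)
// invokes, for the first element of the sequence:
//     PP_SEQ_META_CALL(r, PP_MAKE_ENUMERATION_PAIR, (VALUE1, 8))
// whose replacement, MACRO element, juxtaposes the macro name and the tuple:
//     PP_MAKE_ENUMERATION_PAIR (VALUE1, 8)
// which the preprocessor rescans as an ordinary macro invocation, yielding:
//     VALUE1 = 8,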

Not entirely self-explanatory? Of course, and I have not even gone over the things that make this code break, but this is a great opportunity to leave the article on a cliffhanger…

Wednesday, November 4, 2015

Gaining clarity and performance

We will now look into the implementation of several argument passing choices.

Most argument passing is governed by the ABI, the “Application Binary Interface”, which is the convention compiler vendors agree on so that code compiled with one vendor can be used by code compiled with another. Since I am primarily an x86-64 on Linux programmer, I will concentrate almost exclusively on this platform; the most recent ABI can be seen here.

We saw in the article “Aliasing and autovectorization” how aliasing disrupts autovectorization. We will focus now on a simpler case, in which passing by const reference versus passing by value leads only to the apparently small difference of using a parameter from memory versus using it from a register.

Again we have to remove real-life complications such as alignment. Previously I used the GCC extension __builtin_assume_aligned, which is also supported by Clang versions 3.5 and higher; however, that extension can only be applied to pointers, and I would like to move away from pointers. I will use GCC’s extension to define SIMD (Single Instruction Multiple Data) vector types, which is also supported by Clang. I don’t recommend using GCC’s SIMD vectors, I think they are actually broken; the situation is better with Clang. This is a subject that will be revisited when discussing the emerging superiority of Clang over GCC. Be that as it may, if the vector size is supported by the platform, SIMD vectors implicitly set the alignment to the vector size. By the way, if you can successfully run SIMD vectors on x86 using GCC’s extension, you can almost rest assured your code will be just as vectorized on any other platform, because x86’s SSE instruction set is probably the hardest to use.

In the same example I’ve been using, this time I forced the basic data type to be a vector of 4 doubles; see this code that simply adds a value to each element of the vector:
#include <vector>
 
// typedefs v4d as a vector of 32 bytes in which elements are doubles
// that is, a vector of 4 doubles
using v4d = double __attribute__((vector_size(32)));

void addVector(std::vector<v4d> &v, v4d value) {
    for(auto &e: v) { e += value; }
}

void addVectorByCR(std::vector<v4d> &v, const v4d &value) {
    for(auto &e: v) { e += value; }
}
The awesome tool Online GCC Explorer lets us see what GCC 5.2 --among other compilers and versions--, with 256-bit vector registers enabled (compiler options -mavx and -mavx2), produces:
addVector(std::vector<double __vector(4), std::allocator<double __vector(4)> >&, double __vector(4)):
    movq    8(%rdi), %rdx
    movq    (%rdi), %rax
    cmpq    %rdx, %rax
    je  .L7
.L5:
    vaddpd  (%rax), %ymm0, %ymm1
    addq    $32, %rax
    vmovapd %ymm1, -32(%rax)
    cmpq    %rax, %rdx
    jne .L5
.L7:
    vzeroupper
    ret
addVectorByCR(std::vector<double __vector(4), std::allocator<double __vector(4)> >&, double __vector(4) const&):
    movq    8(%rdi), %rdx
    movq    (%rdi), %rax
    cmpq    %rdx, %rax
    je  .L15
.L13:
    vmovapd (%rax), %ymm0
    addq    $32, %rax
    vaddpd  (%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L13
    vzeroupper
.L15:
    rep ret
We can see that there is only a deceptively small difference between passing the addendum by value (addVector) and passing it by const reference (addVectorByCR). Let us explain the generated code for both versions at the same time, since they are almost identical, which was the point:
  • The first two lines (2-3, and 16-17) are just loading the vector begin and end into rdx and rax (by the way, this clearly shows that the memory layout of vectors begins with two pointers, the buffer and its end),
  • Lines 4-5 or 18-19 are simply a check for empty vector
  • lines 7-11 or 21-26 are the main loop:
    • line 7 adds *rax to ymm0, leaving the result in ymm1; that is, value was passed by value in ymm0.
    • Since the addendum is passed by const reference into addVectorByCR, the first thing is to load *rax into a vector register, ymm0, at line 21
    • lines 8 and 22 simply make rax point to the next element in the vector
    • line 23 performs the addition from memory through the pointer rsi, this is the only difference
    • lines 10-11 and 25-26 just decide whether to loop back.
  • lines 13 and 27 are just an ABI artifact. Perhaps I will devote an article to “VZEROUPPER” after some preliminaries of SSE have been exposed.
From the point of view of performance, I would say GCC 5.2 thinks adding from memory is about as fast as adding from a register, and it probably is, since value will be in the L1 cache for most of the iterations. But what would happen if another thread writes, not even to the actual location of &value, but just to the same cache line as &value? The cache line for &value will be invalidated; thus, the iterating thread will stall while bringing the value of the addendum back to the L1 cache, with a latency equal to the latency of the cache level at which the line is shared with the modifier, potentially all the way to main memory, which is about 100 times slower than L1. For sure it will be at least L2 (since in all x86 implementations the L1 cache is exclusive), which is around 4 times slower than L1. What if the thing the other thread is writing is something like an accumulator of all the values in a vector? That value will be changing very often, thus the cache line will be invalidated often, and things may run 4, 10, 100 times slower… This problem is referred to as “false sharing”.

This is not a hypothetical scenario; false sharing is illustrated in plenty of good resources available online, including this presentation by Scott Meyers, around minute 8.
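For the record, the usual mitigation is to give data written by different threads their own cache lines; a sketch (64 bytes being the line size of current x86):

struct alignas(64) PaddedAccumulator { // alignas rounds sizeof up to 64
    double value;
};

PaddedAccumulator accumulators[8]; // each element owns its own cache line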

So, we have another set of cases in which the apparently innocuous difference between passing by value and by const reference may lead to dramatic performance differences.

For small data types that hopefully have copy constructors without side effects, it is clear one ought to pass them by value as much as possible.

There is some support to discover both things in templates: the sizeof operator will tell you the size of the type, and a stronger set of conditions on the nature of the copy constructor can be queried through is_trivially_copyable; so, with some effort (not fully illustrated now, but it will be in the future), it is possible to design templates that make an informed decision on whether to accept arguments by value or by const reference.
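Although the full treatment is postponed, a sketch of how such a decision can look (the threshold and the names are arbitrary here):

#include <type_traits>

// Picks T for small, trivially copyable types, const T & otherwise:
template<typename T> using ArgumentType = typename std::conditional<
    std::is_trivially_copyable<T>::value && sizeof(T) <= 2*sizeof(void *),
    T,
    const T &
>::type;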

Having the small sizes covered, let us explore argument passing with increasingly larger types. If a type is sufficiently large, the ABI will prescribe that passing by value be implemented not by register, but by stack, that is, as a “hidden reference” to the stack. This is illustrated by creating a structure big enough, Big, and using the same algorithm, add.
#include <vector>

// typedefs v4d as a vector of 32 bytes in which elements are doubles
// that is, a vector of 4 doubles
using v4d = double __attribute__((vector_size(32)));

struct Big {
    int otherStuff[32]; // 32 members of 4 bytes each
    v4d a, b, c, d; // 4 members of 32 bytes each

    Big &operator+=(const Big &v) {
        a += v.a; b += v.b; c += v.c; d += v.d;
        return *this;
    }
};

void add(std::vector<Big> &v, Big addendum) {
    for(auto &e: v) { e += addendum; }
}

void addByCR(std::vector<Big> &v, const Big &addendum) {
    for(auto &e: v) { e += addendum; }
}

Big opaqueFunction();

void opaqueAdd(std::vector<Big> &v, Big value);

void informedCall(std::vector<Big> &v) {
    Big addendum = opaqueFunction();
    add(v, addendum);
}

void uninformedCall(std::vector<Big> &v) {
    Big addendum = opaqueFunction();
    opaqueAdd(v, addendum);
}
Which gives this result:
add(std::vector<Big, std::allocator<Big> >&, Big):
    movq    (%rdi), %rax
    movq    8(%rdi), %rdx
    cmpq    %rdx, %rax
    je  .L6
    vmovapd 136(%rsp), %ymm4
    vmovapd 168(%rsp), %ymm3
    vmovapd 200(%rsp), %ymm2
    vmovapd 232(%rsp), %ymm1
.L3:
    vaddpd  128(%rax), %ymm4, %ymm0
    addq    $256, %rax
    vmovapd %ymm0, -128(%rax)
    vaddpd  -96(%rax), %ymm3, %ymm0
    vmovapd %ymm0, -96(%rax)
    vaddpd  -64(%rax), %ymm2, %ymm0
    vmovapd %ymm0, -64(%rax)
    vaddpd  -32(%rax), %ymm1, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L3
    vzeroupper
.L6:
    rep ret
addByCR(std::vector<Big, std::allocator<Big> >&, Big const&):
    movq    (%rdi), %rax
    movq    8(%rdi), %rdx
    cmpq    %rdx, %rax
    je  .L14
.L12:
    vmovapd 128(%rax), %ymm0
    addq    $256, %rax
    vaddpd  128(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -128(%rax)
    vmovapd -96(%rax), %ymm0
    vaddpd  160(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -96(%rax)
    vmovapd -64(%rax), %ymm0
    vaddpd  192(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -64(%rax)
    vmovapd -32(%rax), %ymm0
    vaddpd  224(%rsi), %ymm0, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L12
    vzeroupper
.L14:
    rep ret
informedCall(std::vector<Big, std::allocator<Big> >&):
    leaq    8(%rsp), %r10
    andq    $-32, %rsp
    pushq   -8(%r10)
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r10
    pushq   %rbx
    movq    %rdi, %rbx
    leaq    -272(%rbp), %rdi
    subq    $256, %rsp
    call    opaqueFunction()
    movq    (%rbx), %rax
    movq    8(%rbx), %rdx
    vmovapd -144(%rbp), %ymm4
    vmovapd -112(%rbp), %ymm3
    cmpq    %rdx, %rax
    vmovapd -80(%rbp), %ymm2
    vmovapd -48(%rbp), %ymm1
    je  .L21
.L19:
    vaddpd  128(%rax), %ymm4, %ymm0
    addq    $256, %rax
    vmovapd %ymm0, -128(%rax)
    vaddpd  -96(%rax), %ymm3, %ymm0
    vmovapd %ymm0, -96(%rax)
    vaddpd  -64(%rax), %ymm2, %ymm0
    vmovapd %ymm0, -64(%rax)
    vaddpd  -32(%rax), %ymm1, %ymm0
    vmovapd %ymm0, -32(%rax)
    cmpq    %rax, %rdx
    jne .L19
.L21:
    vzeroupper
    addq    $256, %rsp
    popq    %rbx
    popq    %r10
    popq    %rbp
    leaq    -8(%r10), %rsp
    ret
uninformedCall(std::vector<Big, std::allocator<Big> >&):
    leaq    8(%rsp), %r10
    andq    $-32, %rsp
    pushq   -8(%r10)
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r12
    pushq   %r10
    pushq   %rbx
    leaq    -304(%rbp), %rbx
    movq    %rdi, %r12
    subq    $280, %rsp
    movq    %rbx, %rdi
    call    opaqueFunction()
    subq    $256, %rsp
    movq    %rbx, %rsi
    movl    $32, %ecx
    movq    %rsp, %rdi
    rep movsq
    movq    %r12, %rdi
    call    opaqueAdd(std::vector<Big, std::allocator<Big> >&, Big)
    addq    $256, %rsp
    leaq    -24(%rbp), %rsp
    popq    %rbx
    popq    %r10
    popq    %r12
    popq    %rbp
    leaq    -8(%r10), %rsp
    ret
The resulting lines 6 to 9, showing registers ymm4 to ymm1 initialized to a, b, c, d through stack addresses, prove that the argument addendum to add was copied to the stack for the call. addByCR shows what we expect: the additions are performed directly from memory all the time, but there is no copying (the lines that follow the pattern vaddpd 128(%rsi), %ymm0, %ymm0, with constants 160, 192, 224). Is this good? Before answering, notice that there are two differences between passing by value and by const reference in this case of big arguments:
  • There is at least some copying to the stack versus no copying,
  • and the compiler may decide to use registers locally (as shown in lines 6 to 9), despite the fact that the arguments are in memory, to do the work in the case of passing by value but not in the case of passing by reference. That is, with large types, passing by value is actually a hybrid between theoretical passing by value and passing by const reference.
What about the other 32 integers in the member otherStuff? Those are 128 bytes that must be copied to the stack, right? This is what passing by const reference tries to avoid; it has got to be expensive.

We don’t yet see anything about otherStuff because the callee does not care about it, so let us study what the compiler did for two cases of callers:
  • when the caller knows the implementation of the function, informedCall,
  • and when it does not, uninformedCall.
informedCall calls add, and it knows its implementation, as well as the implementation of the copy constructor for Big, which lets it perform the inlining we see, where those 128 bytes are not copied; all that is done with memory is to load into registers the members a, b, c, d produced by opaqueFunction, in lines 63, 64, 66 and 67. In the ‘uninformed’ case, lines 105 to 107 show the copy of the whole value, 32 8-byte things (this uses the move string operation, which roughly translated means while(rcx--) { *rdi++ = *rsi++; }, with pointed-to sizes of 8; remember the register rsi contains the address of the addendum). But the informed call did not copy everything; thus, passing by value a type that respects value semantics to a function whose implementation is known by the compiler won’t suffer from unnecessary copying nor aliasing! This is especially important because the great majority of template function calls are such that the caller has the implementation available!

This observation can be leveraged in practice very easily:
  • In template implementations residing in headers that users will #include, if the copy constructor is sane, as in it does not have side effects, you can be almost certain that at most the things that matter will be copied, and this only after the compiler determines the copy necessary or advantageous. I have gotten confident about passing by value to implementation-known templates because I am very strict about the copy constructors I program --and lazy!, since I very seldom use anything but the default!--
  • If you want to keep your implementation secret, before falling back to const referencing, you can ask:
    • If the function messes around with all or almost all of the argument, then pass by value since the copy construction prevented by const referencing will be done piece-meal anyway in the body
    • If the function will use only a small subset of the fields, isn’t the function telling you that its interface may be wrong, because it receives too much?
      • If the function is indeed receiving the whole tropical rainforest when it only requires a banana you have a problem.
      • But sometimes your interface can legitimately require the whole tropical rainforest because whether you only use a banana may be an implementation detail of no concern to the user, or something that may change in the future. Let us explore this case in more detail, because there may be a way:
Let’s say you have a big type, WholeKitchen, and your design legitimately requires your secretSauce function to receive it all. Furthermore, you do not want to provide the implementation either. This is the traditional signature one would arrive at:
Something secretSauce(const WholeKitchen &kitchen);
And it may be fine, but could this be acceptable?:
Something notAsSecretASauce(Stove st, Water w, TomatoContainer tc, Oil o, Salt sa);

inline Something secretSauce(WholeKitchen kitchen) {
    auto &k = kitchen;
    return notAsSecretASauce(k.stove, k.water, k.tomatoes, k.oliveOil, k.salt);
}
That way you can call secretSauce repeatedly, never suffer aliasing, decimate the chances for false sharing, be assured the value won’t change, and keep the option to change the implementation without breaking user code. It works because, as illustrated above, the inlining will extract or copy in the most advantageous way the members actually used in the function that does the work. Observe also that the extra work performed to pass by value may actually improve performance, since the split between secretSauce and notAsSecretASauce did not introduce any performance penalty but allows the compiler to optimize for the members stove, …, salt; furthermore, if some type in notAsSecretASauce should be expensive to copy, the same technique could be applied, potentially ending up with only small parameters, fully optimized, simply as a side effect of letting the compiler know more.

Summarizing, for value semantics types and templates, one ought to ask oneself, “do I really want to pass by const reference?”

I discovered these facts about a couple of years ago, and got serious about them especially after I read “Want Speed? Pass by Value” by Dave Abrahams --the original is not available anymore, but the WayBackMachine has a copy. It’s not that it is a done deal: the language semantics today sometimes conspire against passing by value, and there are other issues; the thing is that I have gained experience at aggressively switching from the traditional passing by const reference to passing by value, and I can say without a doubt that I’ve gotten code that is much easier to understand, because nothing beats passing by value with regards to clarity, especially since it is of great assistance in reducing mutability, and it performs better.

When I have had to keep const references, it usually is because of code I can’t change that relies on side effects, where my function arguments need to reflect changes made elsewhere or indirectly.

I have developed an aversion to code that unnecessarily uses references and mutability; it used to be equivalent, but not anymore. I don't think I am alone in this, so, if you want to make not just future-proof but present-proof code, be more strict about your use of references and mutability.

Also, C++11 dramatically improves the support for passing by value; all of these things make for lots of things to talk about in the following articles.

It may be easy to misinterpret a point of this article as saying that passing by value works better if things are inlined aggressively. Inlining aggressively might be detrimental to performance; the warning not to be penny-wise on the passing by value and pound-foolish on over-inlining needs mentioning. What I have seen, from the point of view of passing by value, is that it reinforces mutually with value semantics: with value semantics the copy constructor tends to be the default bitwise copy, which allows the compiler to reason about the members and do very clever optimizations. As always, inline only what makes sense.

Monday, November 2, 2015

Aliasing and autovectorization

In the previous article, “Complications” I mentioned an example of a serial algorithm, incrementAll, that supposedly gets converted by the compiler into some code that operates in parallel.

We will see that such a thing happens with two of the best compiler suites of all time, Clang and GCC.

Modern architectures have vector capabilities. For example, Intel’s x86-64 has the SSE family of extensions, which currently has reached 32 registers of 512 bits (16 single-precision floating point elements each). Most x86-64 processors in use today have sixteen 256-bit registers. I should say Intel’s SSE is the weirdest SIMD instruction set that I know of, and it has severe design errors; at the opportune moment I hope to comment on this in full. But the most popular architecture in the world today has the vector instruction set that it has, and because it performs well despite its design errors, we ought to learn about it.

Modern compilers also know how to automatically transform serial code into vector operations, even with the notoriously hard SSE. Let us see a version of incrementAll more suitable to illustrate autovectorization:
using BigThing = float;

// note: passing by value, not const reference ==> no aliasing
void incrementAll(BigThing *ptr, BigThing value) {
    ptr = reinterpret_cast<BigThing *>(__builtin_assume_aligned(ptr, 32));
        // note: it is not safe to make this assumption in general; done here to
        // prevent the complications of handling the alignment
    for(auto i = 0; i < 1024; ++i) { ptr[i] += value; }
}

There are two changes: an array is passed instead of a vector, which is assumed to be aligned to a 32-byte boundary, and the length is fixed at 1024 (2^10) elements. Still, the source code as written specifies a serial application of += to each element of the array.
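Satisfying the 32-byte alignment assumption at the call site is straightforward; a minimal driver sketch (mine, purely illustrative):

int main() {
    alignas(32) static BigThing data[1024] = {}; // guarantees the assumed alignment
    incrementAll(data, 1.0f);
}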

Using the stupendous tool “Online Interactive Compiler Explorer” by M. Godbolt, we can readily see what GCC 5.2 does with that code, with color highlighting and all; repeated here:
incrementAll(float*, float):
 vbroadcastss %xmm0, %ymm1
 leaq 4096(%rdi), %rax
.L2:
 vaddps (%rdi), %ymm1, %ymm0
 addq $32, %rdi
 vmovaps %ymm0, -32(%rdi)
 cmpq %rdi, %rax
 jne .L2
 vzeroupper
 ret

Which I can translate for you if you want:
  • Line 2 means copy the least significant float of vector register 0 into the 8 positions of register ymm1; xmm0 is obviously the argument value, as the x86-64 ABI specifies it should be (first SSE parameter)
  • Line 3 calculates ptr + 4096 and puts it in scalar register rax (rdi is the first scalar parameter, ptr)
  • Line 5 adds 8 floats at a time from *ptr into ymm0 (xmm0 and ymm0 are the same register, just viewed as 16 or 32 bytes)
  • Since the iterations happen 8 floats at a time, the pointer is incremented by 4*8 = 32 in line 6
  • The result is copied back
  • Lines 8-9 control the loop
  • Line 10 is an artifact of the ABI: it is required to clear the upper 128 bits of all YMM registers before returning
As you can see, only by asking GCC to optimize for speed (-O3) and letting it use AVX, the name of the 256-bit revision of SSE (-mavx, -mavx2), it goes and transforms the serial code into doing 8 floats in parallel.  AVX is not even needed: you may remove those options and you would still get vectorization on 128-bit registers.
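If you want to reproduce this outside of the Compiler Explorer, a command along these lines should do (the file name is mine):

g++ -O3 -mavx2 -S -o incrementAll.s incrementAll.cpp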

Clang is even more radical: it will also unroll the loop four times, that is, each of its iterations consumes 32 floats! This is a digression, but I suspect this unrolling is actually a pessimization: the processor has only so many execution units; I don’t think it is capable of performing four 256-bit-wide vector additions in parallel, but even if it is, the rules of branch prediction say the execution will loop back, which means that as a matter of course the processor itself will pipeline the execution of many iterations simultaneously, in effect executing as many additions in parallel as it is capable of, but without unnecessarily longer assembly code that reduces the effectiveness of the code cache.

If the assumption of alignment at 32 bytes is removed, you can also see the output from ICC, which will also vectorize automatically. So, it is demonstrated that these compilers are capable of automatic vectorization.

Just for kicks, let’s do a very simple change: instead of passing value by value, let us pass it by const reference. Clang simply turns off vectorization.
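For reference, the changed version is just this (same body as before):

void incrementAll(BigThing *ptr, const BigThing &value) {
    ptr = reinterpret_cast<BigThing *>(__builtin_assume_aligned(ptr, 32));
    for(auto i = 0; i < 1024; ++i) { ptr[i] += value; }
}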

GCC still vectorizes, but only in the optimistic case that value is not located within the span of the array:
incrementAll(float*, float const&):
 leaq 4(%rsi), %rdx
 leaq 4096(%rdi), %rax
 cmpq %rdx, %rdi
 jnb .L7
 cmpq %rax, %rsi
 jb .L6
.L7:
 vbroadcastss (%rsi), %ymm1
.L5:
 vaddps (%rdi), %ymm1, %ymm0
 addq $32, %rdi
 vmovaps %ymm0, -32(%rdi)
 cmpq %rax, %rdi
 jne .L5
 vzeroupper
 ret
.L6:
 vmovss (%rdi), %xmm0
 addq $4, %rdi
 vaddss (%rsi), %xmm0, %xmm0
 vmovss %xmm0, -4(%rdi)
 cmpq %rdi, %rax
 jne .L6
 rep ret
  • Line 2 means rdx = &value + 1
  • Line 3: rax = ptr + 1024
  • Lines 4-5 mean if(&value < ptr) { goto L7; } // <- do the vector loop
  • Lines 6-7 mean if(&value < ptr + 1024) { goto L6; } // <- do the scalar loop
  • lines 9-17 are the vectorized loop
  • lines 19-25 are the serial loop
Unlike the code that I suggested in my previous article, you can see that GCC gives up on vectorization if &value is within the span of the array. According to Eric Brumer, a developer of the Microsoft compiler, in his “Compiler Confidential” presentation, the Microsoft compiler will keep trying to vectorize, going as far as performing very complicated case analysis.

Of course one should never pass by const reference what could be passed by value without incurring the cost of a copy, but this shows what happens if you do: the compiler is forced to deal with aliasing, and no matter what, performance is lost, even if only to make the now-necessary checks to detect aliasing.

There is one case in which C++ programmers routinely indicate passing by const reference even when the arguments may be as small as a simple floating point number: in templates, when the actual type is not known at programming time. To prevent problems with potentially very expensive copy constructors, we have been defaulting for more than 20 years to const references (or pointers).

What is the solution?  It is not practical to cover the known cases in which passing by value is superior with template specializations.  The answers I have devised thus far are incomplete, but they lead to code that is both higher performing and easier to understand.  I will be writing about them.
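By way of illustration only, here is one direction, a sketch of my own and not necessarily what those articles will propose: select the parameter type with a trait, so that small, trivially copyable types go by value and everything else falls back to const reference. The size threshold is an arbitrary assumption.

#include <type_traits>
#include <vector>

// small and trivially copyable ==> by value, no aliasing to worry about;
// anything else ==> the traditional const reference
template<typename T> using Parameter = typename std::conditional<
    std::is_trivially_copyable<T>::value && sizeof(T) <= 2 * sizeof(void *),
    T,
    const T &
>::type;

template<typename T> void incrementAll(std::vector<T> &bt, Parameter<T> value) {
    for(auto &v: bt) { v += value; }
}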

Monday, October 26, 2015

Complications

C++ suffers from extremely bad choices for its defaults, and this makes it very hostile to beginners.  It also shows its age.  It is good to have a solid understanding of the complications, to learn to avoid them.  Fortunately, the community has devised idioms and conventions to deal with them.  This won't be a complete list, and it is probably too much to detail the problems and the strategies for their solutions within the same article, but let us start with some examples:
  • The most dangerous feature may be naked pointers.  Unfortunately, their use tends to be encouraged by novices leading the way for other novices, by users of average knowledge, and even by teachers.  This in my opinion is tragic, for several reasons:
    • Naked pointers are a conceptual requirement to understand smart pointers, however, novices should avoid them at all costs until they have become familiar with at least the three most important classes of smart pointers:
      1. unique pointers,
      2. reference counted,
      3. and intrusive.
      But this does not happen in practice. The use of naked pointers leads to the full complexity and propensity for errors of managing memory explicitly, which is a difficult subject at all levels of expertise.  When novices, and also mid-level programmers, use naked pointers, chances are they will mismanage them, because they don't have a complete understanding of what they are doing, corrupting memory and producing programs that crash and misbehave in subtle ways.  As soon as they begin using arrays, dynamic memory, or any of the other things that naturally lead to pointers in C++, and the problems surface, they will get frustrated; the experience will be very unrewarding.
    • What applies to what is easier to teach (stopping at naked pointers as opposed to going all the way into smart pointers) is also reflected in the deceptively simple notation for naked pointers and the tedious, repetitive notation for smart pointers.  Naked pointers have great economy of syntax, despite being
      • extremely hard to use productively,
      • applicable only very narrowly,
      • extremely easy to misuse catastrophically.
    • The greatest feature in all of this language, the implicit destructors that make RAII (Resource Acquisition Is Initialization; returning or freeing a resource is its destruction) possible, is not applicable to naked pointers.  That is, naked pointers lead away from one of the greatest things in C++.
    • Smart pointers, seen from the point of view of standardization, are unacceptably recent additions:
      1. unique_ptr, which can be a zero-performance-overhead smart pointer, was only added to the standard for '11, because it required move semantics to be added to the language itself, and move semantics is an especially subtle concept.
      2. shared pointers (that is, reference counted) were added in the TR1.  Although the boost libraries have had a very good implementation of shared_ptr, using them in practice used to have drawbacks:  shared_ptr requires a lot of compiler template juggling, which used to make compilation much slower; when the programmer made a mistake, the template errors were hard to decipher; etc.
      3. With regards to intrusive pointers (pointers to objects that themselves hold the reference counter), unbelievably, there is no standard intrusive pointer;
      4. auto_ptr was never useful
    It is not my intention to unfairly criticize C++ for these naked/smart pointer problems; actually, the current conceptualization and implementation of smart pointers in C++ is nothing short of a major achievement of human civilization, in my opinion.  As with many other bad things in C++, they arose simply because truly leading-edge concepts and techniques were first possible in C++ and thus were not well understood when they were first implemented; later, the improvement in understanding coexisted with lots of necessarily imperfect usage of those leading-edge concepts and techniques which, given the C++ community's conservatism, needed to continue to be supported, making it extremely hard to change anything.  Be that as it may, C++ still continues to evolve, robustly, continuing to be the first expression of leading-edge concepts and techniques that we are bound to complain about in a decade's time for how much they are holding us back...
  • Almost everything is mutable by default, except compile-time constructs.  I think there shouldn't be a "const" keyword; I mean it should be implicit.  What should be made explicit is the mutability of named things.  Not only is changing the contents of memory just about the most expensive processor operation in terms of performance, it is also orders of magnitude harder to reason about mutable things than immutable ones.  That is, the language should have ways to express different degrees of immutability.  There has been progress in this area, for example "constexpr", but much remains to be done.  This is a multifaceted topic that I'll probably cover over many articles; below you'll see two aspects in high detail, aliasing and false sharing, that deal with mutability and the impossibility of expressing that things don't change.  There are languages in which mutability is avoided in extreme ways, those of the functional programming paradigm.  This is relevant to C++ in two ways:
    1. Metaprogramming is, amazingly, very much functional programming in concepts and practice, especially including the feature of immutability
    2. This language supports almost any sane approach to programming (except introspection, commonly referred to by the misnomer "reflection", which requires substantial effort from the programmer to emulate), so the concepts can be applied. This is a digression, but there is a very interesting presentation by Sean Parent, "Inheritance Is The Base Class of Evil", on how to apply functional programming concepts and techniques in normal C++ that I hope to treat at length some time.
  • On the other hand, and very paradoxically, compile-time constructs are immutable.  I am not advocating that compile-time constructs should be any different, just calling attention to the fact that metaprogramming in C++ does not resemble normal programming at all; the building blocks are fundamentally different.  This makes it hard even for experts to do metaprogramming.  That is, even though it is possible to metaprogram in C++, which proves conclusively the richness and expressiveness of this language, it requires an altogether different set of skills.
  • To further complicate things, metaprogramming in C++ is based on a form of pattern matching that relies on the extremely subtle template specialization and especially confusing overload resolution mechanisms of C++, which effectively and thoroughly obfuscate metaprogramming code.  Again, I don't want to criticize unfairly: if I said C++'s current framework to express smart pointers is nothing short of a major accomplishment of human civilization, I am convinced beyond doubt that C++'s metaprogramming capabilities are too.  C++ is so expressive that metaprogramming support, far from having been a design objective, was actually discovered in the language, then invented and developed over 15 years.
  • Talking of template specializations and overload resolution: while they are extremely powerful features conceptually, their pathetic syntax leads to truly worrisome subtleties.  Determining which specialization is picked by the compiler, or which overload, is probably the hardest thing in programming C++.  It is not just hopelessly hard for automated tools, it is hard even for very experienced programmers!  I wonder how this got to be so hard while the compilers give you no help; I mean that, for example, compilers give you a preprocessing option that shows you the macros expanded; why wouldn't the compiler give you the equivalent code with the specializations and overloads selected?  We will talk more about this in the future.  See "noexcept" below.
  • It bears mentioning how hopeless it is to make C++ tools.  A code editor that will autocomplete a function call? All right: to begin with, is it a global or a member function? Is there a template involved? Are there specializations for that template?... To do it right, the tool must know the types of the arguments involved.  The problem is that to be able to do that, the tool must essentially parse 100% of the features of the language.  In due course I will collect apparently inoffensive code fragments in which it is hard to know what the things involved are.
  • "noexcept" and "constexpr" should be the default.  The "hottest" thing in C++ today I would think is "value semantics".  That you can express value semantics at all is proof of how expressive the language is, however, it gets tedious really quickly annotating yourself that things are const or constexpr and noexcept.  At least noexcept is a compile time operator that will tell you whether an expression can throw...
  • Passing by value, by pointer, const pointer, reference, const reference, and "rvalue reference" all have their subtleties, and taken as a whole the set is not particularly useful.
    • In my opinion, the only justification for naked pointer arguments is to indicate optionality: the argument address may be null; in any case, the callee does not assume any ownership.
    • Without optionality, pass only by value or by some form of reference.
    • Returning naked pointers only makes sense for optional things not owned by the caller
    • For return values or arguments to a function, use smart pointers only to indicate ownership.  While the current set of smart pointers allows one to indicate the intended ownership, unfortunately it is not possible to indicate the absence of optionality.  That's one reason why I think the standard library should have intrusive pointers, which allow converting a value into a smart pointer (see the signature sketch after this list).
    I wish there existed a "truly const" modifier for arguments, which would mean this: if the compiler determines it is performance-advantageous to copy the argument into a local value, it is allowed to do so, because the argument truly won't change; that is, the compiler is also allowed to assume that the location in which the value resides won't change, so it can refer to it by its address.
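The conventions above can be summarized in signatures; a hypothetical sketch (the types and function names are mine, purely illustrative):

#include <memory>

struct Thing {};

void use(Thing t);                      // not optional: the callee gets its own value
void maybeUse(const Thing *t);          // optional (may be null), never owned by the callee
Thing *findThing();                     // the returned thing is optional, not owned by the caller
std::unique_ptr<Thing> makeThing();     // ownership is transferred to the caller
void adopt(std::shared_ptr<Thing> t);   // the callee shares ownership

It's time to go back to aliasing: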
"Aliasing" means that the same thing is referred to as apparently different things.  First, an example that uses concrete types for ease of explanation:
void incrementAll(std::vector<BigThing> &bt, const BigThing &value)
{
    for(auto &v: bt) { v += value; }
}
By "BigThing" is meant some data type that is expensive to copy around.  For example, imagine an implementation of matrices as vectors of vectors.  It is easy to see that passing as parameter a const reference to a big thing is the right thing to do, or is it not?.
The issue is that in the simple code given, there is no guarantee that the value referenced through "value" does not lie within the vector; that "value" is not an alias of an element in the vector, or even of parts of up to two elements of the vector.
Unless the function is inlined (in this case it is not, though it probably should be) and the compiler has full information about the arguments "bt" and "value" at the call site, it will not be able to prove that the location of "value" is not within the memory buffer of the vector "bt".  If the compiler seriously optimizes, it can realize that the iteration of elements happens over a sequential buffer; hence, it conceivably can insert code to check the bounds of the buffer, determine whether the location of "value" may be within those bounds, and perform case switching; that is, to automatically transform the code into this:

template<
    typename V
> void vectorizeIncrement(V *buffer, const V *limit, const V *val);
    // do in parallel *ptr += *val
    // for ptr in the range [buffer, limit)
    // and assuming that *val does not change

void incrementAll(std::vector<BigThing> &bt, const BigThing &value) {
    auto vectorBuffer = bt.data();
    auto end = vectorBuffer + bt.size();
    auto valueLocation = &value;
    if(valueLocation < vectorBuffer or end <= valueLocation) {
        vectorizeIncrement(vectorBuffer, end, valueLocation);
        // vectorizeIncrement would be something reasonable that attempts
        // to apply the increment to many elements in parallel
    } else { // "value" is going to be changed despite being "const"!
        // note that a "BigThing" at "valueLocation" may legitimately straddle
        // two "BigThing" elements in the vector...
        intptr_t
            valueLocationAsInteger = intptr_t(valueLocation),
            bufferAsInteger = intptr_t(vectorBuffer),
            diff = valueLocationAsInteger - bufferAsInteger,
            size = sizeof(BigThing),
            straddleBeginningIndex = diff / size,
            straddleEndingIndex = (diff + size - 1) / size; // integer ceiling
        vectorizeIncrement(
            vectorBuffer, vectorBuffer + straddleBeginningIndex, valueLocation
        );
        vectorBuffer[straddleBeginningIndex] += *valueLocation;
        if(straddleBeginningIndex != straddleEndingIndex) {
            vectorBuffer[straddleEndingIndex] += *valueLocation;
        }
        vectorizeIncrement(
            vectorBuffer + straddleEndingIndex + 1, end, valueLocation
        );
    }
}

The problem is that it is fundamentally impossible to know at compilation time whether the extra work of doing the case analysis for whether the location of value is in the vector buffer will pay off; and this is still assuming that the compiler knows that BigThing &operator+=(const BigThing &) does not have a side effect on either the vector or the value!!

Profile-guided optimization can certainly help, but even then, what is the guarantee that the profiled cases will be representative of production use long after the code has been released?

Something as apparently simple as telling the compiler "I am giving you the address only to prevent copying, but trust me, the contents do not change" is inexpressible in current C++.  "const" is only a restriction on your code: it means you prohibit yourself from changing the values declared const explicitly, but it says nothing about who else can change the values, such as a different thread, nor about whether they can change implicitly, as a side effect of a function call.
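A tiny, contrived sketch of that last point:

#include <cstdio>

void update(const int &input, int &output) {
    output = input + 1;           // writes through output...
    std::printf("%d\n", input);   // ...and the "const" input now reads 1: they alias
}

int main() { int x = 0; update(x, x); }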

Solving this goes way beyond incorporating "restrict", because unless inlined, any function call can have any arbitrary side effect.  On the other hand, the only guaranteed solution to aliasing expressible within the current language, which is to pass by value and use local copies of values until the changes are written back at the end, is both a heavy-handed semantic change and a potential performance pit.
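In the incrementAll example, that heavy-handed solution would look like this sketch, assuming BigThing and the includes as before:

void incrementAll(std::vector<BigThing> &bt, const BigThing &value) {
    const BigThing local = value; // "local" cannot alias any element of bt
    for(auto &v: bt) { v += local; }
}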

The final remarks about these complications are:
  • Don't manage naked pointers; instead, link your pointers to things that will delete them in their destructors, for example, smart pointers.
  • Take the trouble to constify and constexprify and noexceptify
  • Endure the pain of manually making your code less and less mutable, it pays off
  • When in doubt, err on the side of passing by value
  • Even if you are not a sophisticate, learn enough metaprogramming so that the compile-time immutability tools help you accomplish guarantees of immutability in your practical code.

Tuesday, June 23, 2015

Is garbage collection the worst idea in all of computing?

An alternative name for this post could be "The constructor/destructor pair is the most powerful construct in programming languages"

Before getting into the subject matter, I must clarify that by "Garbage Collectors" (GCs) I mean exclusively the tracing garbage collectors, that is, the mechanisms that track which dynamically allocated objects are accessible, as opposed to reference counting.

I have used extensively languages that rely on GCs for dynamic memory management, including Java, JavaScript, and Python, and have also used extensively several languages, such as C++, that don't.  My conclusion is that GCs are worse than useless, actually harmful.  The more experience I gain, the more this opinion is confirmed.  I will explain how garbage collection hurts.

Almost all modern programming languages rely on this concept to manage dynamic memory.  However, the arguments are so clear that I will incur the risk of embarrassing myself by going against the grain.

Managing Dynamic Memory

First, let us be constructive, what is a good way to manage dynamic memory?

It depends, naturally.  There are many legitimate uses for dynamic memory that differ in their nature; each should be handled in the way most appropriate for it.

For non-recursive data there is a basic way to handle memory, reference counting, which will suffice. By non-recursive data I mean data that does not end up referring back to itself through a chain of outgoing references. For example, a well-formed singly linked list is, in this sense, not recursive:
template<typename T> struct SinglyLinkedListNode
{
    T value;
    SinglyLinkedListNode *next;
};
Although the definition of SinglyLinkedListNode refers to itself, instances will not refer to themselves, neither directly nor indirectly.  The situation is different for a doubly linked list: each element in the list refers indirectly to itself, through the successor of its predecessor or the predecessor of its successor.  Since self-referential data occurs very often in practice, there is the need to cover it satisfactorily, which I will.
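For contrast with the singly linked node above, a sketch of the doubly linked one:

template<typename T> struct DoublyLinkedListNode
{
    T value;
    DoublyLinkedListNode *next;
    DoublyLinkedListNode *previous; // node->next->previous == node: a cycle
};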

An interesting special case in C++ 11 is when the software designer makes it so that there is only one owner of a dynamic memory object (even if it is recursive); in such a case, the owner does not even have to keep a reference count: through language mechanisms, what is tracked is the ownership of the memory object.  I refer to the unique_ptr facility.

I just mentioned the concept of ownership of dynamic memory.  Perhaps my view is linked to the way C++ understands dynamic memory, but I think it is no abuse to assume there is an owner of newly created dynamic memory: either the code that brought the new object into existence, or the data structure that required it.  The owner can later decide whether to give ownership to other entities or to withdraw from owning the dynamic memory.

If ownership does not change, there isn't a need for management either, at least not beyond the responsibility to clean up.  For this, the standard library smart pointer unique_ptr will suffice.  It is also very good for single ownership that may change, and its execution cost for supporting a change of ownership is simply nullifying a pointer variable; I am not sure even this is always necessary, since the compiler may determine, at the place where the ownership transfer occurs, that there is nothing to clean up afterward.  Note: unique_ptr offers the option of a specific "deleter", with its own performance cost; the point is not whether there are expensive ways to manage dynamic memory, the point is that there is a viable, zero-cost way to do it that covers a substantial fraction of the legitimate uses of dynamic memory:  the case of clear single ownership.
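A sketch of that zero-cost case (the names are mine, purely illustrative):

#include <memory>
#include <utility>

struct Resource {};

std::unique_ptr<Resource> make() { return std::unique_ptr<Resource>(new Resource); }

void take(std::unique_ptr<Resource>); // the callee becomes the owner

void example() {
    auto owner = make();    // clear single ownership
    take(std::move(owner)); // the transfer merely nulls "owner"
}                           // nothing left to clean up here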

Then there are the cases of intended shared ownership of a value; of course this is already supported by the standard library through std::shared_ptr.  There is also the alternative of smart pointers that let the instances themselves carry the information to keep track of the sharing, such as the reference count; although not in the standard library, this is easy to express in the language, as for example the "boost" libraries' intrusive_ptr.  Whenever instances are shared, this case potentially includes recursive references, that is, a sequence of references in which entity A references B, and so on, until some C references back to A.  There are many ways to deal with this issue; for example, the designer can use knowledge of the data structure to decide to make some references non-owning.  Hence, the software can be designed in such a way that there is a subset of the references that are ownership references.  This can be expressed in C++ through std::shared_ptr and std::weak_ptr.
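A sketch of the doubly linked list revisited with these tools:

#include <memory>

template<typename T> struct Node {
    T value;
    std::shared_ptr<Node> next;   // an ownership reference
    std::weak_ptr<Node> previous; // non-owning: the back reference carries no ownership
};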

People may object that "intrusive" solutions, that is, those that force the designer to do things in particular ways, are a form of undesirable micromanagement of implementation details.  Since a garbage collector tracks the accessibility of all the things in a program, it relieves the programmer of having to specify how the references should be accounted for; there is no need to ask "is this an ownership reference or not?".  That is a valid objection.  Furthermore, implementing the concept of "accessibility" with pure C++ resources is nearly impossible, because the language does not have introspection capabilities (what is technically referred to as "reflection", which I think is a misnomer) to query things for what they might be accessing.  However, the language is expressive enough that it is possible to create type adapters that provide the introspection facilities needed to implement accessibility, and thus emulation of garbage collection is feasible in C++, just not for arbitrary types, nor without some effort.  Still, the objective is not to prove that if a programmer wants tracing garbage collection they can have it in C++, but to prove GCs are no good.  That is, if a programmer does not want to specify which references are owning, that may be legitimate, and it can be supported in C++.  But is it legitimate in the strong sense of creating production, critical software without specifying ownership?

There is an unresolved question: whether software designers should have the freedom of not specifying the nature of their references.  My own opinion is that they should; however, in the practical sense, I seldom work with potentially recursive data structures (recursive in the sense that dynamic objects end up referring to themselves), and even when I do, I find the task of deciding on the nature of the references easy.  It may be that most of the software I make is very conscious of performance, and thus I more often need ways to tell exactly what I want, so deciding on the nature of references is almost inherent for me.  I think the opinion that some minimum of detail should be required from the designer about what the code does is legitimate.  Is an implementation really complete if it can't even decide whether there can be valid recursion, or without having identified owning and non-owning references?  I submit that the question is valid.  My own opinion is affirmative (an implementation is complete even without specifying the nature of references), but I don't find it burdensome to always have to decide the nature of the references.  Thus, although I have not proved whether it is legitimate not to specify the nature of references, in practice I've seen it is not burdensome to do it.

An important characteristic of C++ is that it lets programmers specify the desired mechanisms to accomplish anything, down to almost ridiculous levels of fine-grained detail, and sometimes even to specify mechanisms that are less specific than what the language allows for arbitrary types.  One example is Variant types, which might be implemented using the technique of type erasure; that is, Variant types are such that the designer willingly reduces the specificity of the type.  Another example would be to implement introspection capabilities, and garbage collection on values of the types that have those introspection capabilities.

Then there is the cost of completely relieving the programmer from identifying the nature of references: not just the performance cost, which is significant, but other inherent costs, including non-determinism and how it increases the propensity for errors in managing resources.

The first issue, non-determinism, occurs because if the programmer did not make the lifetime of the dynamic memory explicit, the execution environment cannot manage the memory deterministically.  This changes the observable results of a program if dynamically allocated memory is associated with resources:  for example, a file may be closed before or after the program accesses a database, only because the dynamic memory object associated with the file was reclaimed before or after the database access.

This leads to the crux of the problem:  In my opinion garbage collection makes programs inherently unreliable.  To prove this opinion, a detour:

Memory is just one kind of resource; actually, the most homogeneous and easiest to manage of all types of resources, as opposed to files, databases, network connections, user interactions, or process/thread synchronization objects.

If the software is not a pure computation, such as an artificial intelligence application looking for an improved algorithm, but something that influences the real world as it runs, then the software will need to manage things that make changes in the real world; these are modelled as resources in software.  Languages that have garbage collection tend not to provide any mechanism for explicit management of memory, and by extension, of any type of resource; in particular, if a language with garbage collection offered very good mechanisms to deterministically handle the lifetime of resources, then those mechanisms would be practical for memory too, negating the value of garbage collection.  In practice, they simply don't.  It is actually hard to guarantee, in languages with garbage collection, something as trivial as that database connections will be closed before some other thing happens in the program; it is hard because the guarantee must be coded explicitly, manually, and implementation detail changes routinely break those guarantees in practice.  The end result shows up in things like unblocking the screen of your smartphone and having the thing freeze for several seconds, only because unlocking the screen required a little bit more memory than was available, and this triggered the deep garbage collection.  Too bad you wanted to take a picture and the moment passed while the smartphone was collecting garbage.

If a language does not have garbage collection, it forces the programmer to at least identify ownership relationships.  This can be done correctly, it is just a higher level of specificity as was described before.

In particular, C++ is absolutely wonderful at supporting resources:  it has the concept of the destructor, which offers guarantees symmetrical to those the constructors provide.  It baffles me that object-oriented experts can readily understand the need to establish the class invariants (the constructors) but fail to realize the symmetrical need to guarantee the de-construction of objects.
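A minimal sketch of that symmetry, acquisition in the constructor and guaranteed, deterministic release in the destructor:

#include <cstdio>

class File {
    std::FILE *handle;
public:
    explicit File(const char *name): handle(std::fopen(name, "r")) {} // acquire
    ~File() { if(handle) { std::fclose(handle); } }                   // release, deterministically
    File(const File &) = delete;
    File &operator=(const File &) = delete;
};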

In C++, resources can be handled through their ownership relationships:  when the lifetime of the last surviving owner of a resource comes to its end, the lifetime of the resource also comes to an end.  This ownership relationship is every bit as powerful as the relationship of accessibility; as a matter of fact, it arguably is the very same relationship, just that the concept of the destructor makes the relationship managed deterministically.

In the interest of completeness, given the strength of my claims, I must refer to cases in which reference counting is harmful compared to reconstructing the accessibility relationship.  One example: imagine a recursive function to which the tail recursion optimization is applicable.  Furthermore, imagine the most nested invocation of the function uses a data item that refers to a data item in a calling context.  Since the function is terminating, the compiler knows at once that all of the stack frames are being destroyed; however, it won't destroy them all at once, but one by one:  the destructor of the data item in the most nested invocation will trigger the destruction of the calling frame, and so on.

Going back to the issue of management of resources, my C++ experience proves that the concept of ownership is sufficient to drive reliable management of resources.  In the absence of this concept, or of its excellent support in C++, there is no alternative but to implement the management of resources in ad-hoc, that is, inherently error-prone, ways.

In the end, ownership relationships require the programmer to specify them.  If they do, the software will almost always outperform garbage-collected alternatives, and will be much more reliable.

TL;DR:
There are legitimate cases in which garbage collection is superior; however, in practice it is truly important that the language offer good facilities to handle the lifetime of resources.  Garbage collection assumes the opposite position, that the lifetime of the simplest of resources, memory, can be reliably handled by the execution environment; thus, in garbage-collected languages, all sorts of resources end up being mismanaged, making them inherently unreliable.