Monday, November 30, 2015

Introspection, preprocessing and boost preprocessing

I was reading our blogspot neighbor Scott Meyers, “The Brick Wall of C++ Source Code Transformation”, where he discusses one of the worst problems in C++: the overreliance on the preprocessor. For a few years now I have been using the preprocessor to palliate the deficiencies of the language's introspection capabilities. I hope this is another class of legitimate preprocessor use, alongside #including headers because there is no concept of modules, and simple management of versions through conditional compilation, as when one asks whether the code is being compiled with G++ via #if defined __GNUG__. I am not entirely sure this use is legitimate, but it seems the least bad of the practical options. Let us get to know this animal (monster?) of the preprocessor in more detail, to appreciate its cost this time. In later articles I hope to make good exhibits for the zoo, in which this animal shows its potential.

What is the problem?

The preprocessor is a source code transformation language. The problems are that it is a very poor language, and that inserting a source code transformation layer between the programmer and the compiler makes everybody's life much harder. Concretely, it makes building tools for the language somewhere between hard and nearly impossible, it makes compilation much slower, it makes source code much harder to understand, and it invites subtle errors.

Headers

Although the language does not support modules, we still need to import their interfaces; that is why we speak of declarations and definitions. An #include is equivalent to copying and pasting the included file into the current file. However, the declarations in the included header are affected by the context carried over from whatever preceded them, and this opens a can of worms. Let’s say the file header1.h has this code:
namespace library1 {

// some declarations
#define ITEM product
// more declarations; ITEM is never #undef'ed

}
Later, another header may have this content:
namespace library2 {

// some declarations
struct ITEM {
    void not_inline_function();
    // …
};

// more declarations
}

Then, if a file.cpp has this content:
#include "header1.h"
#include "header2.h"

Every mention of ITEM will in reality refer to product. The program may even compile, since the declaration struct ITEM will be silently converted to struct product in a different namespace, and within that namespace all the mentions of ITEM will be consistently changed to product. It may even link, if all the uses of ITEM are restricted to data members and inlined functions! However, months later some change happens to the code, somebody uses the member function not_inline_function, and gets stumped by the link error “undefined reference to not_inline_function”.
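To make the failure mode concrete, this is a sketch of what the compiler actually sees for header2.h inside file.cpp, after the preprocessor has done its substitution:
namespace library2 {

// The programmer wrote "struct ITEM", but header1.h left ITEM defined
// as product, so this is what reaches the compiler:
struct product {
    void not_inline_function();
    // …
};

}

// The binary that implements header2.h was built without header1.h in
// scope, so it defines library2::ITEM::not_inline_function; file.cpp,
// however, emits a call to library2::product::not_inline_function,
// a symbol that exists nowhere.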

Back when people were switching from 32 bits to 64 bits, it was frequent for code to assume a pointer and an integer were the same size, so there were lots of people who would #define int long and rebuild. Of course, things may appear to work, and the developer may be praised for how fast he did the migration, but things will eventually explode catastrophically because some “who knows what” library function that was not recompiled assumed 32 bits for an integer and got 64.

Perhaps my examples are not good; in any case, the point I am trying to make is that a header and its corresponding binary library must be exactly in sync, otherwise things may seem to work until they explode. Because headers are vulnerable to the context in which they are #included, going through the preprocessor introduces all sorts of possibilities for subtle errors.

Another problem, of a practical nature, is that the compiler is forced to compile the #included headers every time they appear: since the interpretation of a header depends on the exact code that precedes its inclusion, there is no remedy but to do it all over again! Note: this is what makes precompiled headers worthless. To guarantee the headers will all have the same meaning, the programmers must put together all the headers they could potentially include into a single “universe” header, which is big, fat, greasy with leaking references to implementation details, coupled to those details, and with very low cohesiveness. Then, in all the places where well thought-out headers would have been included, the universe is included instead.

Versions

The other legitimate preprocessor use I mentioned, conditional compilation and versioning, suffers from not having a way to guarantee that a prebuilt binary is compatible with the current source code and compilation options. In practice this leads to wasteful re-compiling and re-linking; or, much worse, to things that apparently work but in reality have catastrophic bugs. Let’s say a.cpp gets compiled assuming struct Foo has a member int member, but because of a subtle interplay of conditional compilation, in b.cpp struct Foo has a member long member. Things may appear to be fine, but the layout of Foo is different: somewhere some code thinks it is X bytes long, somewhere else Y bytes long, and then all bets are off.
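A minimal sketch of that scenario (the macro name FOO_LONG_MEMBER is made up for the illustration):
// foo.h
struct Foo {
#if defined(FOO_LONG_MEMBER)
    long member;    // 8 bytes on a typical LP64 platform
#else
    int member;     // 4 bytes
#endif
};

// a.cpp is compiled without the macro: it sees sizeof(Foo) == 4.
// b.cpp is compiled with -DFOO_LONG_MEMBER: it sees sizeof(Foo) == 8.
// Both translation units compile and link cleanly, yet any Foo passed
// between them is read and written with the wrong layout; this is an
// ODR violation the toolchain never diagnoses.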

These are not new problems; they are actually over forty years old. The culprit is that the language does not offer any way to guarantee that the declarations seen by library users respect the assumptions made in the compiled/linked binary implementations. It is truly embarrassing, and I’ve not seen any prospects for a solution, not even for C++17, for which some form of modules is being discussed.

Toolability

Because of language deficiencies we still need to use the preprocessor, and this means that the only tools that can be developed for the language must, in one way or another, also be compilers. For example, imagine an IDE that wants to help with autocompletion of function calls. To be useful, the tool must be able to offer applicable functions that are the result of macro expansions. Thus the tool must be able to preprocess, on top of all the other pure-language requirements, and it must know which macro expansions correspond to which places in the source code. What about function-like macros? When doing autocompletion, is the tool going to offer the macro or only its expansion?

Introspection

I hope the claim that the preprocessor is a real problem has been substantiated. However, for all of its expressiveness, the language has deficiencies that are covered by the preprocessor, thus it may be legitimate to continue to alleviate those deficiencies through it. A consistent set of introspection features is sorely lacking. By introspection I mean what is commonly referred to as “reflection”: ways in which programming constructs can reason about themselves. For example, in Java it is trivial to ask a class for its data and function members; it is possible to know everything there is to know about them. In C++ there is a collection of ad-hoc introspection capabilities, such as whether dynamic_cast of a pointer returns null or not, typeid, the sizeof operator, and the very powerful pattern-matching capabilities of templates. For all of their might, it is simply not possible to know even the number of data members in a plain structure. It is not possible to know, from within the program itself, the count of enumerators in an enumerated type, nor what identifiers are used, nor the mapping of enumerators to integer values, nor the reverse mapping.
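A small illustration of the contrast, using only what the language gives us (the enumeration Color is just an example for this sketch):
#include <iostream>
#include <typeinfo>

enum class Color { Red, Green, Blue };

void illustrate() {
    Color c = Color::Green;
    // Ad-hoc introspection that does exist:
    std::cout << typeid(c).name() << ' ' << sizeof(c) << '\n';
    // What does not exist: no expression gives us
    //   - the number of enumerators of Color (3)
    //   - the identifier of c as a string ("Green")
    //   - the Color whose identifier is "Blue"
    // All of that has to be written, and kept in sync, by hand.
}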

The only option left is to develop idioms or conventions with which we can plug in some of the capabilities mentioned above.

The use case that led me down the path of using boost preprocessing was the simple need to convert enumerated values to strings. For example:
enum Enumeration {
    VALUE1 = 9,
    VALUE2 = 8,
    VALUE3 = 2
};

How do you implement something like Enumeration::count() that tells you the number of enumerated values, or Enumeration::values(), or Enumeration::identifiers()? It would all need to be done manually. Typically, one wants something like std::ostream &operator<<(std::ostream &output, Enumeration value); this is what we repeatedly waste our time writing:
std::ostream &operator<<(std::ostream &output, Enumeration e) {
    switch(e) {
        case VALUE1: output << "VALUE1"; break;
        case VALUE2: output << "VALUE2"; break;
        case VALUE3: output << "VALUE3"; break;
    }
    return output;
}

It does not look like much work, and it isn’t, but the real problem is that some day somebody will add VALUE4 = 13, forget to map it, and then the code won’t work. Also, somebody will just copy and paste, forgetting to change the string:
std::ostream &operator<<(std::ostream &output, Enumeration e) {
    switch(e) {
        case VALUE1: output << "VALUE1"; break;
        case VALUE2: output << "VALUE1"; break;
        case VALUE3: output << "VALUE1"; break;
    }
    return output;
}

The compiler is happy. Your tests will be happy too, but in the critical log entry for your application, whenever VALUE2 appears you’ll see VALUE1 written and be mystified…

Doctrine

I don’t like guaranteeing relationships between software components manually; it is the same thing as using explicit new and delete instead of smart pointers: it is error prone. I like C++ because it lets me express fairly non-trivial relationships so that the compiler will guarantee them for me. When pure C++ is not sufficient, I try hard to find the least meta-C++ before resorting to doing things manually. This element of my doctrine has served me very well. Even if my choices to represent relationships look very ugly, at least they are expressed, subject to improvement, and handled automatically; over time, things look less ugly, because you learn how to express yourself better, the accidental complexity is reduced, and the inherent complexity stops bothering you once you realize it is necessary.

None of this happens with the manual coding of relationships.

In general, I have reached the conclusion that lists of identifiers are inexpressible in pure C++. Thus we have to look beyond the language for ways to handle them. Is it possible to express them using the preprocessor?

Boost preprocessing

Four years ago I discovered boost preprocessing while getting acquainted with the code of a system new to me: a genius coworker had coded a general solution to declare, define and print enumerations which used boost preprocessing and which, before C++11, was already capable of expressing enumerations in their own scope and based on user-specified integral types.

My old coworker arrived at different choices than the ones I will propose, but the idea is essentially the same. Let’s work toward the refined solution I made, illustrating each objective and its solution.

Let us begin with the enumeration shown above. To be able to “reason” about an enumeration, at the bare minimum we need something that converts enumerated values to their identifiers as strings; we need little else. These few things can be put together in a structure, the way in which we typically bind together related compile-time constructs. Let us associate the enumerated values with their strings using a simple std::array:
#include <array>
#include <utility>

struct demo {
    using integral_type = unsigned char;

    enum enumeration {
        VALUE1 = 8,
        VALUE2 = 9,
        VALUE3 = 2
    };

    using pair_type = std::pair<integral_type, const char *>;

    static constexpr unsigned count = 3;

    static constexpr std::array<pair_type, count> values =
    {{
        { VALUE1, "VALUE1"}, { VALUE2, "VALUE2" }, { VALUE3, "VALUE3" }
    }};
};

demo has integral_type, enumeration, pair_type, count and values, which I think serve clear roles. Pure C++ won’t let us express the relationship between enumeration and values, but there is a way, which we will illustrate momentarily. In any case, with these members we can implement any number of interesting introspection capabilities:
  1. Represent enumerated values using the integral_type
  2. All of the value-type services expected: construction, comparison, etc.
  3. It is possible to convert enumerated values to strings and vice versa
  4. Iteration over all the valid values of the enumeration (see the sketch right after this list)
  5. All of these operations can be implemented to run very fast and potentially as compile-time constructs, that is, as if the language provided these features out of the box.
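For instance, the iteration of capability 4 falls out of the values array directly; a minimal sketch (assuming C++17, where constexpr static data members are implicitly inline, or an out-of-class definition for demo::values):
#include <iostream>

void list_all() {
    // Visit every valid value of the enumeration together with its identifier
    for(const auto &pair: demo::values) {
        std::cout << int(pair.first) << " is " << pair.second << '\n';
    }
}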
Let us augment the demo struct with many straightforward features through a template that combines some of its members into useful things. Please observe that we indeed got away with making almost everything constexpr and noexcept:
#include <unordered_map>

namespace zoo {

template<typename E> struct smart_enumeration: E {
    using typename E::integral_type;
    using typename E::enumeration;

    integral_type code;

    constexpr smart_enumeration() noexcept: code(E::values[0].first) {}

    explicit constexpr
    smart_enumeration(integral_type v) noexcept: code(v) {}
    
    smart_enumeration(const smart_enumeration &) = default;
    
    explicit constexpr
    smart_enumeration(enumeration e) noexcept: code(e) {}

    smart_enumeration &operator=(enumeration e) noexcept
    { code = e; return *this; }

    constexpr operator integral_type() const noexcept { return code; }

    constexpr bool operator==(smart_enumeration e) const noexcept
    { return code == e.code; }
    constexpr bool operator!=(smart_enumeration e) const noexcept
    { return not (*this == e); }

    operator const char *() const {
        static const auto copy = E::values;
            // Note: this copy prevents having to "define" E::values
        static const std::unordered_map<integral_type, const char *>
            mapping(copy.begin(), copy.end());
        auto resultFind = mapping.find(code);
        if(mapping.end() == resultFind) { return nullptr; }
        return resultFind->second;
    }

    bool valid() const { return nullptr != static_cast<const char *>(*this); }
        // valid exactly when the code maps to a known identifier
};

}

Note: Yes, none of the constexpr, noexcept, and even const annotations above are superfluous: the compiler does not automatically allow the use of a non-constexpr function in compile-time expressions, nor does it deduce noexcept automatically; and because constexpr implying const on member functions was deemed a small mistake of C++11, in C++14 it no longer does. It is a shame, since all of these annotations add clutter and are deducible by the compiler.

Note 2: The conversion to string is not implemented as a compile-time function because it uses an unordered_map; however, with enough effort it is possible to implement a compile-time map, even in C++11, it is just not practical.
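For the curious, this is the kind of thing Note 2 alludes to: a minimal sketch of a compile-time lookup written as a C++14 constexpr linear search over the values array. It is a hypothetical helper, not part of the solution in this article, and it is subject to the same caveat about odr-using E::values:
namespace zoo {

// Hypothetical helper: returns the identifier for a code, or nullptr,
// entirely at compile time when the argument is a compile-time constant
template<typename E> constexpr const char *compile_time_name(
    typename E::integral_type code
) {
    for(unsigned i = 0; i < E::count; ++i) {
        if(E::values[i].first == code) { return E::values[i].second; }
    }
    return nullptr;
}

}

// Example use, once demo is defined:
// static_assert('V' == zoo::compile_time_name<demo>(8)[0], "");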

Now, the implementation of things like the insertion operator:
#include <ostream>

namespace zoo {

template<
    typename E
> std::ostream &operator<<(std::ostream &output, zoo::smart_enumeration<E> e) {
    const char *ptr = e;
    if(ptr) { output << ptr; }
    return output;
}

}

Let us write a non-inline function that the compiler is forced to translate, to drive enough of the implementations we have, plus a main to prove we don’t have linking issues:
#include <iostream>

void drive(zoo::smart_enumeration<demo> d) {
    using se = zoo::smart_enumeration<demo>;
    static_assert(se().code == 8, "");
        // constexpr default constructor
    static_assert(noexcept(se()), "");
        // the default constructor is noexcept
        // Proven the constructor by integral and by enumeration are
        // constexpr and noexcept
    static_assert(noexcept(se(d)), ""); // default copy constructor is noexcept
    static_assert(9 == se(se(demo::VALUE2)).code, ""); // also constexpr

    static_assert(noexcept(se(8) == se(9)), ""); // equality comparison noexcept
    static_assert(noexcept(se(8) != se(9)), ""); // different is also noexcept
    static_assert(!noexcept(static_cast<const char *>(se())), "");
        // however, the conversion to const char * can throw
    std::cout << d << d.valid(); // All compiles
}

int main(int argc, const char *argv[]) { return 0; }

So, if we are able to supply the members that smart_enumeration requires, the same ones we put in demo, then we will forever get all the other stuff automatically, and we can use our proto-introspection to implement lots of other things. For example, the backward map (string to enumeration), implemented in construct_smart_enum(std::string):
#include <string>

namespace zoo {

template<
    typename F, std::size_t N
> std::unordered_map<std::string, F> backward_map(
    std::array<std::pair<F, const char *>, N> argument
) {
    std::array<std::pair<std::string, F>, N> initializer;
    for(unsigned i = argument.size(); i--; ) {
        initializer[i].first = argument[i].second;
        initializer[i].second = argument[i].first;
    }
    return
        std::unordered_map<std::string, F>(
            initializer.begin(), initializer.end()
        );
}

template<typename E> smart_enumeration<E> construct_smart_enum(std::string s) {
    using se = smart_enumeration<E>;
    const static auto copy_for_gcc = E::values;
    const static auto reverted_map = backward_map(copy_for_gcc);
    return se(reverted_map.find(s)->second);
        // precondition: s is one of the identifiers; otherwise find returns
        // end() and dereferencing it is undefined behavior
}

}

And the test, that also shows a possible way to use it:
zoo::smart_enumeration<demo> something()
{ return zoo::construct_smart_enum<demo>("VALUE2"); }

It only remains to show the magical incantation that binds the identifiers to the enumeration values:
PP_SMART_ENUMERATION(
    demo,
    unsigned char,
    ((VALUE1, 8))((VALUE2, 9))((VALUE3, 2))
);

The expansion of that macro call generates the code we wrote for demo above. This is the prestidigitation I just did, in slow motion:
#include <boost/preprocessor/seq/for_each.hpp>
#include <boost/preprocessor/seq/size.hpp>

#define PP_SEQ_META_CALL(r, MACRO, element) MACRO element

#define PP_MAKE_MAP_PAIR(identifier, value) { value, #identifier },
#define PP_MAKE_ENUMERATION_PAIR(identifier, value) identifier = value,

#define PP_SMART_ENUMERATION(name, integer_type, identifiers)\
struct name {\
    using integral_type = integer_type;\
    enum enumeration {\
        BOOST_PP_SEQ_FOR_EACH(PP_SEQ_META_CALL, PP_MAKE_ENUMERATION_PAIR, identifiers)\
    };\
    using pair_type = std::pair<integral_type, const char *>;\
    static constexpr unsigned count = BOOST_PP_SEQ_SIZE(identifiers);\
    static constexpr std::array<pair_type, count> values = {{\
        BOOST_PP_SEQ_FOR_EACH(PP_SEQ_META_CALL, PP_MAKE_MAP_PAIR, identifiers)\
    }};\
}
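To make the machinery a little less opaque, this is roughly how one of those BOOST_PP_SEQ_FOR_EACH invocations unfolds, shown for only two elements of the sequence (r is the repetition state Boost passes along):
// BOOST_PP_SEQ_FOR_EACH(PP_SEQ_META_CALL, PP_MAKE_ENUMERATION_PAIR,
//                       ((VALUE1, 8))((VALUE2, 9)))
// invokes PP_SEQ_META_CALL(r, PP_MAKE_ENUMERATION_PAIR, element) once per
// element, each element keeping its parentheses:
//     PP_SEQ_META_CALL(r, PP_MAKE_ENUMERATION_PAIR, (VALUE1, 8))
//     PP_SEQ_META_CALL(r, PP_MAKE_ENUMERATION_PAIR, (VALUE2, 9))
// Each of those pastes the macro name next to the parenthesized element,
// turning it into an ordinary two-argument macro call:
//     PP_MAKE_ENUMERATION_PAIR (VALUE1, 8)  PP_MAKE_ENUMERATION_PAIR (VALUE2, 9)
// which finally expands to the enumerator definitions:
//     VALUE1 = 8, VALUE2 = 9,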

Not entirely self-explanatory? Of course not, and I have not even gone over the things that make this code break, but this is a great opportunity to leave the article on a cliffhanger…
