Tuesday, June 23, 2015

Is garbage collection the worst idea in all of computing?

An alternative name for this post could be "The constructor/destructor pair is the most powerful construct in programming languages"

Before getting into the subject matter, I must clarify that I mean by "Garbage Collectors" (GCs) exclusively to the tracing garbage collectors, that is, the mechanism that tracks which dynamically allocated objects are accessible, as opposed to reference counting.

I have used extensively languages that rely on GCs for dynamic memory management, including Java, JavaScript, Python, and also used extensively several languages such as C++ that don't. My conclusion is they are worse than useless, actually harmful.  The more experience I gain, the more confirmation of this opinion, I will explain how garbage collection hurts.

Almost all modern programming languages rely on this concept to manage dynamic memory.  However, the arguments are so clear that I will incur the risk of embarrassing myself going against the grain.

Managing Dynamic Memory

First, let us be constructive, what is a good way to manage dynamic memory?

It depends, naturally.  There are many legitimate uses for dynamic memory that differ in their nature, they each should be handled in the way most appropriate for them.

For non-recursive data there is a basic way to handle memory, through reference counting, which will suffice. For non-recursive data I mean data that does not end up referring back to itself through a chain of outgoing references. For example, a well formed singly linked list is in this sense not recursive:
template<typename T> struct SinglyLinkedListNode
{
    T value;
    SinglyLinkedListNode *next;
};
Although the definition of SinglyLinkedListNode refers to itself, instances will not refer to themselves, neither directly, nor indirectly.  The situation is different for a doubly linked list:  Each element in the list refers indirectly to itself by the successor of the previous link or the predecessor of the successor.  Since self-referential data occurs very often in practice, there is the need to cover it satisfactorily, which I will.

An interesting special case in C++ 11 is when the software designer makes it so that there is only one owner to a dynamic memory object (even if it is recursive), in such a case, the owner does not even have to keep a reference count; through language mechanisms what is tracked is the ownership of the memory object.  I refer to the unique_ptr facility.

I just mentioned the concept of ownership of dynamic memory.  Perhaps my view is linked to the way C++ understands dynamic memory, but I think it is no abuse to assume there is an owner to new dynamic memory, either the code that brought the new object into existence, or the data structure that required it.  The owner can later decide whether to give ownership to other entities or to withdraw from owning the dynamic memory.

If ownership does not change, there isn't a need for management either, at least not beyond the responsibility to cleanup.  For this, the standard library smart pointer unique_ptr  will suffice.  They are also very good to single-ownership that may change.  And their execution cost of supporting change of ownership is simply nullifying a pointer variable; however, I am not sure even this is necessary since the compiler may determine at the place where the ownership transfer occurs that there is no need to clean that up afterward.  Note: unique_ptr offers the option of an specific "deleter", with its own performance cost, the point is not whether there are expensive ways to manage dynamic memory, the point is that there is a viable way with zero cost to do it that covers a substantial fraction of the legitimate uses of dynamic memory:  The case of clear single-ownership.

Then there are the cases of intended shared ownership of a value, of course this is already supported by the standard library through std::shared_ptr.  There is also the alternative of smart pointers that let the instances themselves have the information to keep track the sharing, such as the reference count, that although not in the standard library is easy to express in the language, as for example in the "boost" libraries intrusive_ptr.  Whenever sharing instances, this case potentially includes recursive references, that is, a sequence of references in which entity A references B, and so on, until some C references back to A.  There are many ways to deal with this issue, for example, the designer can use knowledge of the data structure to decide to make some references non-owning.  Hence, the software can be designed in such a way that there is a subset of the references that are ownership references.  This can be expressed in C++ through std::shared_ptr and std::weak_ptr.

People may object that "intrusive" solutions, that is, that force the designer to do things in particular ways, are a form of undesirable micro management of implementation details.  Since a garbage collector tracks accessibility of all the things in a program, it relieves of having to specify how the references should be accounted for, there is no need for the programmer to ask "is this an ownership reference or not?".  That is a valid objection.  Furthermore, implementing the concept of "accessibility" with pure C++ resources is nearly impossible because the language does not have introspection capabilities (what is technically referred to as "reflection", which I think is a misnomer) to query things for what they might be accessing.  However, the language is expressive enough that it is possible to create type adapters that provide the introspection facilities needed to implement accessibility and thus emulation of garbage collection is feasible in C++, just not for any arbitrary types, nor without some effort.  Still, the objective is not to prove that if the programmer wants tracing garbage collection they can have it in C++, but to prove GCs are no good.  That is, if a programmer does not want to specify which references are owning, that may be legitimate, and it can be supported in C++.  But is it legitimate in the strong sense of creating production, critical software without them?

There is an unresolved question: whether software designers should have the freedom to not having to specify the nature of their references.  My own opinion is that they should; however, in the practical sense, I seldom work with potentially recursive data structures (recursive in the sense that dynamic objects end of referring to themselves), and even when I do, I find easy the task of deciding on the nature of the references.  It can be that most of the software I make is very conscious of performance and thus I am more often needing ways to tell exactly what I want, then the task of deciding on the nature of references is almost inherent for me.  I think the opinion of requiring some minimum of detail from the designer about what the code does is legitimate.  Is an implementation really complete if it can't even decide whether there can be valid recursion, or without having identifying owning and non-owning references?  I submit that the question is valid.  My own opinion is affirmative (an implementation is complete even without specifying the nature of references), but I don't find burdensome to always have to decide the nature of the references.  Thus, although I have not proved whether it is legitimate to not specify the nature of references, in practice I've seen it is not burdensome to do it.

An important characteristic of C++ is that it lets programmers specify the desired mechanisms to accomplish anything to almost ridiculous levels of fine grained detail, and sometimes, even to specify mechanisms that are less specific than what the language allows for arbitrary types.  One example is the Variant types, which might be implemented using the technique of type erasure; that is, Variant types are such that the designer willingly reduces the specificity of the type.  Another example would be to implement introspection capabilities and garbage collection on values of those types with introspection capabilities.

Then there is the cost of completely relieving the programmer from identifying the nature of references.  Not just the performance cost, which is significant, but other inherent costs, including non-determinism and how this increases the propensity of errors managing resources.

The first issue, of non-determinism, occurs because if the programmer did not make explicit the lifetime of the dynamic memory, the execution environment can not manage the memory deterministically.  This changes the observable results of a program if dynamically allocated memory is associated to resources:  For example, a file may be closed before or after the program accesses a database, only because the dynamic memory object associated with the file was reclaimed before or after the data base access.

This leads to the crux of the problem:  In my opinion garbage collection makes programs inherently unreliable.  To prove this opinion, a detour:

Memory is just one kind of resource.  Actually, the most homogeneous and easy to manage of all types of resources; as opposed to files, databases, network connections, user interactions, process/thread synchronization objects.

If the software is not a pure computation, such as an artificial intelligence application looking for an improved algorithm, but something that is influencing the real world as it runs; then the software will need to manage things that make changes in the real world; these are modelled as resources in software.  Languages that have garbage collection tend to not provide any mechanism for explicit management of memory, and by extension, any type of resource; in particular, if a language with garbage collecting would offer very good mechanisms to deterministically handle the lifetime of resources, then those mechanisms would be practical for memory too, negating the value of garbage collection.  In practice, they simply don't.  It is actually hard to guarantee in languages with garbage collection something as trivial as that database connections will be closed before some other thing happens in the program; it is hard, because the guarantee must be coded explicitly, manually, and implementation detail changes in practice routinely break those guarantees.  The end result expresses in practice in things like you unblock the screen of your smartphone and the thing freezes for several seconds, only because unlocking the screen required a little bit more memory than what was available, and this triggered the deep garbage collection.  Too bad you wanted to take a picture and the moment passed while the smartphone was in garbage collection.

If a language does not have garbage collection, it forces the programmer to at least identify ownership relationships.  This can be done correctly, it is just a higher level of specificity as was described before.

In particular, C++ is absolutely wonderful in supporting resources:  It has the concept of the destructor, which offers the symmetrical guarantees the constructors provide.  It baffles me that object oriented experts can readily understand the need to establish the class invariants (the constructors) but fail to realize the symmetrical need to guarantee the de-construction of objects.

In C++, resources can be handled through their ownership relationships:  When the lifetime of the last owner alive of a resource comes to its end, the lifetime of the resource also comes to an end.  This ownership relationship is every bit as powerful as the relationship of accessibility; as a matter of fact, it arguably is the very same relationship.  Just that the concept of destructor makes it so that this relationship is managed deterministically.

In the interest of completeness, given the strength of my claims, I must refer to cases in which reference counting is harmful as compared to reconstructing the accessibility relationship.  One example: Imagine a recursive function in which the tail recursion optimization is applicable.  Furthermore, imagine the most nested invocation of the function uses a data item that refers to a data item in the calling context:  Since the function is being terminated, the compiler knows at once that all of the stack frames are being destroyed; however, it won't destroy them all at once, but one by one:  The destructor of the data item in the most nested invocation will trigger the destruction of the calling frame and so on.

Going back to the issue of management of resources, my C++ experience proves the concept of ownership is sufficient to drive reliable management of resources.  In the absence of this concept, or its excellent support in C++, there is no alternative but to implement the management of resources in ad-hoc ways, that is, inherently error-prone ways.

In the end, ownership relationships require the programmer to specify them.  Should they do, then the software will almost always outperform garbage collected alternatives, and will be much more reliable

TL;DR;:
There are legitimate cases in which garbage collection is superior, however, in practice it is truly important that the language offers good facilities to handle the lifetime of resources; garbage collection assumes the opposite position that the lifetime of the simplest of resources, memory, can be reliably handled by the execution environment, thus, in garbage collected languages, all sorts of resources end up being mismanaged, making them inherently unreliable.


Monday, June 8, 2015

Introduction

C++ is extremely difficult to master, but it is worth mastering.

This new blog will deal mostly with how to master it and why it is worth.

I hope it will be useful for people for us to write the things that took us lots of efforts to discover and understand.  There is enough reference information out there, but not so much guidance about how to use the extremely quirky feature set of this language.

I think the experts stand to benefit from sharing their hard earned lessons with the community. There are some concepts, techniques, that to present them to the general public requires us to transcend their level of understanding from merely the intuition to use it in code to the level of mastery.

Also, C++ suffers from a not sufficiently developed community, that's why the tools for C++ are not capable of doing what the tools for other languages are capable of doing. Yes, this is a much more complicated language, thus it is to be expected the tools are going to be harder to develop, however, the fact remains the tools can't do what they do for other languages and thus a stronger community is better; also, there is a lot of noise: People misusing features easy to misuse hitting subtle problems that drag the community.

Finally, I hope using the rich feature set of C++ will be a creative experience for the users, and as a creative experience, the unexpected and interesting is going to emerge. If words in this blog helped you to discover something interesting, please share back.

The Zoo metaphor refers to the weird variety of features.