6 reasons your development team should be using instant messaging

The ElectricAccelerator development team sits at desks less than 30 feet apart, but despite our close proximity, we don’t often speak to one another. To an outside observer this may seem to be a sign of disfunction in the team — after all, developers have to communicate to work effectively. Some people think we’re obviously not communicating, but the truth is that we’re not obviously communicating! That’s because we use instant messaging for most of our communications, including status updates, technical collaboration and even code reviews, rather than face-to-face conversations. I believe this has made my team more connected and more productive. Here are six reasons why instant messaging trumps face-to-face conversations for software teams.

1. Logging

The key advantage of instant messaging is that all conversations are logged automatically. As a result I’ve got records of every conversation with every member of my team for the past two years. That’s proven invaluable on a few occasions, to provide additional context for decisions made weeks or months earlier. Obviously this is not a replacement for other types of project documentation, but it is a fantastic supplement.

2. Non-intrusive

The second most important advantage of instant messaging is that it’s relatively non-intrusive, at least compared to a face-to-face conversation. We all know how important it is to get into and preserve a state of flow when programming. Spoken conversations, by social convention, command your immediate attention — effectively an interrupt of the highest order. When somebody comes to my desk to ask me something in person, they are implicitly saying, “What I have to say to you is more important than anything else you might be doing right now.” Sometimes that’s true, but many times it’s not. And yet every time somebody initiates a face-to-face conversation with me, it destroys whatever flow I might have developed.

In contrast, instant messaging allows me to defer a response until I reach a good breaking point, so people can ask questions without interrupting me.

3. Non-disruptive

Our office has an open floor plan, which means that instead of individual offices or cubicles, we have a single big room. This layout worked very well when the company had only 6 people, who were all working on the same project. Now the company employs over 100 people, with two separate development teams working on completely different products, so the open layout doesn’t work quite so well. Conversations between other people can be very distracting when you’re heads down on a tricky technical problem. By using instant messaging instead of face-to-face conversations, we significantly reduce the distraction for our collegues.

4. Simultaneous conversations

Carrying on multiple face-to-face conversations on disparate topics is practically impossible, but doing the same via instant messenger is simple. Every IM client I’ve seen displays the last several messages of each active conversation, so you have context when a new message arrives. That signficiantly reduces the mental burden associated with each conversation, so it becomes possible to sustain several simultaneously. I often have five conversations “active” during the work day, and sometimes even more.

5. Consistency

Unlike face-to-face conversations, IM works well regardless of the relative locations of the conversants. That means that it doesn’t matter if my colleague is in the office with me, or working from home, or working from a customer site, or halfway around the world. I can use the same tool to communicate with them, which in turn means I don’t have to change the way I work to accomodate changes in the way they are working.

6. Versatility

One final advantage of instant messaging compared to face-to-face conversation is the versatility of the medium. I can trivially share a code fragment with somebody via IM, or a link to an online resource. Try doing that in a face-to-face conversation: “Yeah, you should check out the STL reference docs, at aich tee tee pee colon slash slash double you double you double you dot …”.

Instant messaging: give it a try

If you’re not already using instant messaging in your development team, give it a try. There are multiple free IM services out there, and there are good free IM clients on every platform, including smart phones, so you’ve really got nothing to lose — but you might gain a more efficient, productive team. It worked for us.

Why is SCons so slow?

UPDATE: If you’re coming from Why SCons is not slow, you should read my response

A while back, I did a series of posts exploring the performance of SCons on builds of various sizes. The results were dismal: SCons demonstrated a classic O(n2) growth in runtime, meaning that the length of the build grew in proportion to the square of the number of files in the build, rather than linearly as one would hope. Naturally, that investigation and its results provoked a great deal of discussion at the time and since. Typically, SCons advocates fall back on one particular argument: “Sure, SCons may be slow,” they say, “but that’s the price you pay for a correct build.” Recently, Eric S. Raymond wrote an article espousing this same fundamental argument, with the addition of some algorithmic analysis intended to prove mathematically that a correct build, regardless of the build tool, must necessarily exhibit O(n2) behavior — a clever bit of circular logic, because it implies that any build tool that does not have such abyssmal performance must not produce correct builds!

Naturally, after spending nearly a decade developing a high-performance replacement for GNU make, I couldn’t let that statement stand. This post is probably going to be on the long side, so here’s the tl;dr summary:

  • You can guarantee correct builds with make, provided you follow best practices.
  • The worst-case runtime of any build tool if, of course, O(n2), but most, if not all, builds can be handled in O(n) time, without sacrificing correctness.
  • SCons’ performance problem is caused by design and implementation decisions in SCons, not some pathology of build structure.

What is required to ensure a correct build?

One of the fundamental tenents of the pro-SCons mythos is the idea that it is unique in its ability to guarantee correct builds. In reality, SCons is not doing anything particularly special in this regard. It’s true that by virtue of its design SCons makes it easier to get it right, but there’s nothing keeping you from enjoying the same assurances in make.

First: what is a correct build? Simply put, a correct build is one in which everything that ought to be built, is built. Note that by definition, a from-scratch build is correct, since everything is built in that case. So the question of “correct” or “incorrect” is really only relevant in regards to incremental builds.

So, what do we need in order to ensure a correct incremental build? Only three things, actually:

  1. A single, full-build dependency graph.
  2. Complete dependency information for every generated file.
  3. A reliable way to determine if a file is up-to-date relative to its inputs.

What SCons has done is made it more-or-less impossible, by design, to not have these three things. There is no concept like recursive make in the SCons world, so the only option is a single, full-build dependency graph. Likewise, SCons automatically scans input files in several programming languages to find dependency information. Finally, SCons uses MD5 checksums for the up-to-date check, which is a pretty darn reliable way to verify whether a given file needs to be rebuilt.

But the truth is, you can guarantee correct builds with make — you just have to adhere to long-standing best practices for make. First, you have to avoid using recursive make. Then, you need to add automatic dependency generation. The only thing that’s a little tricky is the up-to-date check: make is hardwired to use file timestamps, which can be spoofed, deliberately or accidentally — although to be fair, in most cases, timestamps are perfectly adequate. But even here, there’s a way out. You can use a smarter version of make that has a more sophisticated up-to-date mechanism, like ElectricMake or ClearMake. You can even shoehorn MD5 checksums into GNU make, if you like.

I can’t deny that SCons has made it easier to get correct builds. But the notion that it can’t be done with make is simply absurd.

What is the cost of a correct build?

Now we turn to the question of the cost of ensuring correctness. At its core, any build tool is just a collection of graph algorithms — first contructing the dependency graph, then traversing it to find and update out-of-date files. These algorithms have well-understood complexity, typically given as O(n + e), where n is the number of nodes in the graph, and e is the number of edges. It turns out that e is actually the dominant factor here, since it is at least equal to n, and at worst as much as n2. That means we can simplify the complexity to O(n + n2), or just O(n2).

Does this absolve SCons of its performance sins? Unfortunately it does not, because O(n2) is a worst-case bound — you should only expect O(n2) behavior if you’ve got a build that has dependencies between every pair of files. Think about that for a second. A dependency between every. pair. of. files. Here’s what that would look like in makefile syntax:

all: foo bar foo.c bar.c foo.h bar.h
foo:     bar foo.c bar.c foo.h bar.h
bar:         foo.c bar.c foo.h bar.h
foo.c:             bar.c foo.h bar.h
bar.c:                   foo.h bar.h
foo.h:                         bar.h

It’s ridiculous, right? I don’t know about you, but I’ve certainly never seen a build that does anything even remotely like that. In particular, the builds I used in my benchmarks don’t look like that. Fortunately, those builds are small and simple enough that we can directly count the number of edges in the dependency graph. For example, the smallest build in my tests consisted of:

2,000 C sources
+ 2,004 headers
+ 2,000 objects
+ 101 libraries
+ 100 executables

6,205 total files

So we have about 6,000 nodes in the graph, but how many edges does the graph contain? Lucky for us, SCons will print the complete dependency graph if we invoke it with scons –tree=all:

+-.
  +-SConstruct
  +-d1_0
  | +-d1_0/SConstruct
  | +-d1_0/f00000_sconsbld_d1_0
  | | +-d1_0/f00000_sconsbld_d1_0.o
  | | | +-d1_0/f00000_sconsbld_d1_0.c
  | | | +-d1_0/lup001_sconsbld_d1_0/f00000_sconsbld_d1_0.h
  ...

The raw listing contains about 35,000 lines of text, but that includes duplicates and non-dependency information like filesystem structure. Filter that stuff out and you can see the graph contains only about 12,000 dependencies. That’s a far cry from the 1,800,000 or so you would expect if this truly were a “worst-case” build. It’s clear, in fact, that the number of edges is best described as O(n).

Although I don’t know how (or even if it’s possible) to prove that this is the general case, it does make a certain intuitive sense: far from being strongly-connected, most of the nodes in a build dependency graph have just one or two edges. Each C source file, for example, has just one outgoing edge, to the object file generated from that source. Each object file has just one outgoing edge too, to the library or executable the object is part of. Sure, libraries and headers probably have more edges, since they are used by multiple executables or objects, but the majority of the stuff in the graph is going to fall into the “small handful of edges” category.

Now, here’s the $64,000 question: if the algorithms in a build tool scale in proportion to the number of edges in the dependency graph, and we’ve just shown that the dependency graph in question has O(n) edges, why does SCons use O(n2) time to execute the build?

Why is SCons so slow?

SCons’ O(n2) performance stems from its graph traversal implementation. Essentially, SCons scans the entire dependency graph each time it is looking for a file to update. n scans of a graph with O(n) nodes and edges equals an O(n2) graph traversal. There’s no mystery here. In fact, the SCons developers are clearly aware of this deficiency, as described on their wiki:

It’s worth noting that the Jobs module calls the Taskmaster once for each node to be processed (i.e., it’s O(n)) and the Taskmaster has an amortized performance of O(n) each time it’s called. Thus, the overall time is O(n^2).

But despite recognizing this flaw, they severely misjudged its impact, because they go on to state that it requires a “pathological” dependency graph in order to elicit this worst-case behavior from SCons. As we’ve shown here and in previous posts, even a terribly mundane dependency graph elicits O(n2) behavior from SCons. I shudder to think what SCons would do with a truly pathological dependency graph!

Obviously the next question is: why does SCons do this? That’s not quite as easy for me to explain, as an outside observer. To the best of my understanding, they rescan the graph just in case new dependencies are added to the dependency graph while evaluating a node in the graph — remember, in SCons the commands to update a file are expressed in Python, so they can easily manipulate the dependency graph even while the build is running.

Is it really necessary to rescan the dependency graph over and over? I don’t think so. In fact, make is proof that it is not necessary. I think there are two ways that SCons could address this problem: first, it could adopt GNU make’s convention of partitioning the build into distinct phases, one that updates dependency information, and a second that actually executes the build. In GNU make, that strategy allows for the introduction of new dependency information, while imposing only a one-time O(n) cost for restarting the make process if any new dependencies are found.

Alternatively, SCons could probably be made smarter about when a full rescan is required. Most of the time, even if new dependencies are added to the graph, they are added to the node being evaluated, not to nodes that were already visited. That is, when you scan a source file for implicit dependencies, you find the dependencies for that file not for other files in the build (duh). So most of the time, a full rescan is massive overkill.

The final word…?

Hopefully this is my last post on the subject of SCons performance. It is clear to me that SCons does not scale to large projects, and that the problem stems from design and implementation decisions in SCons, rather than some pathology in the build itself. You can get comparable guarantees of correctness from make, if you’re willing to invest the time to do things the right way. The payoff is a build system that is not only correct but has vastly better performance than SCons as your project grows. Why wouldn’t you want that?

HOWTO: ship a custom kernel driver for Linux

Pop quiz, hotshot: your company has developed a Linux kernel driver as part of its product offering. How do you deliver this driver such that your product is compatible with a significant majority of the Linux variants you are likely to encounter in the field? Consider the following:

  • RedHat Enterprise Linux 4 is based on kernel version 2.6.9
  • RHEL 5 is based on kernel version 2.6.18
  • RHEL 6 is based on 2.6.32
  • openSUSE 11.0 is based on 2.6.25
  • openSUSE 11.1 is based on 2.6.27
  • Ubuntu 9.04 is based on 2.6.28
  • Ubuntu 9.10 is based on 2.6.31
  • Ubuntu 10.04 is based on 2.6.32
  • Ubuntu 10.10 is based on 2.6.35

I could go on, but hopefully you get the point — “Linux” is not a single, identifiable entity, but rather a collection of related operating systems. And thus the question: how do you ship your driver such that you can install and use it on a broad spectrum of Linux variants? This is a problem that I’ve had to solve in my work.

Fundamentally, the solution is simple: ship the driver in source form. But that answer isn’t much help unless you can make your driver source-compatible with a wide range of kernel versions, spanning several years of Linux development. The solution to that problem is simple too, in hindsight, and yet I haven’t seen it used or described elsewhere: test for specific kernel features using something like a configure script; set preprocessor macros based on the results of the tests; and use the macros in the driver source to conditionally include code as needed. But before I get into the details of this solution, let’s look briefly at a few alternative solutions and why each was rejected.

Rejected alternatives: how NOT to ship a custom driver for Linux

Based on my informal survey of the state-of-the-art in this field, it seems there are three common approaches to solving this problem:

  1. Arrange for your driver to be bundled with the Linux kernel. If you can pull this off, fantastic! You’ve just outsourced the effort of porting your driver to the people who build and distribute the kernel. Unfortunately, kernel developers are not keen on bundling drivers that are not generally useful — that is, your driver has to have some utility outside of your specific application, or you can forget getting it bundled into the official kernel. Also, if you have any interesting IP in your driver, open-sourcing it is probably not an option.
  2. Prebuild your driver for every conceivable Linux variant. If you know which Linux variants your product will support, you could build the driver for each, then choose one of the prebuilt modules at installation time based on the information in /etc/issue and uname -r. VMWare uses this strategy — after installing VMWare Workstation, take a look in /usr/lib/vmware/modules/binary: you’ll find about a hundred different builds of their kernel modules, for various combinations of kernel versions, distributions and SMP-status. The trouble with this strategy is that it adds significant complexity to your build and release process: you need a build environment for every one of those variants. And all those modules bloat your install bundle. Finally, no matter how many distro’s you prebuild for, it will never be enough: somebody will come along and insist that your code install on their favorite variant.
  3. Ship source that uses the LINUX_VERSION_CODE and KERNEL_VERSION macros. These macros, defined by the Linux kernel build system, allow you to conditionally include code based on the version of the kernel being built. In theory this is all you need, if you know which version introduced a particular feature. But there are two big problems. First, you probably don’t know exactly which version introduced each feature. You could figure it out with some detective work, but who’s got the time to do that? Second, and far more troublesome, most enterprise Linux distributions (RHEL, SUSE, etc.) backport features and fixes from later kernels to their base kernel — without changing the value of LINUX_VERSION_CODE. Of course that renders this mechanism useless.

genconfig.sh: a configure script for kernel modules

Conceptually, genconfig.sh works the same way as an autoconf configure script: it uses a series of trivial test programs to check for different kernel features or constructs. The success or failure of each test to compile determines whether the corresponding feature is present, and by extension whether or not a particular bit of code ought to be included in the driver.

For example, in some versions of the Linux kernel (2.6.9, eg), struct inode includes a member called i_blksize. If present, this field should be set to the blocksize of the filesystem that owns the inode. It’s used in the implementation of the stat(2) system call. It’s a minor detail, but if you’re implementing a filesystem driver, it’s important to get it right.

We can determine whether or not to include code for this field by trying to compile a trivial kernel module containing just this code:

#include <linux/fs.h>
void dummy(void)
{
    struct inode i;
    i.i_blksize = 0;
    return;
}

If this code compiles, then we know to include code for managing the i_blksize field. We can create a header file containing a #define corresponding to this knowledge:

#define HAVE_INODE_I_BLKSIZE

Finally, the driver code uses that definition:

#ifdef HAVE_INODE_I_BLKSIZE
  inode->i_blksize = FS_BLOCKSIZE;
#endif

We can construct an equally trivial test case for each feature that is relevant to our driver. In the end we get a header with a series of defines, something like this:

#define HAVE_INODE_I_BLKSIZE
#define HAVE_3_ARG_INT_POSIX_TEST_LOCK
#define HAVE_KMEM_CACHE_T
#define HAVE_MODE_IN_VFS_SYMLINK
#define HAVE_3_ARG_PERMISSION
#define HAVE_2_ARG_UMOUNT_BEGIN
#define HAVE_PUT_INODE
#define HAVE_CLEANUP_IN_KMEM_CACHE_CREATE
#define HAVE_WRITE_BEGIN
#define HAVE_ADDRESS_SPACE_OPS_EXT
#define HAVE_SENDFILE
#define HAVE_DENTRY_IN_FSYNC

By referencing these definitions in the driver source code, we can make it source-compatible with a wide range of Linux kernel versions. To add support for a new kernel, we just have to determine which changes affect our module, write tests to check for those features, and update only the affected parts of our driver source.

This is more nimble, and far more manageable, than shipping prebuilt binaries for an endless litany of kernel variants. And it’s much more robust than relying on LINUX_VERSION_CODE: rather than implicitly trusting that a feature is present or absent based on an unreliable version string, we know for certain whether that feature is present, because we explicitly tried to use it.

Belt and suspenders: ensuring the driver works correctly

Now we have a strategy for shipping a driver that will build and load on a broad array of Linux variants. But this approach has introduced a new problem: how can we be sure that this driver that was just auto-configured and compiled on-the-fly will actually work as expected?

The solution to this problem has two components. First, we identified about a dozen specific Linux variants that are critical to our customers. The driver is exhaustively tested on each of these “tier 1” variants in every continuous integration build — over 3,000 automated unit tests are run against the driver on each. Of course, 12 variants is only a tiny fraction of the thousands of permutations that are possible, but by definition these variants represent the most important permutations to get right. We will know immediately if something has broken the driver on one of these variants.

Next, we ship a stripped down version of that unit test suite, and execute that automatically when the driver is built. This suite has only about 25 tests, but those tests cover every major piece of functionality — a reasonable compromise between coverage and simplicity. With this install-time test suite, we’ll know if there’s a problem with the driver on a particular platform as soon as somebody tries to install it.

Demonstration code

For demonstration purposes I have placed a trivial filesystem driver on my github repo. This driver, base0fs, was generated using the FiST filesystem generator, patched to make use of the genconfig.sh concept.

Eternal vigilance and multithreaded programming

Given the following:

  • A multithreaded application written in C++, using the STL.
  • A class Mutex, which is a thin wrapper around a pthread_mutex_t.
  • A class Lock, which implements the RAII pattern for acquiring and releasing a Mutex.
  • A class FileInfo, which stores information about a file. The getSize() method retrieves the current size of the file. Because another thread may be changing the size of the file, getSize() is internally locked.
  • A class Cache, which manages a map of filenames to FileInfo objects.

Can you spot the bug in this bit of code?

class Cache {
protected:
    typedef std::map<string, FileInfo *> map_type;

    Mutex mLock;
    map_type mMap;

public:
    void insert(const string& filename, FileInfo *info)
    {
        // Add a new file to the cache.

        Lock lock(mLock);
        mMap.insert(make_pair(filename, info));
    }

    void invalidate(const string& filename)
    {
        // Eject a file from the cache; this only
        // invalidates the cache entry, it does not
        // delete the associated FileInfo.

        Lock lock(mLock);
        map_type::iterator i = mMap.find(filename);
        if (i != mMap.end()) {
            mMap.erase(i);
        }
    }

    int64_t getSize(const string& filename)
    {
        // Get the current size of a file in the cache.

        map_type::iterator i;

        {
            // Grab the cache lock for the lookup.

            Lock lock(mLock);
            i = mMap.find(filename);
            if (i == mMap.end()) {
                return 0;
            }
        } // release cache lock

        // Relay the getSize() request to the FileInfo
        // object; this must happen <i>after</i> releasing
        // the cache lock, because FileInfo::getSize() 
        // can block.

        return i->second->getSize();
    }
};

Even if you’re familiar with the STL and multithreaded programming, the defect can be hard to spot. In fact, this one escaped my notice until a recent mysterious core dump alerted me to its presence, despite having read this code several times for other reasons.

Here’s the bug: in Cache::getSize(), the iterator i is accessed outside the protection of the Cache lock. You may think that this is safe, since the lookup occurs inside the lock. But suppose we have two threads running simultaneously, one calling getSize(“foo”), and the other calling invalidate(“foo”):

Thread A: getSize(“foo”) Thread B: invalidate(“foo”)
{
    Lock lock(mLock);
    i = mMap.find(filename);
    if (i == mMap.end()) {
       return 0;
    }
} // release cache lock
Lock lock(mLock);
i = mMap.find(filename);
if (i != mMap.end()) {
    mMap.erase(i);
}
return i->second->getSize();

By the time Thread A uses the iterator returned by map::find(), the call to invalidate() has, well, invalidated it. I suspect that the author of this code remembered that map::erase() invalidates the iterator used in the erase() itself, but not the corollary that map::erase() also invalidates any other iterator that references the same element.

If you think about how std::map is implemented, it’s obvious that erase() should have this property. Internally, std::map usually uses some type of balanced tree. With that foundation, the simplest implementation of std::map::iterator is just a thin wrapper around a pointer to a node in the tree. map::erase() must delete that node (otherwise, when would it be deleted?), so accessing an iterator after it has been erase‘d is literally a free memory read, an obvious memory management error.

In this case, the solution is simple: just extract the FileInfo pointer from the iterator before releasing the Cache lock, as follows:

    int64_t getSize(const string& filename)
    {
        // Get the current size of a file in the cache.

        FileInfo *info;

        {
            // Grab the cache lock for the lookup.

            Lock lock(mLock);
            map_type::iterator i = mMap.find(filename);
            if (i == mMap.end()) {
                return 0;
            }
            info = i->second;
        } // release cache lock

        // Relay the getSize() request to the FileInfo
        // object; this must happen <i>after</i> releasing
        // the cache lock, because FileInfo::getSize() 
        // can block.

        return info->getSize();
    }

Eternal vigilance

The interesting thing about this bug is that it would have been nearly impossible to detect with testing alone. This is why code reviews are so important in our profession, and why projects that really demand bug-free code rely so heavily on them. As they say, “Eternal vigilance is the price of multithreaded programming.” With that, I’m off to review some more code. Shouldn’t you?