post

Eternal vigilance and multithreaded programming

Given the following:

  • A multithreaded application written in C++, using the STL.
  • A class Mutex, which is a thin wrapper around a pthread_mutex_t.
  • A class Lock, which implements the RAII pattern for acquiring and releasing a Mutex.
  • A class FileInfo, which stores information about a file. The getSize() method retrieves the current size of the file. Because another thread may be changing the size of the file, getSize() is internally locked.
  • A class Cache, which manages a map of filenames to FileInfo objects.

Can you spot the bug in this bit of code?

class Cache {
protected:
    typedef std::map<string, FileInfo *> map_type;

    Mutex mLock;
    map_type mMap;

public:
    void insert(const string& filename, FileInfo *info)
    {
        // Add a new file to the cache.

        Lock lock(mLock);
        mMap.insert(make_pair(filename, info));
    }

    void invalidate(const string& filename)
    {
        // Eject a file from the cache; this only
        // invalidates the cache entry, it does not
        // delete the associated FileInfo.

        Lock lock(mLock);
        map_type::iterator i = mMap.find(filename);
        if (i != mMap.end()) {
            mMap.erase(i);
        }
    }

    int64_t getSize(const string& filename)
    {
        // Get the current size of a file in the cache.

        map_type::iterator i;

        {
            // Grab the cache lock for the lookup.

            Lock lock(mLock);
            i = mMap.find(filename);
            if (i == mMap.end()) {
                return 0;
            }
        } // release cache lock

        // Relay the getSize() request to the FileInfo
        // object; this must happen <i>after</i> releasing
        // the cache lock, because FileInfo::getSize() 
        // can block.

        return i->second->getSize();
    }
};

Even if you’re familiar with the STL and multithreaded programming, the defect can be hard to spot. In fact, this one escaped my notice until a recent mysterious core dump alerted me to its presence, despite having read this code several times for other reasons.

Here’s the bug: in Cache::getSize(), the iterator i is accessed outside the protection of the Cache lock. You may think that this is safe, since the lookup occurs inside the lock. But suppose we have two threads running simultaneously, one calling getSize(“foo”), and the other calling invalidate(“foo”):

Thread A: getSize(“foo”) Thread B: invalidate(“foo”)
{
    Lock lock(mLock);
    i = mMap.find(filename);
    if (i == mMap.end()) {
       return 0;
    }
} // release cache lock
Lock lock(mLock);
i = mMap.find(filename);
if (i != mMap.end()) {
    mMap.erase(i);
}
return i->second->getSize();

By the time Thread A uses the iterator returned by map::find(), the call to invalidate() has, well, invalidated it. I suspect that the author of this code remembered that map::erase() invalidates the iterator used in the erase() itself, but not the corollary that map::erase() also invalidates any other iterator that references the same element.

If you think about how std::map is implemented, it’s obvious that erase() should have this property. Internally, std::map usually uses some type of balanced tree. With that foundation, the simplest implementation of std::map::iterator is just a thin wrapper around a pointer to a node in the tree. map::erase() must delete that node (otherwise, when would it be deleted?), so accessing an iterator after it has been erase‘d is literally a free memory read, an obvious memory management error.

In this case, the solution is simple: just extract the FileInfo pointer from the iterator before releasing the Cache lock, as follows:

    int64_t getSize(const string& filename)
    {
        // Get the current size of a file in the cache.

        FileInfo *info;

        {
            // Grab the cache lock for the lookup.

            Lock lock(mLock);
            map_type::iterator i = mMap.find(filename);
            if (i == mMap.end()) {
                return 0;
            }
            info = i->second;
        } // release cache lock

        // Relay the getSize() request to the FileInfo
        // object; this must happen <i>after</i> releasing
        // the cache lock, because FileInfo::getSize() 
        // can block.

        return info->getSize();
    }

Eternal vigilance

The interesting thing about this bug is that it would have been nearly impossible to detect with testing alone. This is why code reviews are so important in our profession, and why projects that really demand bug-free code rely so heavily on them. As they say, “Eternal vigilance is the price of multithreaded programming.” With that, I’m off to review some more code. Shouldn’t you?

%d bloggers like this: