Data Center Works Inc


The Story of Thud and Blunder

Once upon a time, our company was porting a large auditing program from Sequent to Sun. We had handed it off to the Quality Assurance team, who had a copy of the production environment set up, and they started to run it in parallel to the production version on the Sequent, using an exact copy of today's real data.

Three hours into an 11-hour run it crashed with a segmentation violation. QA was not happy with us.

All the World's Not a Vax

The bug was the dereferencing of a null pointer, p:

while (p->q != 0) {
        ...
        p++;
}

This loop really should have started with

while (p && p->q != 0) {

but the authors had been trying to get the last little bit of performance they could out of the program, and hadn't wanted the compiler to generate the extra

cmp     %o0,%g0
be      .L101

that the check for a null required. And, sure enough, this same bug was everywhere in the code.
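As a minimal sketch of the guarded loop: the article does not show the real declarations, so the struct entry type and the count_entries function below are hypothetical, and the array is assumed to end with an element whose q field is zero. Because p++ never produces a null pointer, the extra test only matters on entry to the loop.

```c
#include <stddef.h>

/* Hypothetical record type; the article does not show the real one. */
struct entry {
    int q;   /* zero marks the end of the array */
};

/*
 * Count entries up to the terminator. The "p &&" test is the check
 * the original authors left out: without it a null p is dereferenced.
 */
size_t count_entries(const struct entry *p)
{
    size_t n = 0;

    while (p && p->q != 0) {
        n++;
        p++;
    }
    return n;
}
```

With the check in place, count_entries(NULL) returns 0 instead of faulting.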

The erroneous code worked on a Vax, because Vaxes had a page of zeros residing at address zero of the virtual memory. Dereferencing a null pointer returned a zero.

Sequent had thought that was a good idea and copied it, and programmers who had learned on a Vax could continue to save two instructions every time they wrote an if or while checking a pointer.

Other vendors laid out their virtual memory in different ways, and often didn't have zeros at address zero, so dereferencing zero and crashing became known as the "all the world's a Vax" bug.

This was bad: it looked like we'd need to re-inspect every if and while in the program if we wanted to be sure we had them all. And, of course, there wouldn't be any guarantee that we wouldn't make a mistake in the inspection and miss a bug or two. Which would make us look less than brilliant to the QA manager.

A Page of Nothing, Nowhere

There was one well-known work-around: Edsel Adap, when he was at Sun, had written a library that placed a page of zeros at virtual address zero, using mmap:

fd = open("/dev/zero", O_RDONLY);
/* map one read-only page of zeros at virtual address 0 */
mmap(0, (size_t) PAGESIZE, PROT_READ, MAP_PRIVATE|MAP_FIXED, fd, 0);

We could load this library with LD_PRELOAD=0@0 and then run the program, and it wouldn't SIGSEGV from null pointers.

However, this wouldn't tell us where the problems actually were, so we wouldn't be able to fix them. It was only a workaround, and the QA team would not be impressed.

Detecting and Diagnosing the Problem

What we did instead was to write a more complicated library. It started the program normally, and then watched for segmentation violations. If the violation came from dereferencing a null pointer, it would print a stack trace and save a voluntary core file, and then map in the page of zeros.

Now when the program ran, it would catch itself when it made an error, record where it went wrong, and then work around the error. This was much more acceptable to the QA team, as they could see that the program would not fail in future production, even if we didn't find every possible instance of the bug. And it wasn't a “debugging” version of the program, which they had had bad experiences with in the past.

So over the next week, the QA version ran each night with the diagnostic library loaded, and it reported a steady stream of erroneous dereferences. Each morning we'd read the log, see what bugs had been found and fix them, then submit the source changes and new binary to QA.

Over the same week we also worked on the library, improving it with more diagnostics and giving it a name, thud. By the end of the week it was running the commands

echo "Thud:
additional information follows, and a core will be \
written to /usr/tmp/thud_core.$PID.1"
pstack $PID
pmap -x $PID
psig $PID
pfiles $PID
pwdx $PID
ptree $PID
gcore -o /var/tmp/thud_core.$PID.1 $PID

whenever it found an error.

The completed thud library, for Solaris and other operating systems using the ELF object format, is available on our tools page.

Right at the end of this exercise, though, we had another SIGSEGV that wasn't from dereferencing zero. Instead it occurred when the stack was overrun.

The Strncpy Blunder

Because the stack was smashed, it would normally be hard to find the culprit, but the report from thud included the address, and we found it was in a sprintf which immediately followed a call to strncpy.

Strncpy was written to fill in fixed-length fields in /var/adm/utmp, and it only null-terminates strings which are shorter than the specified length. If, for example, a username is 32 characters or more, the ut_user field in the utmpx file will not be null-terminated.

struct utmpx {
        char ut_user[32];      /* user login name */
        char ut_id[4];         /* inittab id */
        ...

If one only uses strncpy for fixed-length strings which don't require termination, there is no problem. Alas, all too many people use it as a standard string function, expecting it to null-terminate strings of any length. Then they wonder why their stack is corrupted.
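The termination rule is easy to demonstrate. The sketch below uses a hypothetical function, field_is_terminated, that fills a 4-byte field the same size as ut_id and reports whether a terminator ended up inside it.

```c
#include <string.h>

/*
 * Copy src into a 4-byte field the way ut_id would be filled, and
 * report whether the field ended up containing a terminating '\0'.
 */
int field_is_terminated(const char *src)
{
    char field[4];

    memset(field, 'x', sizeof field);    /* make stale contents visible */
    strncpy(field, src, sizeof field);   /* fixed-length field semantics */
    return memchr(field, '\0', sizeof field) != NULL;
}
```

A source shorter than the field, such as "ab", is terminated and padded with nulls. A source of four characters or more fills the field completely and leaves no terminator, so any later use of the field as a C string reads past its end.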

So it looked like we would need to re-inspect the program for every use of strncpy and strncat, and change them to strlcpy and strlcat, the proper functions.

Since we had had good luck with using a diagnostic library before, we decided to create one that would diagnose improper uses of strncpy and strncat, so we could identify the incorrect uses of strncpy.

The function we created looked like this:

/*
 * strncpy -- copy strings up to N characters
 * ProgName and reportError() are defined elsewhere in the library.
 */
char *
strncpy(char *dest, const char *src, size_t destsize) {
        size_t len;     /* strlcpy returns size_t, not unsigned */

        if ((len = strlcpy(dest, src, destsize)) >= destsize) {
                (void) fprintf(stderr, "%s: "
                        "error in strncpy(0x%p, \"%s\", %lu)\n",
                        ProgName, (void *)dest, src,
                        (unsigned long)destsize);
                if (len == destsize) {
                        (void) fprintf(stderr, "    "
                                "Attempt to create a "
                                "string exactly %lu long, "
                                "which will be unterminated.\n",
                                (unsigned long)destsize);
                }
                else {
                        (void) fprintf(stderr, "    "
                                "Attempt to overflow dest string.\n");
                }
                (void) fprintf(stderr, "    "
                        "Truncated to %lu characters and "
                        "null-terminated.\n",
                        (unsigned long)(destsize-1));

                (void) reportError();
        }
        return dest;
}

Because strlcpy returns the length of the source string, it's easy to check and see if the length is sufficient to overflow the destination.
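The check rests on nothing more than the return value. Since strlcpy is not present in every C library, the sketch below uses a hypothetical stand-in, bounded_copy, written to follow the same contract (copy at most destsize-1 bytes, always terminate, return the length of the source), together with a hypothetical was_truncated helper.

```c
#include <stddef.h>
#include <string.h>

/*
 * bounded_copy: hypothetical stand-in with strlcpy's contract, for
 * systems whose C library lacks strlcpy. Copies at most destsize-1
 * bytes, always terminates when destsize > 0, returns strlen(src).
 */
size_t bounded_copy(char *dest, const char *src, size_t destsize)
{
    size_t srclen = strlen(src);

    if (destsize > 0) {
        size_t n = (srclen < destsize) ? srclen : destsize - 1;

        memcpy(dest, src, n);
        dest[n] = '\0';
    }
    return srclen;
}

/* The copy was truncated when the source plus terminator did not fit. */
int was_truncated(size_t returned_len, size_t destsize)
{
    return returned_len >= destsize;
}
```

Copying "hello" into a 4-byte buffer stores "hel" and returns 5, and 5 >= 4 flags the truncation. A returned length exactly equal to the buffer size is the case where plain strncpy would have left the string unterminated.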

We ran that using LD_PRELOAD, and found several instances of incorrect use of strncpy and strncat, and one proper use, filling a fixed-length structure that would later be compared to another using strncmp. If we hadn't tested, we would have "fixed" that use of strncpy and the comparison would have given different results, creating a brand-new bug.

As we'd called the first program "thud" and the misuse of strncpy was something of a blunder, we named the new library blunder.

Conclusions

When faced with diagnosing hard problems in production, it's often useful to create diagnostic libraries. If you're not on a system with a production-safe tool like dtrace, it's especially desirable. You don't need the source of the erroneous program, and if the library is small, you can show the code to the operations team and reassure them it won't cause worse problems than the bug you're trying to diagnose.

In these two cases, we've used the library to change the behavior of the program, something that dtrace doesn't do. In both cases we were able to show the change to the operations (actually QA) team and show them that the changes to the running program were both harmless and desirable.

To make it easy for you, we've provided both the thud and blunder libraries on our