Effective Proactive Debugging Techniques: It's All About the Tools

Overview

There are two basic ways to debug your program: retroactively (by stepping through it or analyzing core dumps) or proactively (by including code that analyzes and logs the state of your application as it's running). This article will focus on proactive debugging tools and techniques that you can implement on any platform.

Just Say "No" To Debuggers

Don't get me wrong: IDEs and debuggers are extremely helpful for stepping through and analyzing your code. But you don't always have the luxury of an IDE or debugger, such as on a live system that you can't tweak or take down. You may also be working with a networked application that depends on asynchronous systems outside of your control.

My point here is that by the time you break out the debugger, the error has already occurred, the program has terminated, and the client is wondering if you really know what you're doing.

Wouldn't it be great if your program could:

Catch potential errors proactively, i.e. before they happen
Report errors and warnings behind the scenes so the user either isn't bothered with technical jargon or doesn't even see a message if they don't need to
Log info that you want logged, not just a massive core dump that may not include the state of external systems
Recover gracefully whenever possible

Let's cover each of these in more detail.

Catching Errors Proactively

In a nutshell, this involves adding code that obsessively checks for error conditions and acts accordingly. First, you'll want to make sure you're testing the return value of every function. This is Programming 101, and should go without saying. For example, here is bad code that doesn't check the return value of a function:

int *a = malloc(1024);
a[0] = 'x';

If you don't know what's wrong with this code then you probably shouldn't be reading the rest of this article.

Here's an improved version of the above code:

int *a = malloc(1024);
if (NULL == a) {
    fprintf(stderr, "Unable to allocate 1024 bytes\n");
    exit(EXIT_FAILURE);
}
a[0] = 'x';

This is better, but all it does is print an error message and exit. This is good enough for a school project, but won't (or shouldn't) fly in the real world. Let's fix it up a bit:

int *a = malloc(1024);
if (NULL == a) {
    // Log the error
    report_fatal_error("Unable to allocate 1024 bytes");
} else {
    a[0] = 'x';
}

What's this report_fatal_error() function and why is it any better than printing an error message and exiting, as in the previous example?

In this example, report_fatal_error() is a function written by you. It can be implemented any way you want; in a development environment it may print a message and exit, while in a production environment it may log the error to a log file and continue running (if possible).

Sure, not every error condition is recoverable, but we'll get to that later.
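To make that concrete, here is a minimal sketch of what report_fatal_error() could look like in C. The article doesn't prescribe an implementation, so everything here is an assumption: the development_mode switch, the error_log stream, and the choice to return a status code are all invented for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical settings, chosen for this sketch only. */
static int   development_mode = 0;     /* 1: print and exit; 0: log and keep going */
static FILE *error_log        = NULL;  /* NULL means "fall back to stderr" */

/* Records an error the caller considers fatal.
 * Returns 0 if the message was written, -1 if the write failed. */
int report_fatal_error(const char *message)
{
    FILE *out = (error_log != NULL) ? error_log : stderr;
    int written = fprintf(out, "FATAL: %s\n", message);
    fflush(out);

    if (development_mode) {
        exit(EXIT_FAILURE);  /* fail loudly while developing */
    }
    return (written < 0) ? -1 : 0;
}
```

In production you would point error_log at an opened log file during startup; the caller then decides whether it can carry on after the function returns.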

Would a "try…catch" block work just as well as "if…then"? Certainly! The point here is that at every possible point of failure you should be able to identify the module that failed, the precise line of code where the error occurred, and any local variables or other relevant data.

Reporting Errors

Now that you've caught an error, what do you do with it? Report it immediately! Depending on the severity of the error, this could involve something as simple as ignoring it completely (not likely), writing a diagnostic message to a log file (very likely), or abending immediately (to be avoided if at all possible).

Deciding what events could happen and how severe they are is left as an exercise for the reader since it will depend largely on the nature of your application. At minimum you'll want two general categories:

User error
System error

More likely you'll want to break the latter into two separate categories, resulting in:

User error
Recoverable system error
Unrecoverable system error

As for where to report errors, generally two types of people will want to know about them:

Users
Programmers

For user errors you'll obviously want to display a message that the user can understand.

For recoverable system errors you may or may not want to alert the user but you'll definitely want to alert the programmer (more on this later).

For unrecoverable system errors you will want to alert both the user and the programmer.
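The three categories and the two audiences can be captured in a few lines of C. This is only a sketch: the type and function names are made up, and it assumes the user is not alerted for recoverable system errors and the programmer is not alerted for user errors (your application may decide otherwise).

```c
#include <stdbool.h>

/* The three categories listed above. */
typedef enum {
    USER_ERROR,
    RECOVERABLE_SYSTEM_ERROR,
    UNRECOVERABLE_SYSTEM_ERROR
} error_category;

/* Who should hear about an error (hypothetical type for this sketch). */
typedef struct {
    bool alert_user;
    bool alert_programmer;
} audience;

/* Maps an error category to the people who should be told about it. */
audience audience_for(error_category category)
{
    audience who = { false, false };
    switch (category) {
    case USER_ERROR:
        who.alert_user = true;
        break;
    case RECOVERABLE_SYSTEM_ERROR:
        who.alert_programmer = true;  /* assumption: the user is left alone */
        break;
    case UNRECOVERABLE_SYSTEM_ERROR:
        who.alert_user = true;
        who.alert_programmer = true;
        break;
    }
    return who;
}
```

Keeping the routing decision in one function means the rest of the code only has to classify the error; nobody has to remember who gets told about what.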

Logging

What to log

The more information you have about each error, the easier it will be to diagnose. So, what will you want to log?

Date and time the error occurred
Module and line number where the error occurred
State of the call stack
State of global variables (although you're smart enough to minimize or even eliminate the use of global variables from your programs, right?)
Who caused the error (user who was logged in at the time)

Where to log it

Two places you might want to log errors would be:

database
text file

Which of these options you choose will depend on which ones are available to you.

The advantage of logging to a database is that the log is easily searchable and sortable; the advantage of logging to a flat file is that it's usually simpler and the data is available immediately (no DB connection required).

Another issue to consider is how to log an error to the database if the error involves not being able to connect to the database to begin with. I prefer a layered approach: first attempt to log to the database; if that doesn't work then attempt to log to a text file; if that doesn't work then attempt to write to the console.
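One way to sketch that layered approach in C is to treat each destination as a function with the same signature and try them in order. All of the names below are hypothetical, "app.log" is a placeholder file name, and log_to_database() is a stub that always fails, standing in for whatever database code you actually have.

```c
#include <stddef.h>
#include <stdio.h>

/* A log sink returns 0 on success and non-zero on failure. */
typedef int (*log_sink)(const char *message);

/* Stub: pretends the database is unreachable. Replace with real DB code. */
int log_to_database(const char *message)
{
    (void)message;
    return -1;
}

/* Appends the message to a text file ("app.log" is a placeholder name). */
int log_to_file(const char *message)
{
    FILE *f = fopen("app.log", "a");
    if (f == NULL) {
        return -1;
    }
    int ok = fprintf(f, "%s\n", message) >= 0;
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Last resort: write to the console. */
int log_to_console(const char *message)
{
    return (fprintf(stderr, "%s\n", message) >= 0) ? 0 : -1;
}

/* Tries each sink in order. Returns the index of the first sink that
 * succeeded, or -1 if every sink failed. */
int log_with_fallback(const log_sink *sinks, size_t count, const char *message)
{
    for (size_t i = 0; i < count; i++) {
        if (sinks[i](message) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

A caller would build the chain once, for example log_sink sinks[] = { log_to_database, log_to_file, log_to_console };, and pass it to log_with_fallback() with a count of 3. The return value tells you which layer actually took the message, which is itself worth knowing.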

Messages

Writing a helpful error message

Error messages can be incredibly useful, horribly useless, or anywhere in between depending on how they're worded. Useful error messages give as many details as possible about the nature of the error. Here is an example of a useless error message:

Error!

Some of you reading this will laugh because that's a silly example; others will laugh harder because they've actually seen error messages like this.

Let's improve it a bit:

Error! $a is 0

Okay, this tells us more about the nature of the error, but it doesn't tell us what $a represents or what it should be. Let's improve it some more:

Error! $a (number of articles) is '0' but expected an integer > 0

The obvious changes here are that I've indicated what $a represents (number of articles) and what the expected value should be. Can you spot the other, less obvious change? I've added single quotes around the value of $a (in this case 0). Why? So that in case $a contained extraneous characters (such as spaces) it would be more obvious. Also, I want to know whether $a is 0, null, false, blank, or something else. Logically there may be no difference between 0, null, false, blank and whitespace, but to the programmer the difference between one and the other can provide valuable clues as to where the error was introduced.

All too often the extra garbage that causes errors is not visible in every context. This includes high ASCII characters, control characters (carriage returns, linefeeds, tabs, nulls), and HTML tags and entities. Programmers who try to print out these values verbatim will often never see the problem because it's masked by filtering along the way, such as when your text editor strips blanks or your Web browser interprets HTML tags. If you want to display the real underlying data you often have to resort to displaying it in non-standard ways.
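As an example of a "non-standard way", here is a small C sketch that wraps a value in single quotes and shows every non-printable or high-ASCII byte as \xNN. The function name and the \xNN notation are my own choices for this illustration, and it assumes the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Copies `len` bytes of `value` into `out`, wrapped in single quotes, with
 * every byte that is not printable ASCII (and the backslash itself) written
 * as \xNN. Output is truncated to fit `out_size`. Returns `out`. */
char *visible_copy(const char *value, size_t len, char *out, size_t out_size)
{
    size_t pos = 0;

    if (out_size == 0) {
        return out;
    }
    out[0] = '\0';
    if (out_size < 3) {
        return out;  /* no room for the two quotes plus the terminator */
    }

    out[pos++] = '\'';
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)value[i];
        size_t need = (isprint(c) && c != '\\') ? 1 : 4;

        /* Keep room for the closing quote and the terminating NUL. */
        if (pos + need + 2 > out_size) {
            break;
        }
        if (need == 1) {
            out[pos++] = (char)c;
        } else {
            snprintf(out + pos, 5, "\\x%02X", (unsigned)c);
            pos += 4;
        }
    }
    out[pos++] = '\'';
    out[pos] = '\0';
    return out;
}
```

With this, a value of "0 " followed by a linefeed shows up in the log as '0 \x0A', so both the trailing space and the linefeed are impossible to miss. Because the length is passed explicitly, embedded nulls are shown too.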

Anyway, let's continue improving our error message. We're still missing crucial information about the error: when and where.

2006-10-22 14:26:45: ERROR: $a (number of articles) is '0' but expected an integer > 0 on line 5 of /foo/bar.baz

In this case, "when" is the date and time, and "where" is the file name and line number.
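In C, the "when" and "where" can be filled in for you rather than typed by hand. The sketch below is one way to do it, using the standard __FILE__ and __LINE__ macros and strftime(); the names format_error and LOG_ERROR are hypothetical, and the message layout simply mirrors the example above.

```c
#include <stdio.h>
#include <time.h>

/* Writes "YYYY-MM-DD HH:MM:SS: ERROR: <message> on line N of <file>" to out.
 * Returns the length snprintf reports, or -1 if the time can't be formatted.
 * Note: localtime() is not thread-safe. */
int format_error(char *out, size_t out_size, time_t when,
                 const char *file, int line, const char *message)
{
    char stamp[32];
    struct tm *local = localtime(&when);

    if (local == NULL ||
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", local) == 0) {
        return -1;
    }
    return snprintf(out, out_size, "%s: ERROR: %s on line %d of %s",
                    stamp, message, line, file);
}

/* Captures the current time and the call site automatically. */
#define LOG_ERROR(message)                                          \
    do {                                                            \
        char line_[512];                                            \
        if (format_error(line_, sizeof line_, time(NULL),           \
                         __FILE__, __LINE__, (message)) >= 0) {     \
            fprintf(stderr, "%s\n", line_);                         \
        }                                                           \
    } while (0)
```

The macro is what makes this work: because it expands at the call site, __FILE__ and __LINE__ refer to the place where the error was detected, not to the logging code.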

Of course, your language and application may provide additional data or meta-data that would be useful for debugging that you might want to display; those details are left as an exercise for you, since I obviously don't know what platform you're running.

Attributes of a useful error message

Your error message should contain, at minimum, the following:

What went wrong? (Technical problem, plus explanation in plain English)
What should have happened? (e.g. expected value(s))
When did the error occur? (Date/time)
Where did it occur? (File/line)

Recovering Gracefully

In case you haven't guessed, this is one of my pet peeves. Too many programmers simply cause their programs to exit when they're unable to perform a required function such as connecting to a database. At best it's ugly; at worst it's rude, inconvenient and unprofessional.

For example:

$db = mysql_connect() or die(mysql_error());
$result = mysql_query("SELECT SESSION_USER(), CURRENT_USER();");

I would fix this code as follows:

Wrap the call to mysql_connect() with my own wrapper function that tries several times to connect before failing
Log the error so the programmer can see it and take action

Here is my updated code:

$db = connect_to_db();
if ($db) {
    $result = mysql_query("SELECT SESSION_USER(), CURRENT_USER();");
    // Do more stuff here
} else {
    report_fatal_error('Unable to connect to DB.');
}
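The body of connect_to_db() isn't shown here, but the "try several times before failing" idea is the same in any language. Here is a rough sketch of it in C, to stay consistent with the earlier malloc() examples. Every name in it is hypothetical, and flaky_connect() is only a stand-in for a real driver call so that the retry logic can be exercised.

```c
#include <stddef.h>

/* A connection attempt returns a handle, or NULL on failure. */
typedef void *(*connect_fn)(void *context);

/* Calls `attempt` up to `max_attempts` times and returns the first non-NULL
 * handle, or NULL if every attempt failed. If `attempts_made` is not NULL it
 * receives the number of attempts actually made. */
void *connect_with_retries(connect_fn attempt, void *context,
                           int max_attempts, int *attempts_made)
{
    void *handle = NULL;
    int tries = 0;

    while (tries < max_attempts && handle == NULL) {
        handle = attempt(context);
        tries++;
    }
    if (attempts_made != NULL) {
        *attempts_made = tries;
    }
    return handle;
}

/* Stand-in for a real connection call: fails until the counter that
 * `context` points to reaches zero, then "succeeds". */
void *flaky_connect(void *context)
{
    int *remaining_failures = context;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return NULL;
    }
    return context;  /* any non-NULL pointer serves as the handle here */
}
```

A real wrapper would probably pause between attempts and log each failure, so the programmer can see how often the first try is not enough. The number of attempts is worth recording even when the connection eventually succeeds.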

Summary

Instead of waiting for errors to happen and then debugging core dumps, write code that actively looks for problematic situations at runtime, reports them in a way that will help you fix them, and continues executing to the best of its abilities.
