Date: June 15, 2003
I'm often called in to fix a problem with as little information as "it's broke." In today's complex installations, finding what's "broke" can be like finding a needle in a haystack. As a result, it's common to see misdiagnosed problems.
When these situations arise, I use the following steps to help identify the root cause of the problem.
Define the problem precisely. The most effective problem definition includes "who, what, when, where, and how much." A single sentence is usually better than a long description because it forces you to focus on the real problem and to eliminate extraneous information.
For example, "All end users in Dallas running the payroll application experienced an increase in response time from 1 to 5 seconds starting on Monday at 10am."
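One way to make sure none of the five elements gets skipped is to record them as separate fields and build the sentence from them. The following is only a sketch: the ProblemDefinition class, its field names, and the sentence layout are my own illustration, not part of any tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemDefinition:
    """Hypothetical record for the five elements of a problem definition."""
    who: str
    what: str
    when: str
    where: str
    how_much: str

    def missing(self):
        """Return the names of any elements that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def sentence(self):
        """Combine the elements into a single-sentence definition."""
        return f"{self.who} in {self.where} {self.what} {self.how_much} starting {self.when}."


# The payroll example, split into its elements.
problem = ProblemDefinition(
    who="All end users running the payroll application",
    what="experienced an increase in response time",
    when="on Monday at 10am",
    where="Dallas",
    how_much="from 1 to 5 seconds",
)

if problem.missing():
    print("Still need:", ", ".join(problem.missing()))
else:
    print(problem.sentence())
```

If any field is blank, the sketch lists what still has to be asked before the diagnosis starts.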
Draw a picture showing all hardware and software components associated with the problem. For example, the paths for a payroll application might look something like:
Hardware Path: PC Browser -> Network -> Web Servers -> Appl Servers -> Database Server -> SAN
Software Path: Browser -> TCP/IP -> Payroll Application -> Database -> OS -> Multipath I/O -> SAN
Draw a timeline showing all changes to the components in the picture. This includes changes to hardware, the OS, microcode, and software, as well as upgrades, new applications, tuning, user load, data volume, etc. Indicate on the timeline when the problem started.
Identify possible causes based on the symptoms and the timing of the changes. The last change made before the problem started is the most likely cause, but there may be more than one possible cause.
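The timeline and the "last change first" rule can be sketched in a few lines of Python. Everything here is hypothetical: the Change record, the candidate_causes helper, and the change log entries are invented for illustration, and the ordering is only a starting point for investigation, not proof of a cause.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    """One entry on the timeline (hypothetical structure)."""
    when: datetime
    component: str
    description: str


def candidate_causes(changes, problem_start):
    """Return the changes made before the problem started, most recent first."""
    earlier = [c for c in changes if c.when <= problem_start]
    return sorted(earlier, key=lambda c: c.when, reverse=True)


# Invented change log for the payroll example.
changes = [
    Change(datetime(2003, 6, 2, 22, 0), "Database Server", "OS patch applied"),
    Change(datetime(2003, 6, 7, 23, 0), "SAN", "Microcode upgrade"),
    Change(datetime(2003, 6, 9, 8, 0), "Appl Servers", "Tuning change"),
    Change(datetime(2003, 6, 10, 9, 0), "Network", "Switch replaced"),
]
problem_start = datetime(2003, 6, 9, 10, 0)  # assumed start: a Monday at 10am

for change in candidate_causes(changes, problem_start):
    print(change.when, change.component, change.description)
```

The switch replacement is dropped because it happened after the problem began; the remaining changes are listed from most to least recent.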
Validate the cause of the problem. Does the "cause" match the observed symptoms? Where else would you expect the problem to occur? Is it occurring there? If not, this may not be the cause of the problem.
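The "where else would you expect it?" check can also be written down as a small comparison. This is a sketch under assumptions: the validate_cause helper, the billing application, and its component list are hypothetical, added only to give the payroll example something to compare against.

```python
def validate_cause(suspect, app_components, affected_apps):
    """Compare where a suspected cause should show up with where it does.

    Returns two sorted lists:
      healthy     - applications that use the suspect but show no problem
      unexplained - applications that show the problem but do not use the suspect
    """
    expected = {app for app, parts in app_components.items() if suspect in parts}
    healthy = sorted(expected - affected_apps)
    unexplained = sorted(affected_apps - expected)
    return healthy, unexplained


# Hypothetical component lists; "billing" is invented for comparison.
app_components = {
    "payroll": {"Network", "Web Servers", "Appl Servers",
                "Payroll Application", "Database Server", "SAN"},
    "billing": {"Network", "Web Servers", "Appl Servers",
                "Billing Application", "Database Server", "SAN"},
}
affected_apps = {"payroll"}  # only payroll users report slow response

for suspect in ("SAN", "Payroll Application"):
    healthy, unexplained = validate_cause(suspect, app_components, affected_apps)
    verdict = "consistent" if not healthy and not unexplained else "doubtful"
    print(suspect, verdict, healthy, unexplained)
```

In this invented case the SAN is a doubtful cause, because billing shares it and is not affected, while a cause inside the payroll application itself is consistent with the symptoms.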