Performance Management Guide
You should report operating system performance problems to IBM
support. Use your normal software problem-reporting channel. If
you are not familiar with the correct problem-reporting channel for your
organization, check with your IBM representative.
Support personnel need to determine whether a reported problem is a functional
problem or a performance problem. When an application, a hardware
system, or a network is not behaving correctly, this is referred to as a
functional problem. For example, an application or a system
with a memory leak has a functional problem.
Sometimes functional problems lead to performance problems. For
example, the functions might produce correct results but run slowly, or
communication might be slowed because networks or name servers are down.
In these cases, rather than tune the system, it is more important to
determine the root cause of the problem and fix it.
Support personnel often receive problem reports stating that someone has a
performance problem on the system and providing some data analysis.
This information is insufficient to accurately determine the nature of a
performance problem. The data might indicate 100 percent CPU
utilization and a high run queue, but that may have nothing to do with the
cause of the performance problem.
For example, a system might have users logged in from remote terminals over
a network that goes over several routers. The users report that the
system is slow. Data might indicate that the CPU is very heavily
utilized. But the real problem could be that characters appear on their
terminals only after long delays because packets are being lost on
the network (which could be caused by failing routers or overloaded
networks). This situation might have nothing to do with the CPU
utilization on the machine. If, on the other hand, the complaint were
that a batch job on the system was taking a long time to run, then CPU
utilization or I/O bandwidth might be related.
Before you attempt to collect or analyze data, always obtain as much
detail as possible about the performance problem by asking the following
questions:
- Can the problem be demonstrated by running a specific command or
reconstructing a sequence of events (for example, ls /slow/fs or
ping xxxxx)? If not, describe the least complex example of the problem.
- Is the slow performance intermittent? Does the system get slow and then
recover for a while? Does the slowdown occur at certain times of the day or
in relation to some specific activity?
- Is everything slow or only some things?
- What aspect is slow? For example, is it the time to echo a character, the
elapsed time to complete a transaction, or the time to paint the screen?
- When did the problem start occurring? Has the situation been the same ever
since the system was first installed or went into production? Did anything
change on the system before the problem occurred (such as adding more users or
migrating additional data to the system)?
- In a client/server environment, can the problem be demonstrated when the
workload runs locally on the server? (This distinguishes a network issue from
a server issue.)
- If the problem is network related, how are the network segments configured
(including bandwidth, such as 10 Mb/sec or 9600 baud)? Are there any routers
between the client and server?
- What vendor applications are running on the system, and are those
applications involved in the performance issue?
- What is the impact of the performance problem on the users?
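The first two questions (can the problem be demonstrated with a specific command, and is it intermittent) are easier to answer with timestamped measurements than with impressions. The following is a minimal sketch, not an IBM-supplied tool: it runs one command repeatedly and records the time of day, the host name (so that the data can be tied to the correct machine), and the elapsed time of each run. The function names and the 3x threshold used to flag intermittency are hypothetical choices made for illustration; /slow/fs is the example path from the question above.

```python
import socket
import subprocess
import time
from datetime import datetime


def time_command(argv):
    """Run argv once and return its elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # Output is discarded; only the elapsed time matters here.
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def sample_command(argv, runs=5, pause=1.0):
    """Time argv `runs` times, `pause` seconds apart.

    Returns a list of (timestamp, hostname, seconds) tuples so each
    sample records when and where it was taken.
    """
    host = socket.gethostname()
    samples = []
    for i in range(runs):
        stamp = datetime.now().isoformat(timespec="seconds")
        samples.append((stamp, host, time_command(argv)))
        if i < runs - 1:
            time.sleep(pause)
    return samples


def is_intermittent(seconds, factor=3.0):
    """Flag runs as intermittent if the slowest is `factor` times the fastest.

    The factor of 3 is a hypothetical cutoff, not a documented value.
    """
    fastest = min(seconds)
    return fastest > 0 and max(seconds) >= factor * fastest


if __name__ == "__main__":
    # Substitute the command that demonstrates your own problem.
    results = sample_command(["ls", "/slow/fs"], runs=3, pause=0.5)
    for stamp, host, secs in results:
        print(f"{stamp} {host} {secs:.3f}s")
    print("intermittent:", is_intermittent([s for _, _, s in results]))
```

Running a sampler like this at different times of day gives concrete data for the "when does it occur" question, and the recorded host name guards against mixing up data from different machines.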
When someone reports a performance problem, it is not enough just to gather
data and then analyze it. Without knowing the nature of the performance
problem, you might waste a lot of time analyzing data that has nothing
to do with the problem being reported.
Before you contact support personnel to report a problem, prepare the
information that you will be asked to supply, so that the problem can be
investigated promptly. Your local support personnel will attempt
to solve your performance problem quickly, working directly with you.
Three further ways you can help to get the problem resolved faster
are:
- Provide a clear written statement of a simple, specific instance of the
problem, but be sure to separate the symptoms and facts from theories,
ideas, and your own conclusions. PMRs (Problem Management Records) that
report "the system is slow" require extensive investigation to determine
what you mean by slow, how it is measured, and what is acceptable performance.
- Provide information about everything that has changed on the system in the
weeks before the problem began. Omitting a change can block a
possible investigation path and will only delay finding a resolution.
If all the facts are available, the performance team can quickly eliminate the
unrelated ones.
- Make sure that you supply information from the correct machine. At very
large sites, it is easy to accidentally collect the data on the wrong machine,
which makes the problem very hard to investigate.
When you report the problem, supply the following basic information:
- A problem description that can be used to search the problem-history
database to see if a similar problem has already been reported.
- What aspect of your analysis led you to conclude that the problem is due
to a defect in the operating system?
- What is the hardware and software configuration in which the problem is
occurring?
  - Is the problem confined to a single system, or does it affect multiple
  systems?
  - What are the models, memory sizes, and number and sizes of disks on
  the affected systems?
  - What kinds of LAN and other communications media are connected to the
  systems?
  - Does the overall configuration include other operating systems?
- What are the characteristics of the program or workload that is
experiencing the problem?
  - Does an analysis with the time, iostat, and vmstat commands indicate
  that it is CPU-limited or I/O-limited?
  - What kind of workload runs on the affected systems: workstation,
  server, multiuser, or a combination?
- What are the performance objectives that are not being met?
  - Is the primary objective in terms of console or terminal response time,
  throughput, or real-time responsiveness?
  - Were the objectives derived from measurements on another system? If so,
  what was its configuration?
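The question about whether a workload is CPU-limited or I/O-limited usually starts with the real, user, and sys times that the time command reports. A common rule of thumb is that if user plus sys time is close to real (elapsed) time, the program spent most of its time computing. If it is much smaller, the program spent most of its time waiting, often on I/O. The sketch below is an illustration only. It assumes a UNIX-like system with Python's resource module. The names measure and classify and the 0.8 cutoff are hypothetical, not values taken from this guide. The measurements come from getrusage on child processes and only approximate what the time command prints.

```python
import resource
import subprocess
import time


def measure(argv):
    """Run argv and return (real, user, sys) in seconds.

    Roughly the same three figures the time command reports; CPU times
    are the change in accumulated child-process usage.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=False)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (real,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def classify(real, user, sys_time, cpu_ratio=0.8):
    """Rule-of-thumb classification from time-style figures.

    CPU time (user + sys) close to elapsed time suggests CPU-limited.
    Much less suggests the program was waiting. cpu_ratio is a
    hypothetical cutoff. A multithreaded program on a multiprocessor
    can exceed a ratio of 1.0; that still counts as CPU-limited here.
    """
    if real <= 0:
        raise ValueError("real time must be positive")
    ratio = (user + sys_time) / real
    if ratio >= cpu_ratio:
        return "CPU-limited"
    return "waiting (possibly I/O-limited)"


if __name__ == "__main__":
    real, user, sys_time = measure(["ls", "/"])
    print(f"real {real:.3f}s user {user:.3f}s sys {sys_time:.3f}s")
    print(classify(real, user, sys_time))
```

The label "waiting" is hedged on purpose. A program that mostly sleeps, or that waits on the network, on locks, or on paging, also shows low CPU time relative to elapsed time. For example, measuring sleep 2 would be classified as waiting even though it does no I/O. The iostat and vmstat data are still needed to show where the time actually went.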
If this is the first report of the problem, you will receive a PMR number
for use in identifying any additional data you supply and for future
reference.