Performance Management Guide

Determining the Kind of Performance Problem Reported

When a performance problem is reported, determining the kind of performance problem often helps the performance analyst to narrow the list of possible culprits.

A Particular Program Runs Slowly

Although this situation might seem trivial, there are still questions to be asked:

Has the program always run slowly?
If the program has just started running slowly, a recent change might be the cause.
Has the source code been changed or a new version installed?
If so, check with the programmer or vendor.
Has something in the environment changed?
If a file used by the program (including its own executable) has been moved, the program may now be experiencing local area network (LAN) delays that did not exist previously. Alternatively, files that were previously on different disks may now be contending for a single disk accessor.

If the system administrator has changed system-tuning parameters, the program may be subject to constraints that it did not experience previously. For example, if the schedtune -r command has been used to change the way priority is calculated, programs that used to run rather quickly in the background may now be slowed down, while foreground programs have sped up.
Is the program written in perl, awk, csh, or some other interpreted language?
Interpreted languages allow programs to be written quickly, but their code is not optimized by a compiler. Also, it is easy in a language like perl or awk to request an extremely compute- or I/O-intensive operation with a few characters. It is often worthwhile to perform a desk check or informal peer review of such programs, with the emphasis on the number of iterations implied by each operation.
Does the program always run at the same speed, or is it sometimes faster?
The file system uses some of system memory to hold pages of files for future reference. If a disk-limited program is run twice in quick succession, it will normally run faster the second time than the first. Similar phenomena might be observed with programs that use NFS. This can also occur with large programs, such as compilers. The program's algorithm might not be disk-limited, but the time needed to load a large executable program might make the first execution of the program much longer than subsequent ones.
If the program has always run slowly, or has slowed down without any obvious change in its environment, look at its dependency on resources.
Identifying the Performance-Limiting Resource describes techniques for finding the bottleneck.
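
The caching effect described above can be observed directly by running the same command twice in quick succession and comparing the elapsed times. The following is a minimal sketch, not part of the original guide. The name time_twice is hypothetical, and the sketch assumes that `date +%s` is available on the system, which limits the resolution to whole seconds.

```shell
# Hypothetical helper: run a command twice and print the elapsed
# wall-clock seconds of each run.
# Assumption: `date +%s` is supported (whole-second resolution only).
time_twice() {
    for run in first second; do
        start=$(date +%s)
        "$@" > /dev/null 2>&1      # discard the command's own output
        end=$(date +%s)
        echo "$run run: $((end - start))s"
    done
}
```

For a disk-limited command that reads a large file that is not yet cached, the second run would typically report a shorter time than the first. For finer resolution, the time command can be used on each run instead.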

Everything Runs Slowly at a Particular Time of Day

Most people have experienced the rush-hour slowdown that occurs because a large number of people in the organization habitually use the system at one or more particular times each day. This phenomenon is not always simply due to a concentration of load. Sometimes it is an indication of an imbalance that is (at present) only a problem when the load is high. Other recurring events in the system should also be considered as possible causes.

If you run the iostat and netstat commands for a period that spans the time of the slowdown (or have previously captured data from your monitoring mechanism), are some disks much more heavily used than others? Is the CPU idle percentage consistently near zero? Is the number of packets sent or received unusually high?
- If the disks are unbalanced, see Monitoring and Tuning Disk I/O Use.
- If the CPU is saturated, use the ps or topas commands to identify the programs being run during this period. The script given in Using the vmstat, iostat, netstat, and sar Commands simplifies the search for the heaviest CPU users.
- If the slowdown is counter-intuitive, such as paralysis during lunch time, look for a pathological program such as a graphic xlock or game program. Some versions of the xlock program are known to use huge amounts of CPU time to display graphic patterns on an idle display. It is also possible that someone is running a program that is a known CPU burner and is trying to run it at the least intrusive time.
- Unless your /var/adm/cron/cron.allow file is empty, you may want to check the contents of the /var/adm/cron/crontab directory for expensive operations.
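
When checking crontab files for expensive operations, it can help to list only the entries scheduled for the hour in which the slowdown occurs. The following is a rough sketch under stated assumptions: cron_jobs_at_hour is a hypothetical helper, it assumes the conventional five time fields followed by a command, and it understands only `*`, single values, comma-separated lists, and ranges in the hour field.

```shell
# Hypothetical helper: print crontab entries whose hour field covers HOUR.
# Usage: cron_jobs_at_hour HOUR FILE...
cron_jobs_at_hour() {
    hour=$1
    shift
    awk -v h="$hour" '
        /^[ \t]*#/ || NF < 6 { next }         # skip comments and non-entries
        {
            n = split($2, parts, ",")         # hour field may be a list
            for (i = 1; i <= n; i++) {
                if (parts[i] == "*") { print FILENAME ": " $0; next }
                m = split(parts[i], r, "-")   # or a range such as 8-17
                lo = r[1] + 0
                hi = (m == 2) ? r[2] + 0 : lo
                if (h + 0 >= lo && h + 0 <= hi) { print FILENAME ": " $0; next }
            }
        }
    ' "$@"
}
```

Pointing the function at the crontab files in the directory named above, with the hour of the slowdown as the first argument, gives a short list of candidates to review. Reading other users' crontab files normally requires root authority.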

If you find that the problem stems from conflict between foreground activity and long-running, CPU-intensive programs that are, or should be, run in the background, consider using the command schedtune -r -d to give the foreground higher priority. See Tuning the Thread-Priority-Value Calculation.

Everything Runs Slowly at Unpredictable Times

The best tool for this situation is an overload detector, such as the filtd daemon (a component of the Performance Toolbox, PTX). The filtd daemon can be set up to execute shell scripts or collect specific information when a particular condition is detected. You can construct a similar, but more specialized, mechanism using shell scripts containing the vmstat, iostat, netstat, sar, and ps commands.
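
A shell-script mechanism of the kind just described might look like the following minimal sketch. It is not the filtd daemon and does not reproduce its features. The name overload_watch is hypothetical, and the sketch assumes that the first column of each vmstat data line is the run-queue length (the r column).

```shell
# Hypothetical watcher: read vmstat-style lines on stdin and run a hook
# command whenever the first column (run queue, "r") exceeds LIMIT.
# Usage: vmstat 5 | overload_watch LIMIT HOOK [ARGS...]
overload_watch() {
    limit=$1
    shift
    while read -r runq rest; do
        case $runq in
            ''|*[!0-9]*) continue ;;     # skip headers and non-numeric lines
        esac
        if [ "$runq" -gt "$limit" ]; then
            # The hook receives the observed value as its last argument.
            # Its stdin is detached so it cannot consume the vmstat stream.
            "$@" "$runq" < /dev/null
        fi
    done
}
```

The hook could be a script that appends a timestamp and the output of the ps command to a log file, so that the processes active at the moment of overload can be examined later. The appropriate limit depends on the number of processors and the normal workload.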

If the problem is local to a single system in a distributed environment, there is probably a pathological program at work, or perhaps two that intersect randomly.

Everything That an Individual User Runs is Slow

Sometimes a system seems to "single out" an individual.

Quantify the problem. Ask the user which commands they use frequently, and run those commands with the time command, as in the following example:
```
# time cp .profile testjunk
real    0m0.08s
user    0m0.00s
sys     0m0.01s
```
Then run them under a satisfactory user ID. Is there a difference in the reported real time?
A program should not show much CPU time (user+sys) difference from run to run, but may show a real time difference because of more or slower I/O. Are the user's files on an NFS-mounted directory? On a disk that has high activity for other reasons?
Check the user's .profile file for unusual $PATH specifications. For example, if you always search a couple of NFS-mounted directories (fruitlessly) before searching /usr/bin, everything will take longer.
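
The $PATH check can be partly automated. The following sketch lists the directories of a PATH-style string in search order and flags entries that do not exist or that appear to be remote mounts. The name audit_path is hypothetical, and the remote test rests on an assumption: that `df -P` shows a remote file system as host:/path in its first column.

```shell
# Hypothetical helper: list PATH entries in search order with a status.
# Assumption: `df -P` prints remote file systems as "host:/path" in column 1.
# Usage: audit_path "$PATH"
audit_path() {
    printf '%s\n' "$1" | tr ':' '\n' | {
        i=0
        while IFS= read -r dir; do
            i=$((i + 1))
            [ -n "$dir" ] || dir=.       # an empty entry means the current directory
            if [ ! -d "$dir" ]; then
                status=missing
            else
                fs=$(df -P "$dir" 2>/dev/null | awk 'NR == 2 { print $1 }')
                case $fs in
                    *:*) status=remote ;;
                    *)   status=local ;;
                esac
            fi
            echo "$i $status $dir"
        done
    }
}
```

Run it under the affected user ID. Entries marked remote or missing that appear before /usr/bin are candidates for reordering or removal.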

A Number of LAN-Connected Systems Slow Down Simultaneously

There are some common problems that arise in the transition from independent systems to distributed systems. The problems usually result from the need to get a new configuration running as soon as possible, or from a lack of awareness of the cost of certain functions. In addition to tuning the LAN configuration in terms of maximum transmission units (MTU) and mbufs (see Monitoring and Tuning Communications I/O Use), look for LAN-specific pathologies or nonoptimal situations that may have evolved through a sequence of individually reasonable decisions.

Use network statistics to ensure that there are no physical network problems. Ensure that commands such as netstat -v, entstat, tokstat, atmstat, or fddistat do not show excessive errors or collisions on the adapter.
Some types of software or firmware bugs can sporadically saturate the LAN with broadcast or other packets.
When a broadcast storm occurs, even systems that are not actively using the network can be slowed by the incessant interrupts and by the CPU resource consumed in receiving and processing the packets. These problems are better detected and localized with LAN analysis devices than with the normal performance tools.
Do you have two LANs connected through a system?
Using a system as a router consumes large amounts of CPU time to process and copy packets. It is also subject to interference from other work being processed by the system. Dedicated hardware routers and bridges are usually a more cost-effective and robust solution to the need to connect LANs.
Is there a clearly defensible purpose for each NFS mount?
At some stages in the development of distributed configurations, NFS mounts are used to give users on new systems access to their home directories on their original systems. This situation simplifies the initial transition, but imposes a continuing data communication cost. It is not unknown to have users on system A interacting primarily with data on system B and vice versa.

Access to files through NFS imposes a considerable cost in LAN traffic, client and server CPU time, and end-user response time. A general guideline is that user and data should normally be on the same system. The exceptions are those situations in which an overriding concern justifies the extra expense and time of remote data. Some examples are a need to centralize data for more reliable backup and control, or a need to ensure that all users are working with the most current version of a program.

If these and other needs dictate a significant level of NFS client-server interchange, it is better to dedicate a system to the role of server than to have a number of systems that are part-server, part-client.
Have programs been ported correctly (and justifiably) to use remote procedure calls (RPCs)?
The simplest method of porting a program into a distributed environment is to replace program calls with RPCs on a 1:1 basis. Unfortunately, the disparity in performance between local program calls and RPCs is even greater than the disparity between local disk I/O and NFS I/O. Assuming that the RPCs are really necessary, they should be batched whenever possible.
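
For the first item in this list, the check for adapter errors, a quick screening pass can be scripted before turning to the detailed per-adapter commands. The following is a sketch under stated assumptions rather than a definitive tool. It reads `netstat -i` output on standard input and assumes that each data row ends with the five columns Ipkts, Ierrs, Opkts, Oerrs, and Coll. Verify this against the header line on your system. The name iface_error_report is hypothetical.

```shell
# Hypothetical check: flag interfaces whose input or output error ratio
# exceeds PCT percent.
# Assumption: each data row ends with Ipkts Ierrs Opkts Oerrs Coll. Columns
# are counted from the right because the Address column can be empty.
# Usage: netstat -i | iface_error_report PCT
iface_error_report() {
    awk -v pct="$1" '
        NF < 7 || $NF !~ /^[0-9]+$/ { next }   # skip header and malformed rows
        {
            ipkts = $(NF-4); ierrs = $(NF-3)
            opkts = $(NF-2); oerrs = $(NF-1)
            irate = (ipkts > 0) ? 100 * ierrs / ipkts : 0
            orate = (opkts > 0) ? 100 * oerrs / opkts : 0
            if (irate > pct || orate > pct)
                printf "%s in=%.2f%% out=%.2f%%\n", $1, irate, orate
        }
    '
}
```

An interface can appear on more than one row, so the same name may be reported more than once. Any interface that is flagged deserves a closer look with the adapter-specific commands named above.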

Everything on a Particular Service or Device Slows Down at Times

If everything that uses a particular device or service slows down at times, refer to the topic that covers that particular device or service.

Make sure you have followed the configuration recommendations in the appropriate subsystem manual and the recommendations in the appropriate "Monitoring and Tuning" chapter of this book.