When a performance problem is reported, determining the kind of performance problem often helps the performance analyst to narrow the list of possible culprits.
Although this situation might seem trivial, there are still questions to be asked:
If the program has just started running slowly, a recent change might be the cause.
If so, check with the programmer or vendor.
If a file used by the program (including its own executable) has been moved, the program may now be experiencing Local Area Network (LAN) delays that did not exist previously. Or, files that were previously on different disks may now be contending for a single disk accessor.
If the system administrator has changed system-tuning parameters, the program may be subject to constraints that it did not experience previously. For example, if the schedtune -r command has been used to change the way priority is calculated, programs that used to run rather quickly in the background may now be slowed down, while foreground programs have sped up.
Interpretive languages allow programs to be written quickly, but they are not optimized by a compiler. Also, in a language such as perl or awk, it is easy to request an extremely compute- or I/O-intensive operation with only a few characters. It is often worthwhile to perform a desk check or informal peer review of such programs, with emphasis on the number of iterations implied by each operation.
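As an illustrative sketch (the file names and data here are hypothetical, created only so the example is self-contained), a few characters of awk can imply work proportional to the number of stored keys for every input line:

```shell
# Hypothetical input files, created so the example runs as written.
printf 'alpha\nbeta\n' > /tmp/keys.txt
printf 'alpha one\ngamma two\nbeta three\n' > /tmp/big.log

# The inner "for (k in keys)" loop rescans every key for every line of
# big.log -- total work is (number of keys) * (number of lines), even
# though the one-liner looks cheap.
awk 'NR==FNR { keys[$1]; next }
     { for (k in keys) if (index($0, k)) { print; next } }' \
    /tmp/keys.txt /tmp/big.log
```

A desk check of such a script would note that each stored key multiplies the cost of every subsequent input line, which is exactly the kind of hidden iteration count the review should surface.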
The file system uses some of system memory to hold pages of files for future reference. If a disk-limited program is run twice in quick succession, it will normally run faster the second time than the first. Similar phenomena might be observed with programs that use NFS. This can also occur with large programs, such as compilers. The program's algorithm might not be disk-limited, but the time needed to load a large executable program might make the first execution of the program much longer than subsequent ones.
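A rough way to observe the caching effect (the scratch-file path and size are arbitrary) is to time the same read-heavy command twice and compare the reported real times:

```shell
# Create a 1 MB scratch file so the example is self-contained.
FILE=/tmp/cache_demo.dat
dd if=/dev/zero of="$FILE" bs=1024 count=1024 2>/dev/null

time wc -c < "$FILE"    # first run: may have to read the file from disk
time wc -c < "$FILE"    # second run: usually served from cached file pages
```

On a lightly loaded system the second real time is typically noticeably smaller, while user and sys times stay about the same; the difference is the disk I/O that the file cache absorbed.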
Identifying the Performance-Limiting Resource describes techniques for finding the bottleneck.
Most people have experienced the rush-hour slowdown that occurs because a large number of people in the organization habitually use the system at one or more particular times each day. This phenomenon is not always simply due to a concentration of load. Sometimes it is an indication of an imbalance that is (at present) only a problem when the load is high. Other sources of recurring situations in the system should be considered.
If you find that the problem stems from conflict between foreground activity and long-running, CPU-intensive programs that are, or should be, run in the background, consider using the command schedtune -r -d to give the foreground higher priority. See Tuning the Thread-Priority-Value Calculation.
The best tool for this situation is an overload detector, such as the filtd program (a component of PTX). The filtd daemon can be set up to execute shell scripts or collect specific information when a particular condition is detected. You can construct a similar, but more specialized, mechanism using shell scripts containing the vmstat, iostat, netstat, sar, and ps commands.
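A minimal sketch of such a home-grown detector follows; the run-queue threshold and log path are illustrative choices, not recommendations, and the script would typically be run periodically (for example, from cron):

```shell
#!/bin/sh
# Snapshot system state when the run queue exceeds a threshold.
LIMIT=8

# vmstat's first column (r) is the run queue; take the second sample,
# because the first line reports averages since system start.
RUNQ=$(vmstat 1 2 2>/dev/null | tail -1 | awk '{print $1}')
case "$RUNQ" in
    ''|*[!0-9]*) RUNQ=0 ;;   # fall back to 0 if vmstat is unavailable
esac

if [ "$RUNQ" -gt "$LIMIT" ]; then
    # Collect the snapshot commands' output for later analysis.
    { date; ps -ef; iostat; netstat -i; } >> /tmp/overload.log
    echo "overload recorded: run queue $RUNQ"
else
    echo "load ok: run queue $RUNQ"
fi
```

The same pattern extends naturally to other triggers, such as paging activity from vmstat or per-disk utilization from iostat.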
If the problem is local to a single system in a distributed environment, there is probably a pathological program at work, or perhaps two that intersect randomly.
Sometimes a system seems to "single out" an individual.
Quantify the problem by timing the critical programs under the problem user ID, for example:

# time cp .profile testjunk
real    0m0.08s
user    0m0.00s
sys     0m0.01s
Then run them under a satisfactory user ID. Is there a difference in the reported real time?
There are some common problems that arise in the transition from independent systems to distributed systems. The problems usually result from the need to get a new configuration running as soon as possible, or from a lack of awareness of the cost of certain functions. In addition to tuning the LAN configuration in terms of maximum transmission units (MTU) and mbufs (see Chapter 9. Monitoring and Tuning Communications I/O Use), look for LAN-specific pathologies or nonoptimal situations that may have evolved through a sequence of individually reasonable decisions.
When a broadcast storm occurs, even systems that are not actively using the network can be slowed by the incessant interrupts and by the CPU resource consumed in receiving and processing the packets. These problems are better detected and localized with LAN analysis devices than with the normal performance tools.
Using a system as a router consumes large amounts of CPU time to process and copy packets. It is also subject to interference from other work being processed by the system. Dedicated hardware routers and bridges are usually a more cost-effective and robust solution to the need to connect LANs.
At some stages in the development of distributed configurations, NFS mounts are used to give users on new systems access to their home directories on their original systems. This simplifies the initial transition but imposes a continuing data-communication cost. It is not uncommon for users on system A to work primarily with data on system B, and vice versa.
Access to files through NFS imposes a considerable cost in LAN traffic, client and server CPU time, and end-user response time. A general guideline is that user and data should normally be on the same system. The exceptions are those situations in which an overriding concern justifies the extra expense and time of remote data. Some examples are a need to centralize data for more reliable backup and control, or a need to ensure that all users are working with the most current version of a program.
If these and other needs dictate a significant level of NFS client-server interchange, it is better to dedicate a system to the role of server than to have a number of systems that are part-server, part-client.
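As a quick way to gauge how far a client already depends on NFS (the grep pattern is a rough heuristic, since remote mounts appear in the mount table with a host:/path device):

```shell
# Count currently mounted NFS file systems.
NFS_MOUNTS=$(mount | grep -ci nfs)
echo "NFS mounts: $NFS_MOUNTS"

# List them, if any, to see which directories cross the network;
# grep exits quietly with no output when there are none.
mount | grep -i nfs
```

A client whose home directories or frequently used program files all show up in this list is a candidate for the user-and-data-on-the-same-system guideline above.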
The simplest method of porting a program into a distributed environment is to replace program calls with RPCs on a 1:1 basis. Unfortunately, the disparity in performance between local program calls and RPCs is even greater than the disparity between local disk I/O and NFS I/O. Assuming that the RPCs are really necessary, they should be batched whenever possible.
If everything that uses a particular device or service slows down at times, refer to the topic that covers that particular device or service.
Make sure you have followed the configuration recommendations in the appropriate subsystem manual and the recommendations in the appropriate "Monitoring and Tuning" chapter of this book.