Performance Management Guide

Responding to PDT Report Messages

PDT identifies many types of problems. Responses to these indications depend on the individual organization's available resources and set of priorities. The following samples suggest some possibilities (cmds stands for commands):

Problem:

JFS file system becomes unavailable

Response:

Investigate why the file system is unavailable. The file system could have been removed.

Useful cmds:

lsfs (to determine file system status)

Problem:

JFS file system nearly full

Response:

This problem could be caused by large files or core files within the file system, possibly generated by a runaway process. Look for large files in the file system and attempt to identify the process or user that generated them. Also check whether the file system has exhibited a long-term growth trend (examine the rest of the PDT report or past PDT reports).

Useful cmds:

du, ls
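
As an illustration of the search for large files described above, the following Python sketch lists the largest files under a directory. It is not part of PDT; the function name largest_files and its parameters are hypothetical, and the du and ls commands remain the usual tools for this task.

```python
import os


def _device_of(path):
    """Return the device number of path, or None if it cannot be examined."""
    try:
        return os.lstat(path).st_dev
    except OSError:
        return None


def largest_files(root, count=10):
    """Return up to `count` (size_in_bytes, path) pairs under root, largest first.

    Hypothetical helper: it stays on the file system that contains root,
    so files in other mounted file systems are not reported.
    """
    root_dev = os.stat(root).st_dev
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that live on a different device (mount points).
        dirnames[:] = [d for d in dirnames
                       if _device_of(os.path.join(dirpath, d)) == root_dev]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # the file vanished or cannot be examined; skip it
            found.append((size, path))
    found.sort(reverse=True)
    return found[:count]
```

Calling largest_files on the mount point of the affected file system gives a short list of candidates whose owners can then be checked with ls -l.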

Problem:

Physical volume not allocated to a volume group

Response:

If a physical volume is not allocated to a volume group, the operating system has no access to this disk, and its space is being wasted. Use the lspv command to confirm that the disk is not allocated to any volume group. If it is not, use the extendvg command to add the disk to a volume group.

Useful cmds:

lspv (to confirm that the volume is not allocated)
smitty (to manipulate volume groups)

Problem:

All paging spaces defined on one physical volume

Response:

The system has more than one physical volume, yet all paging space is defined on a single volume. If the system experiences paging, this configuration will result in reduced performance. Better I/O throughput can be achieved if the paging space is split equally among all physical volumes. Define only one paging space per physical volume, because the system will have access to only one of them at a time. Use SMIT to create, modify, activate, or deactivate the paging areas.

Useful cmds:

smitty (to modify paging spaces)

Problem:

Apparently too little memory for current workload

Response:

If the system is paging heavily, more memory may be required for good performance. The vmstat and svmon commands provide further details about the paging activity. See The vmstat Command and The svmon Command for more information on those commands.

Useful cmds:

lsps -a, vmstat, svmon

Problem:

Page space nearly full

Response:

The system's paging space may be undersized and may need to be enlarged, unless the problem is due to a process with a memory leak, in which case that process should be identified and the application fixed. For systems with up to 256 MB of memory, the paging space should be twice the size of real memory. For systems with more than 256 MB of memory, use the following formula:
total paging space = 512 MB + (memory size - 256 MB) * 1.25

Useful cmds:

ps aucg (to examine process activity)
smitty (to modify page space characteristics)
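
The sizing rule in the response above can be written as a small calculation. The following Python sketch is only an illustration of that rule; the function name recommended_paging_space_mb is hypothetical.

```python
def recommended_paging_space_mb(memory_mb):
    """Apply the sizing guideline from the response above (all values in MB).

    Up to 256 MB of real memory: twice the memory size.
    Above 256 MB: 512 MB + (memory size - 256 MB) * 1.25.
    """
    if memory_mb <= 256:
        return 2 * memory_mb
    return 512 + (memory_mb - 256) * 1.25


# Example: a system with 1024 MB of real memory.
print(recommended_paging_space_mb(1024))  # 512 + 768 * 1.25 = 1472.0
```

Both branches agree at 256 MB, where each gives 512 MB.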

Problem:

Possible problems in the settings of load control parameters

Response:

The memory-load-control parameters are evaluated in relation to current paging activity. For example, if thrashing is occurring and load control is not enabled, it may be appropriate to enable load control. This situation might also be due to inappropriate load-control parameter settings. Use the schedtune command to view or alter the configuration. Refer to Tuning VMM Memory Load Control with the schedtune Command for further details.

Useful cmds:

schedtune

Problem:

VMM-detected bad memory frames

Response:

It might be necessary to have the memory analyzed. Compare the amount of installed memory with the memory actually accessible. If the latter is less than the former, then bad memory has been identified.
You can use /usr/sbin/perf/diag_tool/getvmparms to examine the value of numframes to determine the actual number of valid 4 KB memory frames.

Useful cmds:

lscfg | grep mem (to obtain installed memory size in MB)
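
The comparison described in the response above is simple arithmetic: each valid frame is 4 KB, so numframes converts directly to accessible megabytes. The following Python sketch assumes that the installed size (in MB) and the numframes value have already been read by hand from the command output; the function names are hypothetical.

```python
FRAME_SIZE_KB = 4  # each valid memory frame is 4 KB, as stated above


def accessible_memory_mb(numframes):
    """Convert a count of valid 4 KB frames to megabytes."""
    return numframes * FRAME_SIZE_KB / 1024


def missing_memory_mb(installed_mb, numframes):
    """Return how many MB of installed memory are not accessible (0 if none)."""
    return max(0.0, installed_mb - accessible_memory_mb(numframes))


# Example with made-up numbers: 512 MB installed, 130048 valid frames.
print(missing_memory_mb(512, 130048))  # 4.0 MB is not accessible
```

A result greater than zero corresponds to the case in which the accessible memory is less than the installed memory.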

Problem:

Any host in .nodes becomes unreachable

Response:

Determine whether the problem is with the current host (has the /etc/hosts file been changed?), with the remote host (is it down?), or with the network (is the name server down?).
The problem may be due to name resolution. Check either the Domain Name Service (DNS) configuration files or the /etc/hosts file, depending on the type of name resolution used in the environment.
Use the ping command to check whether the machine can reach other nodes in the same network. If it can, the remote node might be down. If it cannot reach any other node, verify the cables and connections, and check the routing table of the current machine with the netstat -r command.

Useful cmds:

ping, netstat

Problem:

Imbalance in the I/O configuration (number of disks per adapter)

Response:

The number of disks per adapter should be equal whenever possible, to prevent one adapter from being overloaded. A guideline is to have no more than four devices per adapter, especially if the access to the disks is mostly sequential. Consider moving disks around so that an individual adapter is not overloaded.

Useful cmds:

lscfg (to examine the current configuration)
iostat (to determine if the actual load on the adapters is out of balance)
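
The guideline above can be checked mechanically once the disks attached to each adapter are known. The following Python sketch assumes the adapter-to-disk mapping has been collected by hand (for example, from the lscfg output); the function name adapter_report, the adapter names, and the disk names are illustrative only.

```python
def adapter_report(disks_by_adapter, max_devices=4):
    """Summarize how disks are spread across adapters.

    disks_by_adapter maps an adapter name to the list of disks attached to it.
    Returns (counts, overloaded, spread), where overloaded lists the adapters
    with more than max_devices disks and spread is the difference between the
    largest and smallest disk counts.
    """
    counts = {adapter: len(disks) for adapter, disks in disks_by_adapter.items()}
    overloaded = sorted(a for a, n in counts.items() if n > max_devices)
    spread = max(counts.values()) - min(counts.values()) if counts else 0
    return counts, overloaded, spread


# Example with made-up names: five disks on one adapter, one on the other.
layout = {
    "scsi0": ["hdisk0", "hdisk1", "hdisk2", "hdisk3", "hdisk4"],
    "scsi1": ["hdisk5"],
}
print(adapter_report(layout))
```

A static count says nothing about the actual load, so the iostat command is still needed to confirm that an adapter is really overloaded.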

Problem:

Imbalance in allocation of paging space on physical volumes with paging space

Response:

A substantial imbalance in the sizes of paging spaces can cause performance problems. The paging space should be equally distributed throughout the disks. Consider making paging spaces the same size, except for a few extra megabytes on the primary paging space (hd6).

Useful cmds:

smitty pgsp

Problem:

Fragmentation of a paging space in a volume group

Response:

Paging performance is better if paging areas are contiguous on a physical volume. However, when paging areas are enlarged, it is possible to create fragments that are scattered across the disk surface. Use the reorgvg command to reorganize the paging spaces.

Useful cmds:

lspv -p hdiskn for each physical volume in the volume group. Look for more than one PP Range with the same LVNAME and a TYPE of paging.
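
The check described above (more than one PP Range with the same LVNAME and a TYPE of paging) can be sketched as follows. The Python example does not parse the lspv output itself; it assumes the relevant columns for one physical volume have already been extracted into (PP range, LV name, type) tuples, and the function name is hypothetical.

```python
from collections import Counter


def fragmented_paging_lvs(rows):
    """Return the paging logical volumes that occupy more than one PP range.

    rows is an iterable of (pp_range, lv_name, lv_type) tuples for ONE
    physical volume, transcribed from the lspv -p listing for that volume.
    """
    range_counts = Counter(
        lv_name for _pp_range, lv_name, lv_type in rows if lv_type == "paging"
    )
    return sorted(name for name, n in range_counts.items() if n > 1)


# Example with made-up values: hd6 appears in two separate PP ranges.
rows = [
    ("1-1", "hd5", "boot"),
    ("42-57", "hd6", "paging"),
    ("58-100", "hd2", "jfs"),
    ("101-116", "hd6", "paging"),
    ("117-132", "paging00", "paging"),
]
print(fragmented_paging_lvs(rows))  # ['hd6']
```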

Problem:

Significant imbalance in measured I/O load to physical volumes

Response:

The data is most likely not well distributed across the disks. Use the iostat command to obtain information about the I/O activity of each disk (refer to Assessing Disk Performance with the iostat Command). As a guideline, a disk should not be more than 40 percent busy over a sustained period.
If one physical volume seems to be getting little I/O activity, consider moving data from busier physical volumes onto less busy volumes. In general, the more evenly the I/O is distributed, the better the performance.
Distribute data throughout the disks in a manner that balances I/O. Use the filemon command to obtain information about the most accessed files and file systems. This can be a good starting point in reorganizing the data. Refer to Detailed I/O Analysis with the filemon Command, for more information on the filemon command.

Useful cmds:

iostat -d 2 20 (to view the current distribution of I/O across physical volumes)
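
The guideline above can be applied to per-disk utilization figures once they have been read from the iostat output. The following Python sketch assumes the busy percentages were collected by hand. The 40 percent limit comes from the response above, while the 5 percent threshold for a nearly idle disk is an arbitrary assumption, and the function name is hypothetical.

```python
def classify_disks(busy_pct, busy_limit=40.0, idle_limit=5.0):
    """Split disks into those busier than busy_limit and those below idle_limit.

    busy_pct maps a disk name to the percentage of time the disk was active.
    idle_limit is an assumed threshold, not a documented guideline.
    """
    busy = sorted(d for d, pct in busy_pct.items() if pct > busy_limit)
    idle = sorted(d for d, pct in busy_pct.items() if pct < idle_limit)
    return busy, idle


# Example with made-up figures.
sample = {"hdisk0": 72.5, "hdisk1": 1.2, "hdisk2": 22.0}
print(classify_disks(sample))  # (['hdisk0'], ['hdisk1'])
```

Disks in the first list are candidates for moving data away from; disks in the second list are candidates for receiving it.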

Problem:

New process is a heavy consumer of memory or CPU

Response:

Top CPU and memory consumers are regularly identified by PDT. If any of these processes have not been detected before, they are highlighted in a problem report. Examine these processes for unusual behavior. Note that PDT simply looks at the process ID. If a known heavy user terminates and is then restarted (with a different process ID), it is identified here as a new heavy user.

Useful cmds:

ps aucg (to view all processes and their activity)

Problem:

Any file in .files exhibits systematic growth (or decline) in size

Response:

Look at the current size and consider the projected growth rate. What user or application is generating the data? For example, the /var/adm/wtmp file can grow without bound. If it gets too large, login times can increase. In some cases, the solution is to delete the file. In most cases, it is important to identify the user causing the growth and work with that user to correct the problem.

Useful cmds:

ls -al (to view file/directory sizes)

Problem:

Any file system or paging space exhibits systematic growth (or decline) in space used

Response:

Consider the projected growth rate and expected time until exceeding the available space. Analyze the problem by identifying which user or process is generating the data. It may be necessary to enlarge the file system (or page space). On the other hand, the growth may be an undesirable effect (for example, a process having a memory leak).

Useful cmds:

smitty (to manipulate file systems/page spaces)
ps aucg, svmon (to view process virtual memory activity)
filemon (to view file system activity)
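
The expected time until the available space is exceeded can be estimated from the current usage and an observed growth rate. The following Python sketch assumes simple linear growth, which is a simplification and not necessarily how PDT computes its projection; the function name and parameters are hypothetical.

```python
def days_until_full(capacity_mb, used_mb, growth_mb_per_day):
    """Estimate the days left before used space reaches capacity.

    Assumes constant (linear) growth. Returns None when usage is flat or
    shrinking, and 0.0 when the space is already exhausted.
    """
    if growth_mb_per_day <= 0:
        return None
    return max(0.0, (capacity_mb - used_mb) / growth_mb_per_day)


# Example with made-up figures: 300 MB free, growing by 15 MB per day.
print(days_until_full(1000, 700, 15))  # 20.0
```

A short estimate suggests enlarging the file system or paging space soon; a long one leaves time to find out which user or process is generating the data.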

Problem:

Degradation in ping response time or packet loss percentage for any host in .nodes

Response:

There is probably a performance problem in the host or in the network. Is the host in question experiencing performance problems? Is the network having performance problems?

Useful cmds:

ping, rlogin, rsh (to time known workloads on remote host)

Problem:

A getty process that consumes too much CPU time

Response:

Getty processes that use more than just a few percent of the CPU may be in error. It is possible in certain situations for these processes to consume system CPU, even though no users are actually logged in. In general, the solution is to terminate the process.

Useful cmds:

ps aucg (to see how much CPU is being used)

Problem:

A process that is a top consumer of CPU or memory resources exhibits systematic growth or decline in consumption

Response:

Known large consumers of CPU and memory resources are tracked over time to see if their demands grow. Because these processes are major consumers, steady growth in their demand is of interest from several perspectives. If the growth is normal, it provides useful capacity planning information. If the growth is unexpected, evaluate the workload for a change or for a chronic problem, such as a memory leak. Use the vmstat and svmon commands while the process is running to gather more information on its behavior.

Useful cmds:

ps aucg, vmstat, svmon

Problem:

maxuproc indicated as being possibly too low for a particular userid

Response:

It is likely that this user is reaching the maxuproc threshold.
maxuproc is a systemwide parameter that limits the number of processes that nonroot users are allowed to have simultaneously active. If the limit is too low, the user's work can be delayed or terminated. On the other hand, the user might be accidentally creating more processes than needed or appropriate. Further investigation is warranted in either case. Consult the user in order to clearly understand what is happening.

Useful cmds:

lsattr -E -l sys0 | grep maxuproc (to determine the current value of maxuproc, although it is also reported directly in the PDT message)
chdev -l sys0 -a maxuproc=100 (to change maxuproc to 100, for example; root user authority is required)
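
To see which users are close to the limit, the number of processes per user can be compared with maxuproc. The following Python sketch assumes the list of process owners has already been gathered (for example, from the ps output); the function name and the 90 percent warning threshold are assumptions made for illustration.

```python
from collections import Counter


def users_near_limit(process_owners, maxuproc, threshold=0.9):
    """Return {user: process_count} for nonroot users near the maxuproc limit.

    process_owners is a list with one user name per running process.
    threshold is an assumed warning fraction of maxuproc, not an AIX setting.
    """
    counts = Counter(owner for owner in process_owners if owner != "root")
    return {user: n for user, n in counts.items() if n >= threshold * maxuproc}


# Example with made-up data: alice has 38 processes and maxuproc is 40.
owners = ["alice"] * 38 + ["bob"] * 5 + ["root"] * 60
print(users_near_limit(owners, maxuproc=40))  # {'alice': 38}
```

The root user is excluded because the limit applies only to nonroot users.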

Problem:

A WORKLOAD TRACKING indicator shows an upward trend

Response:

The response depends on which workload indicator shows the trend:

loadavg: Refers to 15-minute load average. In general, it indicates that the level of contention in the system is growing. Examine the rest of the PDT report for indicators of system bottlenecks (for example, substantial page space use might indicate a memory shortage; I/O imbalances might indicate that the I/O subsystem requires attention).
nusers: Shows that the number of logged-on users on the system is growing. This is important from a capacity planning perspective. Is the growth expected? Can it be explained?
nprocesses: Indicates that the total number of processes on the system is growing. Are there users reaching the maxuproc limitation? Perhaps there are "runaway" applications forking too many processes.
STAT_A: Number of active processes. A trend here indicates processes are spending more time waiting for the CPU.
STAT_W: Number of swapped processes. A trend here indicates that processes are contending excessively for memory.
STAT_Z: Number of zombie processes. Zombies should not stay around for a long period of time. If the number of zombies on a system is growing, this may be cause for concern.
STAT_I: Number of idle processes.
STAT_T: Number of processes stopped after receiving a signal. A trend here might indicate a programming error.
STAT_x: Number of processes reported by the ps command as being in state x, where x is a state not listed in the other STAT_* states. The interpretation of a trend depends on the meaning of the character x. Refer to Using the ps Command, for more information on the ps command.
cp: Time required to copy a 40 KB file. An upward trend in the time to do a file copy suggests degradation in the I/O subsystem.
idle_pct_cpu0: Idle percentage for processor 0. An upward trend in the idle percentage might indicate increased contention in non-CPU resources such as paging or I/O. Such an increase suggests the CPU resource is not being well-utilized.
idle_pct_avg: Average idle percentage for all processors. An upward trend in the idle percentage might indicate increased contention in non-CPU resources such as paging or I/O. Such an increase suggests the CPU resource is not being well-utilized.
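
To check any of these indicators by hand, an upward trend can be estimated by fitting a straight line to a series of equally spaced samples. The following Python sketch uses an ordinary least-squares slope. This is a common technique chosen for illustration and is not necessarily the method PDT uses; the function name is hypothetical.

```python
def trend_slope(samples):
    """Return the least-squares slope of equally spaced samples.

    A positive result means the indicator is rising over the sample period.
    The slope is expressed in indicator units per sample interval.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


# Example with made-up daily values that rise by 1 per day.
print(trend_slope([1, 2, 3, 4]))  # 1.0
```

Because a few noisy samples can produce a misleading slope, use a series covering many days before acting on the result.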