Performance Management Guide
Responding to PDT Report Messages
PDT identifies many types of problems. Responses to these indications depend
on the individual organization's available resources and set of priorities.
The following samples suggest some possibilities (cmds stands for commands):
- Problem:
- JFS file system becomes unavailable
- Response:
- Investigate why the file system is unavailable. The file system could have
been removed.
- Useful cmds:
- lsfs (to determine file system status)
- Problem:
- JFS file system nearly full
- Response:
- This problem could be caused by large files or core files within the file
system. Look for large files in the file system, possibly caused by a runaway
process, and attempt to identify the process or user that generated them.
Also check whether the PDT report indicates a long-term growth trend for this
file system (examine the rest of the PDT report or past PDT reports).
- Useful cmds:
- du, ls
- Problem:
- Physical volume not allocated to a volume group
- Response:
- If a physical volume is not allocated to a volume group, the operating
system has no access to this disk, and its space is being wasted. Use the
lspv command to confirm that the disk is not allocated to any volume group;
if it is not, use the extendvg command to add the disk to a volume group.
- Useful cmds:
- lspv (to confirm that the volume is not allocated)
smitty (to manipulate volume groups)
- Problem:
- All paging spaces defined on one physical volume
- Response:
- The system has more than one physical volume, yet all paging space is
defined on a single volume. If the system experiences paging, this configuration
will result in reduced performance. Better I/O throughput can be achieved
if the paging space is split equally among all physical volumes. Only one
paging space should be defined per physical volume because the system will
only have access to one at a time. Use SMIT to create, modify, activate, or
deactivate the paging areas.
- Useful cmds:
- smitty (to modify paging spaces)
- Problem:
- Apparently too little memory for current workload
- Response:
- If the system is paging heavily, more memory may be required on the
system for good performance. The vmstat and svmon commands provide further
details about the paging activity. See The vmstat Command and The svmon
Command for more information on those commands.
- Useful cmds:
- lsps -a, vmstat, svmon
- Problem:
- Page space nearly full
- Response:
- The system's paging space may be undersized and may need to
be enlarged, unless the problem is due to a process with a memory leak, in
which case that process should be identified and the application fixed. For
systems with up to 256 MB of memory, the paging space should be twice the size
of real memory. For systems with more than 256 MB of memory, use the following formula:
total paging space = 512 MB + (memory size - 256 MB) * 1.25
- Useful cmds:
- ps aucg (to examine process activity)
smitty (to modify page space characteristics)
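The paging space sizing rule in this response can be sketched as a small Python helper. This is only an illustration of the arithmetic: the function name recommended_paging_space_mb is hypothetical (it is not part of PDT or AIX), and all sizes are assumed to be expressed in MB.

```python
def recommended_paging_space_mb(memory_mb: float) -> float:
    """Return the suggested total paging space (MB) for a real memory size (MB).

    Hypothetical helper that encodes the guideline stated in the text.
    """
    if memory_mb <= 0:
        raise ValueError("memory size must be positive")
    if memory_mb <= 256:
        # Up to 256 MB of memory: twice the size of real memory.
        return 2 * memory_mb
    # More than 256 MB: 512 MB + (memory size - 256 MB) * 1.25
    return 512 + (memory_mb - 256) * 1.25


if __name__ == "__main__":
    for mem in (128, 256, 1024):
        print(mem, recommended_paging_space_mb(mem))
```

Both branches agree at 256 MB (512 MB of paging space), so the guideline is continuous at the boundary. Compare the result with the total reported by lsps -a before deciding whether to enlarge the paging space.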
- Problem:
- Possible problems in the settings of load control parameters
- Response:
- The memory-load-control parameters are evaluated in relation to current
paging activity. For example, if thrashing is occurring and load control is
not enabled, it may be appropriate to enable load control. This situation
might also be due to inappropriate load-control parameter settings. Use the
schedtune command to view or alter the configuration. Refer to Tuning VMM
Memory Load Control with the schedtune Command for further details.
- Useful cmds:
- schedtune
- Problem:
- VMM-detected bad memory frames
- Response:
- It might be necessary to have the memory analyzed. Compare the amount
of installed memory with the memory actually accessible. If the latter is
less than the former, then bad memory has been identified.
You can use /usr/sbin/perf/diag_tool/getvmparms to examine
the value of numframes to determine the actual number of valid 4 KB memory
frames.
- Useful cmds:
- lscfg | grep mem (to obtain installed memory size
in MB)
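The comparison between installed and accessible memory can be sketched as follows. This is a minimal illustration under stated assumptions: the numframes value is assumed to have been read by hand from the getvmparms output, the installed size is assumed to have been read from lscfg, and the function names are hypothetical.

```python
FRAME_SIZE_KB = 4  # each valid memory frame is 4 KB, per the text


def accessible_memory_mb(numframes: int) -> float:
    """Convert a count of valid 4 KB frames into MB of accessible memory."""
    return numframes * FRAME_SIZE_KB / 1024


def missing_memory_mb(installed_mb: float, numframes: int) -> float:
    """Return installed minus accessible memory in MB (never negative).

    A positive result suggests that some frames were marked bad.
    """
    return max(0.0, installed_mb - accessible_memory_mb(numframes))


if __name__ == "__main__":
    # Illustrative values only: 256 MB installed, 61440 valid frames.
    print(missing_memory_mb(256, 61440))
```

A nonzero result is a reason to have the memory analyzed, not proof of a specific faulty module.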
- Problem:
- Any host in .nodes becomes unreachable
- Response:
- Determine whether the problem is with the current host (has a change been
made to the /etc/hosts file?), with the remote host (is it down?), or with
the network (is the nameserver down?).
The problem may be due to name resolution. Check either the Domain Name
Service (DNS) configuration files or /etc/hosts, depending on the type of
name resolution being used in the environment.
Use the ping command to check whether the machine has access
to other nodes in the same network. The remote node might be down. If the
machine cannot access any other node, verify cables and connections, as well
as the routing table of the current machine. Use the netstat -r command to
display the routing table.
- Useful cmds:
- ping, netstat
- Problem:
- Imbalance in the I/O configuration (number of disks
per adapter)
- Response:
- The number of disks per adapter should be equal whenever possible, to
prevent one adapter from being overloaded. A guideline is to have no more
than four devices per adapter, especially if the access to the disks is mostly
sequential. Consider moving disks around so that an individual adapter is
not overloaded.
- Useful cmds:
- lscfg (to examine the current configuration)
iostat (to determine if the actual load on the adapters
is out of balance)
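The four-devices-per-adapter guideline can be checked with a short Python sketch. The disk-to-adapter mapping is assumed to have been assembled by hand from lscfg output; the function name and the device names in the example are hypothetical.

```python
from collections import Counter
from typing import Dict

MAX_DEVICES_PER_ADAPTER = 4  # guideline from the text


def overloaded_adapters(disk_to_adapter: Dict[str, str]) -> Dict[str, int]:
    """Return adapters that have more devices than the guideline allows."""
    counts = Counter(disk_to_adapter.values())
    return {adapter: n for adapter, n in counts.items()
            if n > MAX_DEVICES_PER_ADAPTER}


if __name__ == "__main__":
    # Illustrative configuration: five disks on scsi0, one on scsi1.
    config = {"hdisk0": "scsi0", "hdisk1": "scsi0", "hdisk2": "scsi0",
              "hdisk3": "scsi0", "hdisk4": "scsi0", "hdisk5": "scsi1"}
    print(overloaded_adapters(config))
```

A device count is only a first indication; iostat shows whether the actual load on the adapters is out of balance.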
- Problem:
- Imbalance in allocation of paging space on physical
volumes with paging space
- Response:
- A substantial imbalance in the sizes of paging spaces can cause performance
problems. The paging space should be equally distributed throughout the disks.
Consider making paging spaces the same size, except for a few extra megabytes
on the primary paging space (hd6).
- Useful cmds:
- smitty pgsp
- Problem:
- Fragmentation of a paging space in a volume group
- Response:
- Paging performance is better if paging areas are contiguous on a physical
volume. However, when paging areas are enlarged, it is possible to create
fragments that are scattered across the disk surface. Use the reorgvg command to reorganize the paging spaces.
- Useful cmds:
- lspv -p hdiskn for each
physical volume in the volume group. Look for more than one PP Range with
the same LVNAME and a TYPE of paging.
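The fragmentation check described above (more than one PP range with the same LVNAME and a TYPE of paging) can be sketched in Python. Parsing the lspv -p output is deliberately left out: the rows are assumed to have been extracted already as (PP range, LV name, type) tuples, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def fragmented_paging_lvs(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[str]]:
    """Return paging logical volumes that occupy more than one PP range.

    Each row is (pp_range, lv_name, lv_type), taken from one physical volume.
    """
    ranges: Dict[str, List[str]] = defaultdict(list)
    for pp_range, lv_name, lv_type in rows:
        if lv_type == "paging":
            ranges[lv_name].append(pp_range)
    return {lv: r for lv, r in ranges.items() if len(r) > 1}


if __name__ == "__main__":
    # Illustrative rows only, not real lspv output.
    rows = [("49-51", "hd6", "paging"), ("52-60", "hd2", "jfs"),
            ("61-64", "hd6", "paging"), ("65-70", "paging00", "paging")]
    print(fragmented_paging_lvs(rows))
```

Run the check once per physical volume in the volume group, since a paging space is reported separately on each disk.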
- Problem:
- Significant imbalance in measured I/O load to physical
volumes
- Response:
- The data is most likely not well distributed across the disks. Use
the iostat command to obtain information about the I/O
activity of each disk (refer to Assessing Disk Performance
with the iostat Command). As a guideline, a disk should not be more than
40 percent utilized over a sustained period.
If one physical volume seems
to be getting little I/O activity, consider moving data from busier physical
volumes onto less busy volumes. In general, the more evenly the I/O is distributed,
the better the performance.
Distribute data across the disks in a manner that balances I/O. Use the
filemon command to obtain information about the most accessed files and file
systems. This can be a good starting point in reorganizing the data. Refer
to Detailed I/O Analysis with the filemon Command
for more information on the filemon command.
- Useful cmds:
- iostat -d 2 20 (to view the current distribution
of I/O across physical volumes)
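The 40 percent utilization guideline can be applied to averaged iostat figures with a sketch like the following. It assumes that the per-disk busy percentages have already been collected and averaged over the sampling interval; the function names and the sample values are hypothetical.

```python
from typing import Dict, List

BUSY_GUIDELINE_PCT = 40.0  # utilization guideline from the text


def busy_disks(pct_busy_by_disk: Dict[str, float],
               threshold: float = BUSY_GUIDELINE_PCT) -> List[str]:
    """Return the disks whose average utilization exceeds the threshold."""
    return sorted(disk for disk, pct in pct_busy_by_disk.items()
                  if pct > threshold)


def busy_spread(pct_busy_by_disk: Dict[str, float]) -> float:
    """Return the gap between the busiest and the least busy disk."""
    values = list(pct_busy_by_disk.values())
    return max(values) - min(values)


if __name__ == "__main__":
    # Illustrative averages only.
    sample = {"hdisk0": 72.5, "hdisk1": 3.0, "hdisk2": 40.0}
    print(busy_disks(sample), busy_spread(sample))
```

A large spread together with one or more disks above the guideline suggests moving data from the busy volumes to the idle ones.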
- Problem:
- New process is a heavy consumer of memory or CPU
- Response:
- Top CPU and memory consumers are regularly identified by PDT. If any
of these processes have not been detected before, they are highlighted in
a problem report. Examine these processes for unusual behavior. Note that
PDT simply looks at the process ID. If a known heavy user terminates and
is then restarted (with a different process ID), it will be identified here as a new heavy user.
- Useful cmds:
- ps aucg (to view all processes and their activity)
- Problem:
- Any file in .files exhibits systematic growth (or decline)
in size
- Response:
- Look at the current size. Consider the projected growth rate. What user
or application is generating the data? For example, the /var/adm/wtmp file
can grow without bound. If it gets too large,
login times can increase. In some cases, the solution is to delete the file.
In most cases, it is important to identify the user causing the growth and
work with that user to correct the problem.
- Useful cmds:
- ls -al (to view file/directory sizes)
- Problem:
- Any file system or paging space exhibits systematic
growth (or decline) in space used
- Response:
- Consider the projected growth rate and expected time until exceeding
the available space. Analyze the problem by identifying which user or process
is generating the data. It may be necessary to enlarge the file system (or
page space). On the other hand, the growth may be an undesirable effect (for
example, a process having a memory leak).
- Useful cmds:
- smitty (to manipulate file systems/page spaces)
ps aucg, svmon (to view process virtual memory activity)
filemon (to view file system activity)
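The expected time until the available space is exceeded can be estimated with a simple projection. The sketch below assumes linear growth at a constant daily rate, which is a simplification and not necessarily how PDT computes its trends; the function name is hypothetical.

```python
from typing import Optional


def days_until_full(capacity_mb: float, used_mb: float,
                    growth_mb_per_day: float) -> Optional[float]:
    """Estimate the days left before used space reaches capacity.

    Assumes constant linear growth. Returns None when usage is flat or
    shrinking, and 0.0 when the space is already exhausted.
    """
    if growth_mb_per_day <= 0:
        return None
    remaining = capacity_mb - used_mb
    if remaining <= 0:
        return 0.0
    return remaining / growth_mb_per_day


if __name__ == "__main__":
    # Illustrative values: 1000 MB file system, 700 MB used, 10 MB/day growth.
    print(days_until_full(1000, 700, 10))
```

The estimate helps decide between enlarging the file system or page space and tracking down the user or process that is generating the data.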
- Problem:
- Degradation in ping response time or packet loss percentage
for any host in .nodes
- Response:
- There is probably a performance problem in the host or in the network.
Is the host in question experiencing performance problems? Is the network
having performance problems?
- Useful cmds:
- ping, rlogin, rsh (to time known workloads on remote host)
- Problem:
- A getty process that consumes too much CPU time
- Response:
- Getty processes that use more than just a few percent of the CPU may
be in error. It is possible in certain situations for these processes to consume
system CPU, even though no users are actually logged in. In general, the solution
is to terminate the process.
- Useful cmds:
- ps aucg (to see how much CPU is being used)
- Problem:
- A process that is a top consumer of CPU or memory resources
exhibits systematic growth or decline in consumption
- Response:
- Known large consumers of CPU and memory resources are tracked over time
to see if their demands grow. Because these processes are major consumers,
steady growth in their demand is of interest from several perspectives. If
the growth is normal, it provides useful capacity planning information. If
the growth is unexpected, evaluate the workload for a change (or a chronic
problem, such as a memory leak). Use the vmstat and svmon
commands while the process is running to gather more information on its behavior.
- Useful cmds:
- ps aucg, vmstat, svmon
- Problem:
- maxuproc indicated as being possibly too low for a
particular userid
- Response:
- It is likely that this user is reaching the maxuproc threshold.
maxuproc is a systemwide parameter that limits the number of processes that nonroot users
are allowed to have simultaneously active. If the limit is too low, the user's
work can be delayed or terminated. On the other hand, the user might be accidentally
creating more processes than needed or appropriate. Further investigation
is warranted in either case. Consult the user in order to clearly understand
what is happening.
- Useful cmds:
- lsattr -E -l sys0 | grep maxuproc to determine
the current value of maxuproc (although it is also reported
directly in the PDT message).
chdev -l sys0 -a maxuproc=100 to change maxuproc to 100
(for example). Root user authority is required.
- Problem:
- A WORKLOAD TRACKING indicator shows an upward trend
- Response:
- The response depends on which workload indicator shows the trend:
- loadavg
- Refers to the 15-minute load average. In general, an upward trend indicates that the
level of contention in the system is growing. Examine the rest of the PDT
report for indicators of system bottlenecks (for example, substantial page
space use might indicate a memory shortage; I/O imbalances might indicate
that the I/O subsystem requires attention).
- nusers
- Shows that the number of logged-on users on the system is growing. This
is important from a capacity planning perspective. Is the growth expected?
Can it be explained?
- nprocesses
- Indicates that the total number of processes on the system is growing.
Are there users reaching the maxuproc limitation? Perhaps
there are "runaway" applications forking too many processes.
- STAT_A
- Number of active processes. A trend here indicates processes are spending
more time waiting for the CPU.
- STAT_W
- Number of swapped processes. A trend here indicates that processes are
contending excessively for memory.
- STAT_Z
- Number of zombie processes. Zombies should not stay around for a long
period of time. If the number of zombies on a system is growing, this may
be cause for concern.
- STAT_I
- Number of idle processes.
- STAT_T
- Number of processes stopped after receiving a signal. A trend here might
indicate a programming error.
- STAT_x
- Number of processes reported by the ps command
as being in state x, where x is a state not listed in the other STAT_* states.
The interpretation of a trend depends on the meaning of the
character x. Refer to Using the ps Command for more information on the ps command.
- cp
- Time required to copy a 40 KB file. An upward trend in the time to do
a file copy suggests degradation in the I/O subsystem.
- idle_pct_cpu0
- Idle percentage for processor 0. An upward trend in the idle percentage
might indicate increased contention in non-CPU resources such as paging or
I/O. Such an increase suggests the CPU resource is not being well-utilized.
- idle_pct_avg
- Average idle percentage for all processors. An upward trend in the idle
percentage might indicate increased contention in non-CPU resources such as
paging or I/O. Such an increase suggests the CPU resource is not being well-utilized.