Performance Management Guide
PDT identifies many types of problems. Responses to these
indications depend on the individual organization's available resources
and set of priorities. The following samples suggest some possibilities
(cmds stands for commands):
- Problem:
- JFS file system becomes unavailable
- Response:
- Investigate why the file system is unavailable. The file system could
have been removed.
- Useful cmds:
- lsfs (to determine file system status)
- Problem:
- JFS file system nearly full
- Response:
- This problem could be caused by large files or core files within the file
system. Look for large files in the file system, possibly caused by a
runaway process, and attempt to identify the process or user that generated
them. Also check whether this file system shows a long-term growth trend by
examining the rest of the PDT report or past PDT reports.
- Useful cmds:
- du, ls
- Problem:
- Physical volume not allocated to a volume group
- Response:
- If a physical volume is not allocated to a volume group, the operating
system has no access to this disk, and its space is being wasted. Use
the lspv command to confirm that the disk is not allocated to any
volume group; if it is not, use the extendvg command to add the disk
to a volume group.
- Useful cmds:
- lspv (to confirm that the volume is not allocated)
smitty (to manipulate volume groups)
- Problem:
- All paging spaces defined on one physical volume
- Response:
- The system has more than one physical volume, yet all paging space is
defined on a single volume. If the system experiences paging, this
configuration will result in reduced performance. Better I/O
throughput can be achieved if the paging space is split equally among all
physical volumes. Define only one paging space per physical volume, because
the system can access only one of them at a time. Use
SMIT to create, modify, activate, or deactivate the paging areas.
- Useful cmds:
- smitty (to modify paging spaces)
- Problem:
- Apparently too little memory for current workload
- Response:
- If the system is paging heavily, more memory may be required for good
performance. The vmstat and svmon commands provide further details about
the paging activity. See The vmstat Command and The svmon Command for more
information on those commands.
- Useful cmds:
- lsps -a, vmstat, svmon
- Problem:
- Page space nearly full
- Response:
- The system's paging space may be undersized and may need to
be enlarged, unless the problem is due to a process with a memory leak, in
which case that process should be identified and the application fixed.
For systems with up to 256 MB of memory, the paging space should be twice the
size of real memory. For systems with more than 256 MB of memory, use the
following formula:
total paging space = 512 MB + (memory size - 256 MB) * 1.25
- Useful cmds:
- ps aucg (to examine process activity)
smitty (to modify page space characteristics)
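The sizing guideline above can be worked through in a few lines. The following is a minimal sketch only; the function name recommended_paging_space_mb is hypothetical and is not part of PDT or AIX. It applies the two rules exactly as stated in the response text.

```python
def recommended_paging_space_mb(memory_mb):
    """Return the total paging space (in MB) suggested by the guideline.

    Hypothetical helper for illustration; not an AIX or PDT interface.
    """
    if memory_mb <= 256:
        # Up to 256 MB of memory: twice the size of real memory.
        return 2 * memory_mb
    # More than 256 MB: 512 MB + (memory size - 256 MB) * 1.25
    return 512 + (memory_mb - 256) * 1.25


for mem in (128, 256, 1024):
    print(f"{mem} MB memory -> {recommended_paging_space_mb(mem):.0f} MB paging space")
```

For a system with 1024 MB of memory, the formula gives 512 + 768 * 1.25 = 1472 MB. Note that both rules agree at exactly 256 MB (512 MB of paging space).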
- Problem:
- Possible problems in the settings of load control parameters
- Response:
- The memory-load-control parameters are evaluated in relation to current
paging activity. For example, if thrashing is occurring and load
control is not enabled, it may be appropriate to enable load control.
This situation might also be due to inappropriate load-control parameter
settings. Use the schedtune command to view or alter the
configuration. Refer to Tuning VMM Memory Load Control
with the schedtune Command for further details.
- Useful cmds:
- schedtune
- Problem:
- VMM-detected bad memory frames
- Response:
- It might be necessary to have the memory analyzed. Compare the
amount of installed memory with the memory actually accessible. If the
latter is less than the former, then bad memory has been identified.
You can use /usr/sbin/perf/diag_tool/getvmparms to examine the
value of numframes to determine the actual number of valid 4 KB memory
frames.
- Useful cmds:
- lscfg | grep mem (to obtain installed memory size in MB)
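The comparison described above is simple arithmetic once both numbers are in hand. The sketch below assumes you have already read the installed memory size (in MB) and the numframes value yourself; the function names are hypothetical, and nothing here parses the output of lscfg or getvmparms.

```python
FRAME_SIZE_KB = 4  # the text describes numframes as a count of 4 KB frames


def accessible_memory_mb(numframes):
    """Convert a count of valid 4 KB frames into megabytes."""
    return numframes * FRAME_SIZE_KB / 1024


def memory_shortfall_mb(installed_mb, numframes):
    """Return how many MB of installed memory are not accessible (0 if none)."""
    return max(0.0, installed_mb - accessible_memory_mb(numframes))


# Illustrative values only: 256 MB installed, 65024 valid frames reported.
print(memory_shortfall_mb(256, 65024))
```

A nonzero result corresponds to the case in the text where accessible memory is less than installed memory.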
- Problem:
- Any host in .nodes becomes unreachable
- Response:
- Determine whether the problem is with the current host (has the
/etc/hosts file been changed?), with the remote host (is it down?), or
with the network (is the nameserver down?).
The problem may be due to name resolution. Check either the Domain Name
Service (DNS) configuration files or /etc/hosts, depending on the type of
name resolution used in the environment.
Use the ping command to check whether the machine has access to other
nodes in the same network. The remote node might be down. If the machine
cannot access any other node, verify cables and connections, and check the
routing table of the current machine with the netstat -r command.
- Useful cmds:
- ping, netstat
- Problem:
- Imbalance in the I/O configuration (number of disks per
adapter)
- Response:
- The number of disks per adapter should be equal whenever possible, to
prevent one adapter from being overloaded. A guideline is to have no
more than four devices per adapter, especially if the access to the disks is
mostly sequential. Consider moving disks around so that an individual
adapter is not overloaded.
- Useful cmds:
- lscfg (to examine the current configuration)
iostat (to determine if the actual load on the adapters is out of
balance)
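As a rough illustration of the guideline above, the following sketch checks a disk-to-adapter mapping against the four-devices-per-adapter figure. The mapping itself (adapter and disk names) is made up for the example and would have to be assembled by hand from the lscfg output; the function names are hypothetical.

```python
MAX_DEVICES_PER_ADAPTER = 4  # guideline quoted in the text


def overloaded_adapters(disks_by_adapter):
    """Return adapters that carry more devices than the guideline suggests."""
    return sorted(
        adapter
        for adapter, disks in disks_by_adapter.items()
        if len(disks) > MAX_DEVICES_PER_ADAPTER
    )


def disk_count_spread(disks_by_adapter):
    """Return the difference between the most and least populated adapters."""
    counts = [len(disks) for disks in disks_by_adapter.values()]
    return max(counts) - min(counts) if counts else 0


# Illustrative configuration only.
config = {
    "scsi0": ["hdisk0", "hdisk1", "hdisk2", "hdisk3", "hdisk4"],
    "scsi1": ["hdisk5"],
}
print(overloaded_adapters(config), disk_count_spread(config))
```

A count-based check like this says nothing about actual load, which is why the iostat command is still needed to see whether the adapters are out of balance in practice.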
- Problem:
- Imbalance in allocation of paging space on physical volumes with
paging space
- Response:
- A substantial imbalance in the sizes of paging spaces can cause
performance problems. The paging space should be equally distributed
throughout the disks. Consider making paging spaces the same size,
except for a few extra megabytes on the primary paging space (hd6).
- Useful cmds:
- smitty pgsp
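One way to put a number on the imbalance is to compare the largest and smallest paging spaces. This is a sketch under assumptions: the text does not define what counts as "substantial", so the tolerance_mb default below is an arbitrary, hypothetical value, and the paging space names and sizes are illustrative.

```python
def paging_space_spread_mb(sizes_mb):
    """Return the size difference (MB) between the largest and smallest paging space."""
    sizes = list(sizes_mb.values())
    return max(sizes) - min(sizes) if sizes else 0


def is_substantially_imbalanced(sizes_mb, tolerance_mb=64):
    """Flag a configuration whose spread exceeds a chosen tolerance.

    tolerance_mb is an assumption for illustration, not a documented limit.
    """
    return paging_space_spread_mb(sizes_mb) > tolerance_mb


# hd6 is a few megabytes larger than the others, as the text allows.
balanced = {"hd6": 528, "paging00": 512, "paging01": 512}
skewed = {"hd6": 1024, "paging00": 128}
print(is_substantially_imbalanced(balanced), is_substantially_imbalanced(skewed))
```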
- Problem:
- Fragmentation of a paging space in a volume group
- Response:
- Paging performance is better if paging areas are contiguous on a physical
volume. However, when paging areas are enlarged, it is possible to
create fragments that are scattered across the disk surface. Use the
reorgvg command to reorganize the paging spaces.
- Useful cmds:
- lspv -p hdiskn for each physical volume in the
volume group. Look for more than one PP Range with the same LVNAME and
a TYPE of paging.
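The check described above (more than one PP Range with the same LVNAME and a TYPE of paging) can be sketched as follows. The rows are assumed to have been transcribed from lspv -p output into (PP range, LVNAME, TYPE) tuples; the sample rows and the function name are hypothetical, and this sketch does not parse the command output itself.

```python
from collections import Counter


def fragmented_paging_lvs(rows):
    """Return paging logical volumes that appear in more than one PP range.

    rows: iterable of (pp_range, lvname, lv_type) tuples for one physical volume.
    """
    ranges_per_lv = Counter(
        lvname for _pp_range, lvname, lv_type in rows if lv_type == "paging"
    )
    return sorted(lv for lv, count in ranges_per_lv.items() if count > 1)


# Illustrative rows only.
rows = [
    ("1-10", "hd5", "boot"),
    ("11-42", "hd6", "paging"),
    ("43-100", "hd2", "jfs"),
    ("101-116", "hd6", "paging"),
    ("117-148", "paging00", "paging"),
]
print(fragmented_paging_lvs(rows))
```

Run the check once per physical volume in the volume group, as the command note above suggests.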
- Problem:
- Significant imbalance in measured I/O load to physical volumes
- Response:
- The data is most likely not well-distributed throughout the disks.
Use the iostat command to obtain information about the I/O activity
of each disk (refer to Assessing Disk Performance with the
iostat Command). As a guideline, a disk should not be more than 40 percent
utilized over a sustained period of time.
If one physical volume seems to be getting little I/O activity, consider
moving data from busier physical volumes onto less busy volumes. In
general, the more evenly the I/O is distributed, the better the
performance.
Distribute data throughout the disks in a manner that balances I/O. Use
the filemon command to obtain information about the most accessed
files and file systems. This can be a good starting point in
reorganizing the data. Refer to Detailed I/O Analysis
with the filemon Command, for more information on the filemon
command.
- Useful cmds:
- iostat -d 2 20 (to view the current distribution of I/O across
physical volumes)
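To apply the 40 percent guideline to a series of iostat samples, the per-disk busy percentages can be averaged and compared against the threshold. This is a minimal sketch: it assumes the utilization values (for example, from the % tm_act column) have already been collected into a dictionary, and the disk names, sample values, and function names are illustrative.

```python
BUSY_THRESHOLD_PCT = 40.0  # guideline quoted in the text


def average_busy_pct(samples):
    """Average the busy-percentage samples for each disk."""
    return {
        disk: sum(values) / len(values)
        for disk, values in samples.items()
        if values
    }


def overutilized_disks(samples, threshold=BUSY_THRESHOLD_PCT):
    """Return disks whose average utilization exceeds the threshold."""
    averages = average_busy_pct(samples)
    return sorted(disk for disk, pct in averages.items() if pct > threshold)


# Illustrative samples only.
samples = {
    "hdisk0": [72.0, 65.0, 70.0],
    "hdisk1": [3.0, 5.0, 4.0],
    "hdisk2": [38.0, 41.0, 35.0],
}
print(overutilized_disks(samples))
```

In this made-up data, hdisk0 is a candidate to have data moved off it and hdisk1 is a candidate to receive it. Averaging over the interval means a single brief spike (such as the 41 on hdisk2) does not trigger the flag.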
- Problem:
- New process is a heavy consumer of memory or CPU
- Response:
- Top CPU and memory consumers are regularly identified by PDT. If
any of these processes have not been detected before, they are highlighted in
a problem report. Examine these processes for unusual behavior.
Note that PDT simply looks at the process ID. If a known heavy user
terminates and is then restarted (with a different process ID), it will be
identified here as a new heavy user.
- Useful cmds:
- ps aucg (to view all processes and their activity)
- Problem:
- Any file in .files exhibits systematic growth (or decline) in
size
- Response:
- Look at the current size. Consider the projected growth
rate. What user or application is generating the data? For example, the
/var/adm/wtmp file is liable to grow unbounded. If it gets
too large, login times can increase. In some cases, the solution is to
delete the file. In most cases, it is important to identify the user
causing the growth and work with that user to correct the problem.
- Useful cmds:
- ls -al (to view file/directory sizes)
- Problem:
- Any file system or paging space exhibits systematic growth (or
decline) in space used
- Response:
- Consider the projected growth rate and expected time until exceeding the
available space. Analyze the problem by identifying which user or
process is generating the data. It may be necessary to enlarge the file
system (or page space). On the other hand, the growth may be an
undesirable effect (for example, a process having a memory leak).
- Useful cmds:
- smitty (to manipulate file systems/page spaces)
ps aucg, svmon (to view process virtual memory activity)
filemon (to view file system activity)
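The "expected time until exceeding the available space" can be estimated with a simple straight-line projection. The sketch below is an assumption-laden illustration, not the method PDT itself uses: it takes the growth rate from the first and last of a set of (day, used MB) observations and assumes that rate stays constant. All names and figures are hypothetical.

```python
def growth_rate_mb_per_day(observations):
    """Estimate growth from the first and last (day, used_mb) observations."""
    (day0, used0), (day1, used1) = observations[0], observations[-1]
    return (used1 - used0) / (day1 - day0)


def days_until_full(used_mb, capacity_mb, rate_mb_per_day):
    """Return the projected days until the space is exhausted.

    Returns None when usage is flat or shrinking, and 0.0 when already full.
    """
    if rate_mb_per_day <= 0:
        return None
    remaining_mb = capacity_mb - used_mb
    if remaining_mb <= 0:
        return 0.0
    return remaining_mb / rate_mb_per_day


# Illustrative figures: usage grew from 400 MB to 500 MB over 10 days.
rate = growth_rate_mb_per_day([(0, 400), (10, 500)])
print(days_until_full(500, 1000, rate))
```

A short projected time argues for enlarging the file system or paging space soon, or for finding the process responsible if the growth is unintended.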
- Problem:
- Degradation in ping response time or packet loss percentage for any
host in .nodes
- Response:
- There is probably a performance problem in the host or in the
network. Is the host in question experiencing performance problems? Is
the network having performance problems?
- Useful cmds:
- ping, rlogin, rsh (to time known
workloads on remote host)
- Problem:
- A getty process that consumes too much CPU time
- Response:
- Getty processes that use more than just a few percent of the CPU may be in
error. It is possible in certain situations for these processes to
consume system CPU, even though no users are actually logged in. In
general, the solution is to terminate the process.
- Useful cmds:
- ps aucg (to see how much CPU is being used)
- Problem:
- A process that is a top consumer of CPU or memory resources exhibits
systematic growth or decline in consumption
- Response:
- Known large consumers of CPU and memory resources are tracked over time to
see if their demands grow. As major consumers, a steady growth in their
demand is of interest from several perspectives. If the growth is
normal, this represents useful capacity planning information. If the
growth is unexpected, then evaluate the workload for a change (or a chronic
problem, such as a memory leak). Use the vmstat and
svmon commands while the process is running to gather more
information on its behavior.
- Useful cmds:
- ps aucg, vmstat, svmon
- Problem:
- maxuproc indicated as being possibly too low for a particular
userid
- Response:
- It is likely that this user is reaching the maxuproc
threshold.
maxuproc is a systemwide parameter that limits the number of
processes that nonroot users are allowed to have simultaneously active.
If the limit is too low, the user's work can be delayed or
terminated. On the other hand, the user might be accidentally creating
more processes than needed or appropriate. Further investigation is
warranted in either case. Consult the user in order to clearly
understand what is happening.
- Useful cmds:
- lsattr -E -l sys0 | grep maxuproc to determine the current
value of maxuproc (although it is also reported directly in the PDT
message).
chdev -l sys0 -a maxuproc=100 to change maxuproc to 100
(for example). Root user authority is required.
- Problem:
- A WORKLOAD TRACKING indicator shows an upward trend
- Response:
- The response depends on which workload indicator shows the trend:
- loadavg
- Refers to the 15-minute load average. In general, an upward trend indicates
that the level of contention in the system is growing. Examine the rest of the
PDT report for indicators of system bottlenecks (for example, substantial page
space use might indicate a memory shortage; I/O imbalances might indicate
that the I/O subsystem requires attention).
- nusers
- Shows that the number of logged-on users on the system is growing.
This is important from a capacity planning perspective. Is the growth
expected? Can it be explained?
- nprocesses
- Indicates that the total number of processes on the system is
growing. Are there users reaching the maxuproc limitation?
Perhaps there are "runaway" applications forking too many processes.
- STAT_A
- Number of active processes. A trend here indicates processes are
spending more time waiting for the CPU.
- STAT_W
- Number of swapped processes. A trend here indicates that processes
are contending excessively for memory.
- STAT_Z
- Number of zombie processes. Zombies should not stay around for a
long period of time. If the number of zombies on a system is growing,
this may be cause for concern.
- STAT_I
- Number of idle processes.
- STAT_T
- Number of processes stopped after receiving a signal. A trend here
might indicate a programming error.
- STAT_x
- Number of processes reported by the ps command as being in
state x, where x is a state not listed in the other
STAT_* states. The interpretation of a trend
depends on the meaning of the character x. Refer to Using the ps Command
for more information on the ps command.
- cp
- Time required to copy a 40 KB file. An upward trend in the time to
do a file copy suggests degradation in the I/O subsystem.
- idle_pct_cpu0
- Idle percentage for processor 0. An upward trend in the idle
percentage might indicate increased contention in non-CPU resources such as
paging or I/O. Such an increase suggests the CPU resource is not being
well-utilized.
- idle_pct_avg
- Average idle percentage for all processors. An upward trend in the
idle percentage might indicate increased contention in non-CPU resources such
as paging or I/O. Such an increase suggests the CPU resource is not
being well-utilized.