[ Previous | Next | Table of Contents | Index | Library Home | Legal | Search ]

Performance Management Guide


Monitoring CPU Use

The processing unit is one of the fastest components of the system. It is comparatively rare for a single program to keep the CPU 100 percent busy (that is, 0 percent idle and 0 percent wait) for more than a few seconds at a time. Even in heavily loaded multiuser systems, there are occasional 10 milliseconds (ms) periods that end with all threads in a wait state. If a monitor shows the CPU 100 percent busy for an extended period, there is a good chance that some program is in an infinite loop. Even if the program is "merely" expensive, rather than broken, it needs to be identified and dealt with.

The vmstat Command (CPU)

The first tool to use is the vmstat command, which quickly provides compact information about various system resources and their related performance problems. The vmstat command reports statistics about kernel threads in the run and wait queue, memory, paging, disks, interrupts, system calls, context switches, and CPU activity. The reported CPU activity is a percentage breakdown of user mode, system mode, idle time, and waits for disk I/O.

Note: If the vmstat command is used without any options or only with the interval and optionally, the count parameter, such as vmstat 2 10; then the first line of numbers is an average since system reboot.

As a CPU monitor, the vmstat command is superior to the iostat command in that its one-line-per-report output is easier to scan as it scrolls and there is less overhead involved if there are a lot of disks attached to the system. The following example can help you identify situations in which a program has run away or is too CPU-intensive to run in a multiuser environment.

# vmstat 2
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 22478  1677   0   0   0   0    0   0 188 1380 157 57 32  0 10
 1  0 22506  1609   0   0   0   0    0   0 214 1476 186 48 37  0 16
 0  0 22498  1582   0   0   0   0    0   0 248 1470 226 55 36  0  9
 
 2  0 22534  1465   0   0   0   0    0   0 238  903 239 77 23  0  0
 2  0 22534  1445   0   0   0   0    0   0 209 1142 205 72 28  0  0
 2  0 22534  1426   0   0   0   0    0   0 189 1220 212 74 26  0  0
 3  0 22534  1410   0   0   0   0    0   0 255 1704 268 70 30  0  0
 2  1 22557  1365   0   0   0   0    0   0 383  977 216 72 28  0  0
 
 2  0 22541  1356   0   0   0   0    0   0 237 1418 209 63 33  0  4
 1  0 22524  1350   0   0   0   0    0   0 241 1348 179 52 32  0 16
 1  0 22546  1293   0   0   0   0    0   0 217 1473 180 51 35  0 14

This output shows the effect of introducing a program in a tight loop to a busy multiuser system. The first three reports (the summary has been removed) show the system balanced at 50-55 percent user, 30-35 percent system, and 10-15 percent I/O wait. When the looping program begins, all available CPU cycles are consumed. Because the looping program does no I/O, it can absorb all of the cycles previously unused because of I/O wait. Worse, it represents a process that is always ready to take over the CPU when a useful process relinquishes it. Because the looping program has a priority equal to that of all other foreground processes, it will not necessarily have to give up the CPU when another process becomes dispatchable. The program runs for about 10 seconds (five reports), and then the activity reported by the vmstat command returns to a more normal pattern.

The CPU statistics can be somewhat distorted on systems with very high device-interrupt load. This situation is due to the fact that the tool samples on timer interrupts. The timer is the lowest priority device and therefore it can easily be preempted by other interrupts. To eliminate this distortion, operating system versions later than AIX 4.3.3 use a different method to sample the timer.

Note: For SMP systems the us, sy, id and wa columns are only averages over the processors (the sar command can report per-processor statistics). But keep in mind that the I/O wait statistic per processor is not really a processor-specific statistic; it is a global statistic. An I/O wait is distinguished from idle time only by the state of a pending I/O. If there is any pending disk I/O, and the processor is not busy, then it is an I/O wait time. AIX 4.3.3 and later contains an enhancement to the method used to compute the percentage of CPU time spent waiting on disk I/O (wio time). See Wait I/O Time Reporting for more details.

Optimum use would have the CPU working 100 percent of the time. This holds true in the case of a single-user system with no need to share the CPU. Generally, if us + sy time is below 90 percent, a single-user system is not considered CPU constrained. However, if us + sy time on a multiuser system exceeds 80 percent, the processes may spend time waiting in the run queue. Response time and throughput might suffer.

To check if the CPU is the bottleneck, consider the four cpu columns and the two kthr (kernel threads) columns in the vmstat report. It may also be worthwhile looking at the faults column:

The iostat Command

The iostat command is the fastest way to get a first impression, whether or not the system has a disk I/O-bound performance problem (see Assessing Disk Performance with the iostat Command). The tool also reports CPU statistics.

The following example shows a part of an iostat command output. The first stanza shows the summary statistic since system startup.

# iostat -t 2 6
tty:      tin         tout   avg-cpu:  % user    % sys     % idle    % iowait
          0.0          0.8               8.4      2.6       88.5       0.5
          0.0         80.2               4.5      3.0       92.1       0.5
          0.0         40.5               7.0      4.0       89.0       0.0
          0.0         40.5               9.0      2.5       88.5       0.0
          0.0         40.5               7.5      1.0       91.5       0.0
          0.0         40.5              10.0      3.5       80.5       6.0

The CPU statistics columns (% user, % sys, % idle, and % iowait) provide a breakdown of CPU usage. This information is also reported in the vmstat command output in the columns labeled us, sy, id, and wa. For a detailed explanation for the values, see The vmstat Command. Also note the change made to %iowait described in Wait I/O Time Reporting.

The sar Command

The sar command gathers statistical data about the system. Though it can be used to gather some useful data regarding system performance, the sar command can increase the system load that can exacerbate a pre-existing performance problem if the sampling frequency is high. But compared to the accounting package, the sar command is less intrusive. The system maintains a series of system activity counters which record various activities and provide the data that the sar command reports. The sar command does not cause these counters to be updated or used; this is done automatically regardless of whether or not the sar command runs. It merely extracts the data in the counters and saves it, based on the sampling rate and number of samples specified to the sar command.

With its numerous options, the sar command provides queuing, paging, TTY, and many other statistics. One important feature of the sar command is that it reports either systemwide (global among all processors) CPU statistics (which are calculated as averages for values expressed as percentages, and as sums otherwise), or it reports statistics for each individual processor. Therefore, this command is particularly useful on SMP systems.

There are three situations to use the sar command:

Real-time sampling and display

To collect and display system statistic reports immediately, use the following command:

# sar -u 2 5
 
AIX texmex 3 4 000691854C00    01/27/00
 
17:58:15    %usr    %sys    %wio   %idle
17:58:17      43       9       1      46
17:58:19      35      17       3      45
17:58:21      36      22      20      23
17:58:23      21      17       0      63
17:58:25      85      12       3       0
 
Average       44      15       5      35

This example is from a single user workstation and shows the CPU utilization.

Display previously captured data

The -o and -f options (write and read to/from user given data files) allow you to visualize the behavior of your machine in two independent steps. This consumes less resources during the problem-reproduction period. You can use a separate machine to analyze the data by transferring the file because the collected binary file keeps all data the sar command needs.

# sar -o /tmp/sar.out 2 5 > /dev/null

The above command runs the sar command in the background, collects system activity data at 2-second intervals for 5 intervals, and stores the (unformatted) sar data in the /tmp/sar.out file. The redirection of standard output is used to avoid a screen output.

The following command extracts CPU information from the file and outputs a formatted report to standard output:

# sar -f/tmp/sar.out
 
AIX texmex 3 4 000691854C00    01/27/00
 
18:10:18    %usr    %sys    %wio   %idle
18:10:20       9       2       0      88
18:10:22      13      10       0      76
18:10:24      37       4       0      59
18:10:26       8       2       0      90
18:10:28      20       3       0      77
 
Average       18       4       0      78

The captured binary data file keeps all information needed for the reports. Every possible sar report could therefore be investigated. This also allows to display the processor-specific information of an SMP system on a single processor system.

System activity accounting via cron daemon

The sar command calls a process named sadc to access system data. Two shell scripts (/usr/lib/sa/sa1 and /usr/lib/sa/sa2) are structured to be run by the cron daemon and provide daily statistics and reports. Sample stanzas are included (but commented out) in the /var/spool/cron/crontabs/adm crontab file to specify when the cron daemon should run the shell scripts.

The following lines show a modified crontab for the adm user. Only the comment characters for the data collections were removed:

#=================================================================
#      SYSTEM ACTIVITY REPORTS
#  8am-5pm activity reports every 20 mins during weekdays.
#  activity reports every an hour on Saturday and Sunday.
#  6pm-7am activity reports every an hour during weekdays.
#  Daily summary prepared at 18:05.
#=================================================================
0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 3600 -ubcwyaqvm &
#=================================================================

Collection of data in this manner is useful to characterize system usage over a period of time and to determine peak usage hours.

Useful CPU Options

The most useful CPU-related options for the sar command are:

The xmperf Program

Using the xmperf program displays CPU use as a moving skyline chart. The xmperf program is described in detail in the Performance Toolbox Version 2 and 3 for AIX: Guide and Reference.


[ Previous | Next | Table of Contents | Index | Library Home | Legal | Search ]