Performance Management Guide

Monitoring Disk I/O

When you are monitoring disk I/O, use the following to determine your course of action:

Find the most active files, file systems, and logical volumes:
- Can "hot" file systems be better located on the physical drive or be spread across multiple physical drives? (lslv, iostat, filemon)
- Are "hot" files local or remote? (filemon)
- Does paging space dominate disk utilization? (vmstat, filemon)
- Is there enough memory to cache the file pages being used by running processes? (vmstat, svmon, vmtune)
- Does the application perform a lot of synchronous (non-cached) file I/O?
Determine file fragmentation:
- Are "hot" files heavily fragmented? (fileplace)
Find the physical volume with the highest utilization:
- Is the type of drive or I/O adapter causing a bottleneck? (iostat, filemon)

Building a Pre-Tuning Baseline

Before you make significant changes in your disk configuration or tuning parameters, it is a good idea to build a baseline of measurements that record the current configuration and performance.

Wait I/O Time Reporting

AIX 4.3.3 and later contain enhancements to the method used to compute the percentage of CPU time spent waiting on disk I/O (wio time). The method used in AIX 4.3.2 and earlier versions of the operating system can, under certain circumstances, give an inflated view of wio time on SMPs. The wio time is reported by the commands sar (%wio), vmstat (wa) and iostat (% iowait).

Another change is that the wa column details the percentage of time the CPU was idle with pending disk I/O to not only local, but also NFS-mounted disks.

Method Used in AIX 4.3.2 and Earlier

At each clock interrupt on each processor (100 times a second per processor), a determination is made as to which of the four categories (usr/sys/wio/idle) to place the last 10 ms of time. If the CPU was busy in usr mode at the time of the clock interrupt, then usr gets the clock tick added into its category. If the CPU was busy in kernel mode at the time of the clock interrupt, then the sys category gets the tick. If the CPU was not busy, a check is made to see if any I/O to disk is in progress. If any disk I/O is in progress, the wio category is incremented. If no disk I/O is in progress and the CPU is not busy, the idle category gets the tick.

The inflated view of wio time results from all idle CPUs being categorized as wio regardless of the number of threads waiting on I/O. For example, a system with just one thread doing I/O could report over 90 percent wio time regardless of the number of CPUs it has.

Method Used in AIX 4.3.3 and Later

The change in AIX 4.3.3 is to only mark an idle CPU as wio if an outstanding I/O was started on that CPU. This method can report much lower wio times when just a few threads are doing I/O and the system is otherwise idle. For example, a system with four CPUs and one thread doing I/O will report a maximum of 25 percent wio time. A system with 12 CPUs and one thread doing I/O will report a maximum of 8.3 percent wio time.
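The difference between the two accounting methods can be illustrated with a small simulation. The following Python sketch is not AIX code: the function names and the per-CPU state strings are assumptions made for illustration. It classifies one clock tick per CPU under each method for an idealized system in which every CPU is idle and a single I/O, started on CPU 0, is outstanding at every tick.

```python
def classify_tick_old(cpu_state, any_disk_io_pending):
    """AIX 4.3.2 and earlier: an idle CPU is charged to wio if ANY disk I/O is pending."""
    if cpu_state in ("usr", "sys"):
        return cpu_state
    return "wio" if any_disk_io_pending else "idle"


def classify_tick_new(cpu_state, io_started_on_this_cpu):
    """AIX 4.3.3 and later: an idle CPU is charged to wio only if it started a pending I/O."""
    if cpu_state in ("usr", "sys"):
        return cpu_state
    return "wio" if io_started_on_this_cpu else "idle"


def wio_percent(ncpus, method):
    """Percentage of ticks charged to wio when all CPUs are idle and one
    thread has an outstanding I/O that was started on CPU 0 (idealized case)."""
    ticks = []
    for cpu in range(ncpus):
        if method == "old":
            ticks.append(classify_tick_old("idle", any_disk_io_pending=True))
        else:
            ticks.append(classify_tick_new("idle", io_started_on_this_cpu=(cpu == 0)))
    return 100.0 * ticks.count("wio") / ncpus


for n in (4, 12):
    print(n, "CPUs: old", round(wio_percent(n, "old"), 1),
          "new", round(wio_percent(n, "new"), 1))
```

Under these assumptions the old method charges every idle CPU to wio, while the new method yields the 25 percent and 8.3 percent maximums quoted above for 4 and 12 CPUs.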

Also, starting with AIX 4.3.3, waiting on I/O to NFS mounted file systems is reported as wait I/O time.

Assessing Disk Performance with the iostat Command

Begin the assessment by running the iostat command with an interval parameter during your system's peak workload period or while running a critical application for which you need to minimize I/O delays. The following shell script runs the iostat command in the background while a copy of a large file runs in the foreground so that there is some I/O to measure:

# iostat 5 3 >io.out &
# cp big1 /dev/null

This example leaves the following three reports in the io.out file:

tty:      tin         tout   avg-cpu:  % user    % sys     % idle    % iowait
          0.0          1.3               0.2      0.6       98.9       0.3

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           0.0       0.3       0.0      29753     48076
hdisk1           0.1       0.1       0.0      11971     26460
hdisk2           0.2       0.8       0.1      91200    108355
cd0              0.0       0.0       0.0          0         0

tty:      tin         tout   avg-cpu:  % user    % sys     % idle    % iowait
          0.8          0.8               0.6      9.7       50.2      39.5

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0          47.0     674.6      21.8       3376        24
hdisk1           1.2       2.4       0.6          0        12
hdisk2           4.0       7.9       1.8          8        32
cd0              0.0       0.0       0.0          0         0

tty:      tin         tout   avg-cpu:  % user    % sys     % idle    % iowait
          2.0          2.0               0.2      1.8       93.4       4.6

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           0.0       0.0       0.0          0         0
hdisk1           0.0       0.0       0.0          0         0
hdisk2           4.8      12.8       3.2         64         0
cd0              0.0       0.0       0.0          0         0

The first report is the summary since the last reboot and shows the overall balance (or, in this case, imbalance) in the I/O to each of the hard disks. hdisk1 was almost idle and hdisk2 received about 63 percent of the total I/O (from Kb_read and Kb_wrtn).
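The 63 percent figure can be reproduced from the Kb_read and Kb_wrtn columns of the first report. The following Python sketch is an illustration only; the dictionary layout and the io_share function name are assumptions, and the numbers are copied from the report above.

```python
# Kb_read and Kb_wrtn from the first (since-boot) iostat report above.
since_boot = {
    "hdisk0": (29753, 48076),
    "hdisk1": (11971, 26460),
    "hdisk2": (91200, 108355),
    "cd0": (0, 0),
}


def io_share(report):
    """Return each disk's percentage of the total KB read plus written."""
    totals = {disk: kb_read + kb_wrtn for disk, (kb_read, kb_wrtn) in report.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No I/O recorded at all: report a zero share for every disk.
        return {disk: 0.0 for disk in totals}
    return {disk: 100.0 * kb / grand_total for disk, kb in totals.items()}


for disk, pct in io_share(since_boot).items():
    print(f"{disk}: {pct:.1f}%")
```

With these figures hdisk2 accounts for roughly 63 percent of all KB transferred, hdisk0 for about a quarter, and hdisk1 for the remainder.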

Note

The system maintains a history of disk activity. If the history is disabled (smitty chgsys -> Continuously maintain DISK I/O history [false]), the following message displays when you run the iostat command:

Disk history since boot not available.

The interval disk I/O statistics are unaffected by this.

The second report shows the 5-second interval during which cp ran. Examine this information carefully. The elapsed time for this cp was about 2.6 seconds. Thus, roughly 2.5 seconds of high I/O dependency are being averaged with roughly 2.5 seconds of idle time to yield the 39.5 percent iowait reported. A shorter interval would have given a more detailed characterization of the command itself, but this example demonstrates what you must consider when you are looking at reports that show average activity across intervals.
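To get a feel for how much an interval average can hide, the following Python sketch estimates the I/O wait percentage while cp was actually running. It is a rough illustration only: it assumes that the wait was spread evenly over about 2.5 seconds and was zero for the rest of the 5-second interval, which the report itself cannot confirm. The function name is hypothetical.

```python
def peak_from_average(avg_pct, interval_s, active_s):
    """Percentage during the active part of an interval, assuming the
    remainder of the interval contributed nothing to the average."""
    return avg_pct * interval_s / active_s


# 39.5 percent averaged over 5 seconds, with about 2.5 seconds of activity
print(peak_from_average(39.5, 5.0, 2.5))
```

Under that assumption, the I/O wait during the copy itself was about twice the reported average.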

TTY Report

The two columns of TTY information (tin and tout) in the iostat output show the number of characters read and written by all TTY devices. This includes both real and pseudo TTY devices. Real TTY devices are those connected to an asynchronous port. Examples of pseudo TTY devices are shells, telnet sessions, and aixterm windows.

Because the processing of input and output characters consumes CPU resources, look for a correlation between increased TTY activity and CPU utilization. If such a relationship exists, evaluate ways to improve the performance of the TTY subsystem. Steps that could be taken include changing the application program, modifying TTY port parameters during file transfer, or perhaps upgrading to a faster or more efficient asynchronous communications adapter.

In Shell Script fastport.sh for Fast File Transfers, you can find the fastport.sh script, which is intended to condition a TTY port for fast file transfers in raw mode; for example, when a FAX machine is to be connected. Using the script might improve CPU performance by a factor of 3 at 38400 baud.

CPU Report

The CPU statistics columns (% user, % sys, % idle, and % iowait) provide a breakdown of CPU usage. This information is also reported in the vmstat command output in the columns labeled us, sy, id, and wa. For a detailed explanation for the values, see The vmstat Command. Also note the change made to % iowait described in Wait I/O Time Reporting.

On systems running one application, high I/O wait percentage might be related to the workload. On systems with many processes, some will be running while others wait for I/O. In this case, the % iowait can be small or zero because running processes "hide" some wait time. Although % iowait is low, a bottleneck can still limit application performance.

If the iostat command indicates that a CPU-bound situation does not exist, and % iowait time is greater than 20 percent, you might have an I/O or disk-bound situation. This situation could be caused by excessive paging due to a lack of real memory. It could also be due to unbalanced disk load, fragmented data or usage patterns. For an unbalanced disk load, the same iostat report provides the necessary information. But for information about file systems or logical volumes, which are logical resources, you must use tools such as the filemon or fileplace commands.

Drive Report

When you suspect a disk I/O performance problem, use the iostat command. To avoid the information about the TTY and CPU statistics, use the -d option. In addition, the disk statistics can be limited to the important disks by specifying the disk names.

Remember that the first set of data represents all activity since system startup.

Disks: Shows the names of the physical volumes. They are either hdisk or cd followed by a number. If physical volume names are specified with the iostat command, only those names specified are displayed.
% tm_act: Indicates the percentage of time that the physical disk was active (bandwidth utilization for the drive) or, in other words, the total time disk requests are outstanding. A drive is active during data transfer and command processing, such as seeking to a new location. The "disk active time" percentage is directly proportional to resource contention and inversely proportional to performance. As disk use increases, performance decreases and response time increases. In general, when the utilization exceeds 70 percent, processes are waiting longer than necessary for I/O to complete because most UNIX processes block (or sleep) while waiting for their I/O requests to complete. Look for busy versus idle drives. Moving data from busy to idle drives can help alleviate a disk bottleneck. Paging to and from disk will contribute to the I/O load.
Kbps: Indicates the amount of data transferred (read or written) to the drive in KB per second. This is the sum of Kb_read plus Kb_wrtn, divided by the seconds in the reporting interval.
tps: Indicates the number of transfers per second that were issued to the physical disk. A transfer is an I/O request through the device driver level to the physical disk. Multiple logical requests can be combined into a single I/O request to the disk. A transfer is of indeterminate size.
Kb_read: Reports the total data (in KB) read from the physical volume during the measured interval.
Kb_wrtn: Shows the amount of data (in KB) written to the physical volume during the measured interval.
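The relationship between these fields can be checked against the hdisk0 line of the second report in the earlier example (3376 KB read, 24 KB written, 674.6 Kbps, 21.8 tps). The following Python sketch is an illustration only; the function names are assumptions, and the 5-second interval is the nominal value given on the command line.

```python
def kbps(kb_read, kb_wrtn, interval_s):
    """KB transferred per second: (Kb_read + Kb_wrtn) / seconds in the interval."""
    return (kb_read + kb_wrtn) / interval_s


def avg_kb_per_transfer(kbps_value, tps):
    """Average size of one transfer in KB; individual transfers vary in size."""
    return kbps_value / tps


print(kbps(3376, 24, 5))                          # computed from the nominal interval
print(round(avg_kb_per_transfer(674.6, 21.8), 1))  # from the reported Kbps and tps
```

The computed rate (680 KB per second) is slightly higher than the reported 674.6, presumably because the actual elapsed interval was a little longer than the nominal 5 seconds. The average transfer size of roughly 31 KB is only an average; as noted above, a single transfer is of indeterminate size.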

Taken alone, there is no unacceptable value for any of the above fields because statistics are too closely related to application characteristics, system configuration, and type of physical disk drives and adapters. Therefore, when you are evaluating data, look for patterns and relationships. The most common relationship is between disk utilization (%tm_act) and data transfer rate (Kbps).

To draw any valid conclusions from this data, you have to understand the application's disk data access patterns such as sequential, random, or combination, as well as the type of physical disk drives and adapters on the system. For example, if an application reads/writes sequentially, you should expect a high disk transfer rate (Kbps) when you have a high disk busy rate (%tm_act). Columns Kb_read and Kb_wrtn can confirm an understanding of an application's read/write behavior. However, these columns provide no information on the data access patterns.

Generally you do not need to be concerned about a high disk busy rate (%tm_act) as long as the disk transfer rate (Kbps) is also high. However, if you get a high disk busy rate and a low disk transfer rate, you may have a fragmented logical volume, file system, or individual file.

Discussions of disk, logical volume and file system performance sometimes lead to the conclusion that the more drives you have on your system, the better the disk I/O performance. This is not always true because there is a limit to the amount of data that can be handled by a disk adapter. The disk adapter can also become a bottleneck. If all your disk drives are on one disk adapter, and your hot file systems are on separate physical volumes, you might benefit from using multiple disk adapters. Performance improvement will depend on the type of access.

To see if a particular adapter is saturated, use the iostat command and add up all the Kbps amounts for the disks attached to that adapter. For maximum aggregate performance, the total of the transfer rates (Kbps) must be below the disk adapter throughput rating. In most cases, use 70 percent of the throughput rating as the limit. In operating system versions later than 4.3.3, the -a or -A option displays this information.
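The adapter check described above amounts to a sum and a comparison. The following Python sketch is an illustration only: the function name, the assignment of all three disks to one adapter, and the 10000 KB per second adapter rating are hypothetical values chosen for the example, not figures from this guide. The Kbps values are taken from the second iostat report shown earlier.

```python
def adapter_load(kbps_by_disk, disks_on_adapter, adapter_rating_kbps, headroom=0.70):
    """Sum the Kbps of the disks on one adapter and compare the total
    with a fraction (default 70 percent) of the adapter's rated throughput."""
    total = sum(kbps_by_disk[disk] for disk in disks_on_adapter)
    limit = headroom * adapter_rating_kbps
    return total, limit, total > limit


# Kbps column from the second iostat report in the earlier example.
interval_kbps = {"hdisk0": 674.6, "hdisk1": 2.4, "hdisk2": 7.9}

# Hypothetical: all three disks share one adapter rated at 10000 KB per second.
total, limit, saturated = adapter_load(
    interval_kbps, ["hdisk0", "hdisk1", "hdisk2"], adapter_rating_kbps=10000
)
print(f"total={total:.1f} KB/s limit={limit:.1f} KB/s saturated={saturated}")
```

Substitute the rating from your adapter's documentation and the actual disk-to-adapter mapping of your system before drawing any conclusion.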

Assessing Disk Performance with the vmstat Command

The iostat command is the better tool for proving that a system is I/O bound. However, the wa column of the vmstat output, discussed in The vmstat Command, can point in that direction. Other indicators of an I/O-bound system are:

The disk xfer part of the vmstat output

To display statistics for specific physical volumes (a maximum of four is allowed), use the following command:

# vmstat hdisk0 hdisk1 1 8
kthr     memory             page              faults        cpu     disk xfer
----  ----------  ----------------------- ------------ ----------- ------
r  b   avm   fre  re  pi  po  fr  sr  cy  in   sy  cs  us sy id wa  1 2 3 4
0  0  3456 27743   0   0   0   0   0   0 131  149  28  0  1 99  0  0 0
0  0  3456 27743   0   0   0   0   0   0 131   77  30  0  1 99  0  0 0
1  0  3498 27152   0   0   0   0   0   0 153 1088  35  1 10 87  2  0 11
0  1  3499 26543   0   0   0   0   0   0 199 1530  38  1 19  0 80  0 59
0  1  3499 25406   0   0   0   0   0   0 187 2472  38  2 26  0 72  0 53
0  0  3456 24329   0   0   0   0   0   0 178 1301  37  2 12 20 66  0 42
0  0  3456 24329   0   0   0   0   0   0 124   58  19  0  0 99  0  0 0
0  0  3456 24329   0   0   0   0   0   0 123   58  23  0  0 99  0  0 0

The disk xfer part provides the number of transfers per second to the specified physical volumes that occurred in the sample interval. One to four physical volume names can be specified. Transfer statistics are given for each specified drive in the order specified. This count represents requests to the physical device. It does not imply an amount of data that was read or written. Several logical requests can be combined into one physical request.

The in column of the vmstat output

This column shows the number of hardware or device interrupts (per second) observed over the measurement interval. Examples of interrupts are disk request completions and the 10 millisecond clock interrupt. Since the latter occurs 100 times per second, the in field is always greater than 100. The vmstat command can also provide more detailed information about system interrupts.

The vmstat -i output

The -i parameter displays the number of interrupts taken by each device since system startup. If you add the interval and, optionally, the count parameter, only the first stanza shows the statistics since startup; each subsequent stanza covers the preceding interval.

# vmstat -i 1 2
priority level    type   count module(handler)
    0       0   hardware     0 i_misc_pwr(a868c)
    0       1   hardware     0 i_scu(a8680)
    0       2   hardware     0 i_epow(954e0)
    0       2   hardware     0 /etc/drivers/ascsiddpin(189acd4)
    1       2   hardware   194 /etc/drivers/rsdd(1941354)
    3      10   hardware 10589024 /etc/drivers/mpsdd(1977a88)
    3      14   hardware 101947 /etc/drivers/ascsiddpin(189ab8c)
    5      62   hardware 61336129 clock(952c4)
   10      63   hardware 13769 i_softoff(9527c)
priority level    type   count module(handler)
    0       0   hardware     0 i_misc_pwr(a868c)
    0       1   hardware     0 i_scu(a8680)
    0       2   hardware     0 i_epow(954e0)
    0       2   hardware     0 /etc/drivers/ascsiddpin(189acd4)
    1       2   hardware     0 /etc/drivers/rsdd(1941354)
    3      10   hardware    25 /etc/drivers/mpsdd(1977a88)
    3      14   hardware     0 /etc/drivers/ascsiddpin(189ab8c)
    5      62   hardware   105 clock(952c4)
   10      63   hardware     0 i_softoff(9527c)

Note

The output will differ from system to system, depending on hardware and software configurations (for example, the clock interrupts may not be displayed in the vmstat -i output although they will be accounted for under the in column in the normal vmstat output). Check for high numbers in the count column and investigate why the corresponding module has to handle so many interrupts.

Assessing Disk Performance with the sar Command

The sar command is a standard UNIX command used to gather statistical data about the system. With its numerous options, the sar command provides queuing, paging, TTY, and many other statistics. With AIX 4.3.3, the sar -d option generates real-time disk I/O statistics.

# sar -d 3 3

AIX konark 3 4 0002506F4C00    08/26/99

12:09:50     device    %busy    avque    r+w/s   blks/s   avwait   avserv

12:09:53     hdisk0      1      0.0        0        5      0.0      0.0
             hdisk1      0      0.0        0        1      0.0      0.0
                cd0      0      0.0        0        0      0.0      0.0

12:09:56     hdisk0      0      0.0        0        0      0.0      0.0
             hdisk1      0      0.0        0        1      0.0      0.0
                cd0      0      0.0        0        0      0.0      0.0

12:09:59     hdisk0      1      0.0        1        4      0.0      0.0
             hdisk1      0      0.0        0        1      0.0      0.0
                cd0      0      0.0        0        0      0.0      0.0


Average      hdisk0      0      0.0        0        3      0.0      0.0
             hdisk1      0      0.0        0        1      0.0      0.0
                cd0      0      0.0        0        0      0.0      0.0

The fields listed by the sar -d command are as follows:

%busy: Portion of time device was busy servicing a transfer request. This is the same as the %tm_act column in the iostat command report.
avque: Average number of requests outstanding from the adapter to the device during that time. There may be additional I/O operations in the queue of the device driver. This number is a good indicator of whether an I/O bottleneck exists.
r+w/s: Number of read/write transfers from or to device. This is the same as tps in the iostat command report.
blks/s: Amount of data transferred per second, in 512-byte units.
avwait: Average number of transactions waiting for service (queue length). Average time (in milliseconds) that transfer requests waited idly on queue for the device. This number is currently not reported and shows 0.0 by default.
avserv: Number of milliseconds per average seek. Average time (in milliseconds) to service each transfer request (includes seek, rotational latency, and data transfer times) for the device. This number is currently not reported and shows 0.0 by default.
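Because blks/s is expressed in 512-byte units while the iostat command reports Kbps, a conversion is needed before the two can be compared. The following Python sketch shows the arithmetic; the function name is an assumption made for illustration.

```python
def blks_to_kb_per_s(blks_per_s):
    """Convert a sar -d blks/s value (512-byte units) to KB per second."""
    return blks_per_s * 512 / 1024


# hdisk0 in the first sar -d sample above reported 5 blks/s
print(blks_to_kb_per_s(5))
```

In other words, a blks/s value is twice the equivalent figure in KB per second.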

Assessing Logical Volume Fragmentation with the lslv Command

The lslv command shows, among other information, the logical volume fragmentation. To check logical volume fragmentation, use the command lslv -l lvname, as follows:

# lslv -l hd2
hd2:/usr
PV                COPIES        IN BAND       DISTRIBUTION
hdisk0            114:000:000   22%           000:042:026:000:046

The COPIES column shows that the logical volume hd2 has only one copy. The IN BAND column shows how well the intrapolicy, an attribute of logical volumes, is followed. The higher the percentage, the better the allocation efficiency. Each logical volume has its own intrapolicy. If the operating system cannot meet this requirement, it chooses the best way to meet the requirements. In our example, there are a total of 114 logical partitions (LPs): 42 LPs are located on middle, 26 LPs on center, and 46 LPs on inner-edge. Since the logical volume intrapolicy is center, the in-band is 22 percent (26 / (42+26+46) is about 0.228, which lslv reports as 22%). The DISTRIBUTION column shows how the physical partitions are placed in each part of the intrapolicy; that is:

edge : middle : center : inner-middle : inner-edge

See Position on Physical Volume for additional information about physical partitions placement.
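The in-band figure can be derived from the DISTRIBUTION field alone. The following Python sketch is an illustration only; the function name and the region labels used as dictionary keys are assumptions, and the use of integer truncation is a guess made to match the 22% shown in the lslv output (26 / 114 is about 22.8 percent).

```python
# Order of the DISTRIBUTION fields, outermost region first.
REGIONS = ("edge", "middle", "center", "inner-middle", "inner-edge")


def in_band_percent(distribution, intrapolicy):
    """Percentage of partitions that lie in the region named by the intrapolicy."""
    counts = [int(field) for field in distribution.split(":")]
    by_region = dict(zip(REGIONS, counts))
    total = sum(counts)
    # Integer division truncates, which is an assumption about how lslv rounds.
    return 100 * by_region[intrapolicy] // total


print(in_band_percent("000:042:026:000:046", "center"))
```

For the hd2 example this reproduces the displayed IN BAND value of 22.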

Assessing Physical Placement of Data with the lslv Command

If the workload shows a significant degree of I/O dependency, you can investigate the physical placement of the files on the disk to determine if reorganization at some level would yield an improvement. To see the placement of the partitions of logical volume hd11 within physical volume hdisk0, use the following:

# lslv -p hdisk0 hd11
hdisk0:hd11:/home/op
USED  USED  USED  USED  USED  USED  USED  USED  USED  USED     1-10
USED  USED  USED  USED  USED  USED  USED                      11-17

USED  USED  USED  USED  USED  USED  USED  USED  USED  USED    18-27
USED  USED  USED  USED  USED  USED  USED                      28-34

USED  USED  USED  USED  USED  USED  USED  USED  USED  USED    35-44
USED  USED  USED  USED  USED  USED                            45-50

USED  USED  USED  USED  USED  USED  USED  USED  USED  USED    51-60
0052  0053  0054  0055  0056  0057  0058                      61-67

0059  0060  0061  0062  0063  0064  0065  0066  0067  0068    68-77
0069  0070  0071  0072  0073  0074  0075                      78-84

Look for the rest of hd11 on hdisk1 with the following:

# lslv -p hdisk1 hd11
hdisk1:hd11:/home/op
0035  0036  0037  0038  0039  0040  0041  0042  0043  0044    1-10
0045  0046  0047  0048  0049  0050  0051                     11-17

USED  USED  USED  USED  USED  USED  USED  USED  USED  USED   18-27
USED  USED  USED  USED  USED  USED  USED                     28-34

USED  USED  USED  USED  USED  USED  USED  USED  USED  USED   35-44
USED  USED  USED  USED  USED  USED                           45-50

0001  0002  0003  0004  0005  0006  0007  0008  0009  0010   51-60
0011  0012  0013  0014  0015  0016  0017                     61-67

0018  0019  0020  0021  0022  0023  0024  0025  0026  0027   68-77
0028  0029  0030  0031  0032  0033  0034                     78-84

From top to bottom, five blocks represent edge, middle, center, inner-middle, and inner-edge, respectively.

A USED indicates that the physical partition at this location is used by a logical volume other than the one specified. A number indicates the logical partition number of the logical volume specified with the lslv -p command.
A FREE indicates that this physical partition is not used by any logical volume. Logical volume fragmentation occurs if logical partitions are not contiguous across the disk.
A STALE physical partition is a physical partition that contains data you cannot use. You can also see the STALE physical partitions with the lspv -m command. Physical partitions marked as STALE must be updated to contain the same information as valid physical partitions. This process, called resynchronization with the syncvg command, can be done at vary-on time, or can be started anytime the system is running. Until the STALE partitions have been rewritten with valid data, they are not used to satisfy read requests, nor are they written to on write requests.

In the previous example, logical volume hd11 is fragmented within physical volume hdisk1, with its first logical partitions (1-34) in the inner-middle and inner-edge regions of hdisk1, while logical partitions 35-51 are in the outer edge region. A workload that accessed hd11 randomly would experience unnecessary I/O wait time because longer seeks might be needed on logical volume hd11. These reports also indicate that there are no free physical partitions on either hdisk0 or hdisk1.

Assessing File Placement with the fileplace Command

To see how the file copied earlier, big1, is stored on the disk, we can use the fileplace command. The fileplace command displays the placement of a file's blocks within a logical volume or within one or more physical volumes.

To determine whether the fileplace command is installed and available, run the following command:

# lslpp -lI perfagent.tools

To display the placement of big1, use the following command:

# fileplace -pv big1

File: big1  Size: 3554273 bytes  Vol: /dev/hd10
Blk Size: 4096  Frag Size: 4096  Nfrags: 868   Compress: no
Inode: 19  Mode: -rwxr-xr-x  Owner: hoetzel  Group: system

  Physical Addresses (mirror copy 1)                            Logical Fragment
  ----------------------------------                            ----------------
  0001584-0001591  hdisk0     8 frags    32768 Bytes,   0.9%    0001040-0001047
  0001624-0001671  hdisk0    48 frags   196608 Bytes,   5.5%    0001080-0001127
  0001728-0002539  hdisk0   812 frags  3325952 Bytes,  93.5%    0001184-0001995

  868 frags over space of 956 frags:  space efficiency = 90.8%
  3 fragments out of 868 possible:  sequentiality = 99.8%

This example shows that there is very little fragmentation within the file, and those are small gaps. We can therefore infer that the disk arrangement of big1 is not significantly affecting its sequential read-time. Further, given that a (recently created) 3.5 MB file encounters this little fragmentation, it appears that the file system in general has not become particularly fragmented.

Occasionally, portions of a file may not be mapped to any blocks in the volume. These areas are implicitly filled with zeroes by the file system. These areas show as unallocated logical blocks. A file that has these holes will show the file size to be a larger number of bytes than it actually occupies (that is, the ls -l command will show a large size, whereas the du command will show a smaller size or the number of blocks the file really occupies on disk).

The fileplace command reads the file's list of blocks from the logical volume. If the file is new, the information may not be on disk yet. Use the sync command to flush the information. Also, the fileplace command will not display NFS remote files (unless the command runs on the server).

Note

If a file has been created by seeking to various locations and writing widely dispersed records, only the pages that contain records will take up space on disk and appear on a fileplace report. The file system does not fill in the intervening pages automatically when the file is created. However, if such a file is read sequentially (by the cp or tar commands, for example) the space between records is read as binary zeroes. Thus, the output of such a cp command can be much larger than the input file, although the data is the same.

Space Efficiency and Sequentiality

Higher space efficiency means files are less fragmented and probably provide better sequential file access. A higher sequentiality indicates that the files are more contiguously allocated, and this will probably be better for sequential file access.

Space efficiency = Total number of fragments used for file storage /
                   (Largest fragment physical address -
                    Smallest fragment physical address + 1)

Sequentiality    = (Total number of fragments -
                    Number of grouped fragments + 1) /
                   Total number of fragments

If you find that your sequentiality or space efficiency values become low, you can use the reorgvg command to improve logical volume utilization and efficiency (see Reorganizing Logical Volumes). To improve file system utilization and efficiency, see Reorganizing File Systems.

In this example, the Largest fragment physical address - Smallest fragment physical address + 1 is: 0002539 - 0001584 + 1 = 956 fragments; total used fragments is: 8 + 48 + 812 = 868; the space efficiency is 868 / 956 (90.8 percent); the sequentiality is (868 - 3 + 1) / 868 = 99.8 percent.
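The same arithmetic can be expressed as a short Python sketch. It is an illustration only: the function names and the representation of each extent as a (first, last) pair of physical fragment addresses are assumptions, and the addresses are copied from the fileplace report for big1.

```python
# (first, last) physical fragment address of each extent in the fileplace report.
extents = [(1584, 1591), (1624, 1671), (1728, 2539)]


def space_efficiency(extents):
    """Fragments used divided by the span from the lowest to the highest address."""
    used = sum(last - first + 1 for first, last in extents)
    span = max(last for _, last in extents) - min(first for first, _ in extents) + 1
    return 100.0 * used / span


def sequentiality(extents):
    """(Total fragments - number of groups + 1) divided by total fragments."""
    total = sum(last - first + 1 for first, last in extents)
    groups = len(extents)
    return 100.0 * (total - groups + 1) / total


print(f"space efficiency = {space_efficiency(extents):.1f}%")
print(f"sequentiality    = {sequentiality(extents):.1f}%")
```

A file stored in one contiguous extent would give a sequentiality of exactly 100 percent with this formula, because the single group cancels the + 1 term.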

Because the total number of fragments used for file storage does not include the indirect blocks location, but the physical address does, the space efficiency can never be 100 percent for files larger than 32 KB, even if the file is located on contiguous fragments.

Assessing Paging Space I/O with the vmstat Command

I/O to and from paging spaces is random, mostly one page at a time. The vmstat reports indicate the amount of paging-space I/O taking place. Both of the following examples show the paging activity that occurs during a C compilation in a machine that has been artificially shrunk using the rmss command. The pi and po (paging-space page-ins and paging-space page-outs) columns show the amount of paging-space I/O (in terms of 4096-byte pages) during each 5-second interval. The first report (summary since system reboot) has been removed. Notice that the paging activity occurs in bursts.

# vmstat 5 8
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 0  1 72379   434   0   0   0   0    2   0 376  192 478  9  3 87  1
 0  1 72379   391   0   8   0   0    0   0 631 2967 775 10  1 83  6
 0  1 72379   391   0   0   0   0    0   0 625 2672 790  5  3 92  0
 0  1 72379   175   0   7   0   0    0   0 721 3215 868  8  4 72 16
 2  1 71384   877   0  12  13  44  150   0 662 3049 853  7 12 40 41
 0  2 71929   127   0  35  30 182  666   0 709 2838 977 15 13  0 71
 0  1 71938   122   0   0   8  32  122   0 608 3332 787 10  4 75 11
 0  1 71938   122   0   0   0   3   12   0 611 2834 733  5  3 75 17

The following "before and after" vmstat -s reports show the accumulation of paging activity. Remember that it is the paging space page ins and paging space page outs that represent true paging-space I/O. The (unqualified) page ins and page outs report total I/O, that is both paging-space I/O and the ordinary file I/O, performed by the paging mechanism. The reports have been edited to remove lines that are irrelevant to this discussion.

# vmstat -s # before
     6602 page ins
     3948 page outs
      544 paging space page ins
     1923 paging space page outs
        0 total reclaims

# vmstat -s # after
     7022 page ins
     4146 page outs
      689 paging space page ins
     2032 paging space page outs
        0 total reclaims

The fact that more paging-space page-ins (689 - 544 = 145) than paging-space page-outs (2032 - 1923 = 109) occurred during the compilation suggests that we had shrunk the system to the point that thrashing begins. Some pages were being repaged because their frames were stolen before their use was complete.

Assessing Overall Disk I/O with the vmstat Command

The technique just discussed can also be used to assess the disk I/O load generated by a program. If the system is otherwise idle, the following sequence:

# vmstat -s >statout
# testpgm
# sync
# vmstat -s >> statout
# egrep "ins|outs" statout

yields a before and after picture of the cumulative disk activity counts, such as:

     5698 page ins
     5012 page outs
        0 paging space page ins
       32 paging space page outs
     6671 page ins
     5268 page outs
        8 paging space page ins
      225 paging space page outs

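During the period when this command (a large C compile) was running, the page ins counter rose by 973 and the page outs counter by 256. Because these unqualified counters already include paging-space I/O, the system read a total of 973 pages (8 from paging space) and wrote a total of 256 pages (193 to paging space).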
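The per-counter differences can be computed mechanically from the egrep output. The following Python sketch is an illustration only: the function names are assumptions, and it relies on the output containing the same counters in the same order for the before and after snapshots, as in the listing above.

```python
STATOUT = """\
     5698 page ins
     5012 page outs
        0 paging space page ins
       32 paging space page outs
     6671 page ins
     5268 page outs
        8 paging space page ins
      225 paging space page outs
"""


def parse_counters(lines):
    """Turn lines such as '  5698 page ins' into {'page ins': 5698}."""
    counters = {}
    for line in lines:
        value, name = line.strip().split(None, 1)
        counters[name] = int(value)
    return counters


def deltas(egrep_output):
    """Difference between the second and the first snapshot for each counter."""
    lines = [line for line in egrep_output.splitlines() if line.strip()]
    half = len(lines) // 2
    before = parse_counters(lines[:half])
    after = parse_counters(lines[half:])
    return {name: after[name] - before[name] for name in before}


for name, delta in deltas(STATOUT).items():
    print(f"{delta:6d} {name}")
```

As with the manual procedure, the differences are meaningful only if the system was otherwise idle while the test program ran.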

Detailed I/O Analysis with the filemon Command

The filemon command uses the trace facility to obtain a detailed picture of I/O activity during a time interval on the various layers of file system utilization, including the logical file system, virtual memory segments, LVM, and physical disk layers. Data can be collected on all the layers, or layers can be specified with the -O layer option. The default is to collect data on the VM, LVM, and physical layers. Both summary and detailed reports are generated. Since it uses the trace facility, the filemon command can be run only by the root user or by a member of the system group.

To determine whether the filemon command is installed and available, run the following command:

# lslpp -lI perfagent.tools

Tracing is started by the filemon command, optionally suspended with the trcoff subcommand and resumed with the trcon subcommand, and terminated with the trcstop subcommand (you may want to issue the command nice -n -20 trcstop to stop the filemon command, since the filemon command is currently running at priority 40). As soon as tracing is terminated, the filemon command writes its report to stdout.

Note

Only data for those files opened after the filemon command was started will be collected, unless you specify the -u flag.

The filemon command can read the I/O trace data from a specified file, instead of from the real-time trace process. In this case, the filemon report summarizes the I/O activity for the system and period represented by the trace file. This offline processing method is useful when it is necessary to postprocess a trace file from a remote machine or perform the trace data collection at one time and postprocess it at another time.

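The trcrpt -r command must be executed on the trace logfile and its output redirected to another file. The gennames output is also saved so that it can be supplied to the filemon command with the -n flag, as follows:

# gennames > gennames.out
# trcrpt -r trace.out > trace.rpt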

At this point an adjusted trace logfile is fed into the filemon command to report on I/O activity captured by a previously recorded trace session as follows:

# filemon -i trace.rpt -n gennames.out | pg

In this example, the filemon command reads file system trace events from the input file trace.rpt. Because the trace data is already captured on a file, the filemon command does not put itself in the background to allow application programs to be run. After the entire file is read, an I/O activity report for the virtual memory, logical volume, and physical volume levels is displayed on standard output (which, in this example, is piped to the pg command).

If the trace command was run with the -C all flag, then run the trcrpt command also with the -C all flag (see Formatting a Report from trace -C Output).

The following sequence of commands gives an example of the filemon command usage:

# filemon -o fm.out -O all; cp /smit.log /dev/null ; trcstop

The report produced by this sequence, in an otherwise-idle system, is as follows:

Thu Aug 19 11:30:49 1999
System: AIX texmex Node: 4 Machine: 000691854C00

0.369 secs in measured interval
Cpu utilization:  9.0%

Most Active Files
------------------------------------------------------------------------
  #MBs  #opns   #rds   #wrs  file                 volume:inode
------------------------------------------------------------------------
   0.1      1     14      0  smit.log             /dev/hd4:858
   0.0      1      0     13  null
   0.0      2      4      0  ksh.cat              /dev/hd2:16872
   0.0      1      2      0  cmdtrace.cat         /dev/hd2:16739

Most Active Segments
------------------------------------------------------------------------
  #MBs  #rpgs  #wpgs  segid  segtype              volume:inode
------------------------------------------------------------------------
   0.1     13      0   5e93  ???
   0.0      2      0   22ed  ???
   0.0      1      0   5c77  persistent

Most Active Logical Volumes
------------------------------------------------------------------------
  util  #rblk  #wblk   KB/s  volume               description
------------------------------------------------------------------------
  0.06    112      0  151.9  /dev/hd4             /
  0.04     16      0   21.7  /dev/hd2             /usr

Most Active Physical Volumes
------------------------------------------------------------------------
  util  #rblk  #wblk   KB/s  volume               description
------------------------------------------------------------------------
  0.10    128      0  173.6  /dev/hdisk0          N/A




------------------------------------------------------------------------
Detailed File Stats
------------------------------------------------------------------------

FILE: /smit.log  volume: /dev/hd4 (/)  inode: 858
opens:                  1
total bytes xfrd:       57344
reads:                  14      (0 errs)
  read sizes (bytes):   avg  4096.0 min    4096 max    4096 sdev     0.0
  read times (msec):    avg   1.709 min   0.002 max  19.996 sdev   5.092

FILE: /dev/null
opens:                  1
total bytes xfrd:       50600
writes:                 13      (0 errs)
  write sizes (bytes):  avg  3892.3 min    1448 max    4096 sdev   705.6
  write times (msec):   avg   0.007 min   0.003 max   0.022 sdev   0.006

FILE: /usr/lib/nls/msg/en_US/ksh.cat  volume: /dev/hd2 (/usr)  inode: 16872
opens:                  2
total bytes xfrd:       16384
reads:                  4       (0 errs)
  read sizes (bytes):   avg  4096.0 min    4096 max    4096 sdev     0.0
  read times (msec):    avg   0.042 min   0.015 max   0.070 sdev   0.025
lseeks:                 10

FILE: /usr/lib/nls/msg/en_US/cmdtrace.cat  volume: /dev/hd2 (/usr)  inode: 16739
opens:                  1
total bytes xfrd:       8192
reads:                  2       (0 errs)
  read sizes (bytes):   avg  4096.0 min    4096 max    4096 sdev     0.0
  read times (msec):    avg   0.062 min   0.049 max   0.075 sdev   0.013
lseeks:                 8

------------------------------------------------------------------------
Detailed VM Segment Stats   (4096 byte pages)
------------------------------------------------------------------------

SEGMENT: 5e93  segtype: ???
segment flags:
reads:                  13      (0 errs)
  read times (msec):    avg   1.979 min   0.957 max   5.970 sdev   1.310
  read sequences:       1
  read seq. lengths:    avg    13.0 min      13 max      13 sdev     0.0

SEGMENT: 22ed  segtype: ???
segment flags:          inode
reads:                  2       (0 errs)
  read times (msec):    avg   8.102 min   7.786 max   8.418 sdev   0.316
  read sequences:       2
  read seq. lengths:    avg     1.0 min       1 max       1 sdev     0.0

SEGMENT: 5c77  segtype: persistent
segment flags:          pers defer
reads:                  1       (0 errs)
  read times (msec):    avg  13.810 min  13.810 max  13.810 sdev   0.000
  read sequences:       1
  read seq. lengths:    avg     1.0 min       1 max       1 sdev     0.0

------------------------------------------------------------------------
Detailed Logical Volume Stats   (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hd4  description: /
reads:                  5       (0 errs)
  read sizes (blks):    avg    22.4 min       8 max      40 sdev    12.8
  read times (msec):    avg   4.847 min   0.938 max  13.792 sdev   4.819
  read sequences:       3
  read seq. lengths:    avg    37.3 min       8 max      64 sdev    22.9
seeks:                  3       (60.0%)
  seek dist (blks):     init   6344,
                        avg    40.0 min       8 max      72 sdev    32.0
time to next req(msec): avg  70.473 min   0.224 max 331.020 sdev 130.364
throughput:             151.9 KB/sec
utilization:            0.06

VOLUME: /dev/hd2  description: /usr
reads:                  2       (0 errs)
  read sizes (blks):    avg     8.0 min       8 max       8 sdev     0.0
  read times (msec):    avg   8.078 min   7.769 max   8.387 sdev   0.309
  read sequences:       2
  read seq. lengths:    avg     8.0 min       8 max       8 sdev     0.0
seeks:                  2       (100.0%)
  seek dist (blks):     init 608672,
                        avg    16.0 min      16 max      16 sdev     0.0
time to next req(msec): avg 162.160 min   8.497 max 315.823 sdev 153.663
throughput:             21.7 KB/sec
utilization:            0.04

------------------------------------------------------------------------
Detailed Physical Volume Stats   (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk0  description: N/A
reads:                  7       (0 errs)
  read sizes (blks):    avg    18.3 min       8 max      40 sdev    12.6
  read times (msec):    avg   5.723 min   0.905 max  20.448 sdev   6.567
  read sequences:       5
  read seq. lengths:    avg    25.6 min       8 max      64 sdev    22.9
seeks:                  5       (71.4%)
  seek dist (blks):     init 4233888,
                        avg 171086.0 min       8 max  684248 sdev 296274.2
  seek dist (%tot blks):init 48.03665,
                        avg 1.94110 min 0.00009 max 7.76331 sdev 3.36145
time to next req(msec): avg  50.340 min   0.226 max 315.865 sdev 108.483
throughput:             173.6 KB/sec
utilization:            0.10

Using the filemon command in systems with real workloads would result in much larger reports and might require more trace buffer space. Space and CPU time consumption for the filemon command can degrade system performance to some extent. Use a nonproduction system to experiment with the filemon command before starting it in a production environment. Also, use offline processing where possible and, on systems with many CPUs, use the -C all flag with the trace command.

Note

Although the filemon command reports average, minimum, maximum, and standard deviation in its detailed-statistics sections, the results should not be used to develop confidence intervals or other formal statistical inferences. In general, the distribution of data points is neither random nor symmetrical.

Global Reports of the filemon Command

The global reports list the most active files, segments, logical volumes, and physical volumes during the measured interval. They are shown at the beginning of the filemon report. By default, the logical file and virtual memory reports are limited to the 20 most active files and segments, respectively, as measured by the total amount of data transferred. If the -v flag has been specified, activity for all files and segments is reported. All information in the reports is listed from top to bottom as most active to least active.

Most Active Files

#MBs: Total number of MBs transferred over measured interval for this file. The rows are sorted by this field in decreasing order.
#opns: Number of opens for files during measurement period.
#rds: Number of read calls to file.
#wrs: Number of write calls to file.
file: File name (full path name is in detailed report).
volume:inode: The logical volume that the file resides in and the i-node number of the file in the associated file system. This field can be used to associate a file with its corresponding persistent segment shown in the detailed VM segment reports. This field may be blank for temporary files created and deleted during execution.

The most active files are smit.log on logical volume hd4 and file null. The application utilizes the terminfo database for screen management; so the ksh.cat and cmdtrace.cat are also busy. Anytime the shell needs to post a message to the screen, it uses the catalogs for the source of the data.

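To identify unknown files, translate the logical volume name to the mount point of its file system (for example, /dev/hd1 is mounted at /home) and then use the find or the ncheck command with the i-node number. For the smit.log entry above, /dev/hd4 is mounted at / and the i-node number is 858: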

# find / -inum 858 -print
/smit.log

# ncheck -i 858 /
/:
858     /smit.log
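The lookup above needs the volume and i-node number from the volume:inode column. The following Python sketch shows one way to pull those two values out of a Most Active Files row. It is an illustration only: the function name parse_active_file_row is hypothetical, and the parsing assumes whitespace-separated columns, file names without embedded spaces, and a local volume:inode value (remote entries may use a different form).

```python
def parse_active_file_row(row):
    """Split one data row of the Most Active Files report into fields."""
    fields = row.split()
    mbs, opens, reads, writes, name = fields[:5]
    volume, inode = None, None
    if len(fields) > 5:
        # volume:inode can be blank, for example for temporary files
        volume, _, inode_text = fields[5].rpartition(":")
        inode = int(inode_text)
    return {"mbs": float(mbs), "opens": int(opens), "reads": int(reads),
            "writes": int(writes), "file": name,
            "volume": volume, "inode": inode}

# Two rows copied from the sample report
rows = [
    "   0.1      1     14      0  smit.log             /dev/hd4:858",
    "   0.0      1      0     13  null",
]
for row in rows:
    entry = parse_active_file_row(row)
    if entry["inode"] is not None:
        print(f"{entry['file']}: inode {entry['inode']} on {entry['volume']}")
    else:
        print(f"{entry['file']}: no volume:inode reported")
```

The volume and inode values recovered this way are the ones to pass to find -inum or ncheck -i once the volume has been mapped to its mount point.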

Most Active Segments

#MBs: Total number of MBs transferred over measured interval for this segment. The rows are sorted by this field in decreasing order.
#rpgs: Number of 4-KB pages read into segment from disk.
#wpgs: Number of 4-KB pages written from segment to disk (page out).
segid: VMM ID of memory segment.
segtype: Type of segment: working segment, persistent segment (local file), client segment (remote file), page table segment, system segment, or special persistent segments containing file system data (log, root directory, .inode, .inodemap, .inodex, .inodexmap, .indirect, .diskmap).
volume:inode: For persistent segments, name of logical volume that contains the associated file and the file's i-node number. This field can be used to associate a persistent segment with its corresponding file, shown in the Detailed File Stats reports. This field is blank for nonpersistent segments.

If the command is still active, the virtual memory analysis tool svmon can be used to display more information about a segment, given its segment ID (segid), as follows: svmon -D segid. See The svmon Command for a detailed discussion.

In our example, the segtype ??? means that the system cannot identify the segment type, and you must use the svmon command to get more information.

Most Active Logical Volumes

util: Utilization of logical volume.
#rblk: Number of 512-byte blocks read from logical volume.
#wblk: Number of 512-byte blocks written to logical volume.
KB/s: Average transfer data rate in KB per second.
volume: Logical volume name.
description: Either the file system mount point or the logical volume type (paging, jfslog, boot, or sysdump). For example, the logical volume /dev/hd2 is /usr; /dev/hd6 is paging, and /dev/hd8 is jfslog. There may also be the word compressed. This means all data is compressed automatically using Lempel-Ziv (LZ) compression before being written to disk, and all data is uncompressed automatically when read from disk (see Compression for details).

The utilization is reported as a fraction of the measured interval; 0.06 indicates that the volume was 6 percent busy during the measured interval.
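As a cross-check on how these columns relate, the following Python sketch recomputes the KB/s column from the block counts and the measured interval in the sample report. The formula (blocks times 512 bytes, divided by 1024 and by the interval) is an assumption inferred from the field descriptions, and the function names are hypothetical. Because the report prints the interval rounded to 0.369 seconds, the results agree with the reported 151.9 and 21.7 KB/s only to within rounding.

```python
BLOCK_BYTES = 512  # volume traffic is reported in 512-byte blocks

def throughput_kb_per_sec(read_blocks, write_blocks, interval_secs):
    """Average transfer rate in KB/s over the measured interval."""
    total_kb = (read_blocks + write_blocks) * BLOCK_BYTES / 1024
    return total_kb / interval_secs

def utilization_percent(util_fraction):
    """Convert the util column (a fraction) to percent busy."""
    return util_fraction * 100

interval = 0.369  # "secs in measured interval" from the report header (rounded)
hd4 = throughput_kb_per_sec(112, 0, interval)
hd2 = throughput_kb_per_sec(16, 0, interval)
print(f"/dev/hd4: {hd4:.1f} KB/s, {utilization_percent(0.06):.0f} percent busy")
print(f"/dev/hd2: {hd2:.1f} KB/s, {utilization_percent(0.04):.0f} percent busy")
```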

Most Active Physical Volumes

util: Utilization of physical volume.
Note

Logical volume I/O requests start before and end after physical volume I/O requests. Total logical volume utilization will therefore appear to be higher than total physical volume utilization.
#rblk: Number of 512-byte blocks read from physical volume.
#wblk: Number of 512-byte blocks written to physical volume.
KB/s: Average transfer data rate in KB per second.
volume: Physical volume name.
description: Simple description of the physical volume type, for example, SCSI Multimedia CD-ROM Drive or 16 Bit SCSI Disk Drive.

The utilization is reported as a fraction of the measured interval; 0.10 indicates that the volume was 10 percent busy during the measured interval.

Detailed Reports of the filemon Command

The detailed reports give additional information for the global reports. There is one entry for each reported file, segment, or volume in the detailed reports. The fields in each entry are described below for the four detailed reports. Some of the fields report a single value; others report statistics that characterize a distribution of many values. For example, response-time statistics are kept for all read or write requests that were monitored. The average, minimum, and maximum response times are reported, as well as the standard deviation of the response times. The standard deviation shows how much the individual response times deviated from the average. When the response times are roughly normally distributed, approximately two-thirds of them fall between average minus standard deviation (avg - sdev) and average plus standard deviation (avg + sdev); for skewed data, treat this range only as a rough guide. If the distribution of response times is scattered over a large range, the standard deviation will be large compared to the average response time.
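To make the avg - sdev to avg + sdev range concrete, the following Python sketch applies it to the read times reported for /smit.log in the sample output. The helper name one_sigma_band is hypothetical. In this case the lower bound falls below the observed minimum (and below zero), which illustrates why the range is only a rough guide when a few slow requests dominate the statistics.

```python
def one_sigma_band(avg, sdev):
    """Return the (avg - sdev, avg + sdev) range."""
    return avg - sdev, avg + sdev

# read times (msec) for /smit.log, from the sample Detailed File Stats
avg, sdev, observed_min, observed_max = 1.709, 5.092, 0.002, 19.996

low, high = one_sigma_band(avg, sdev)
print(f"avg - sdev = {low:.3f} msec, avg + sdev = {high:.3f} msec")
if low < observed_min:
    # A bound below the smallest observed value signals a skewed distribution
    print("lower bound is below the observed minimum; the data is skewed")
```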

Detailed File Stats

Detailed file statistics are provided for each file listed in the Most Active Files report. These stanzas can be used to determine what access has been made to the file. In addition to the number of total bytes transferred, opens, reads, writes, and lseeks, the user can also determine the read/write size and times.

FILE: Name of the file. The full path name is given, if possible.
volume: Name of the logical volume/file system containing the file.
inode: I-node number for the file within its file system.
opens: Number of times the file was opened while monitored.
total bytes xfrd: Total number of bytes read/written from/to the file.
reads: Number of read calls against the file.
read sizes (bytes): Read transfer-size statistics (avg/min/max/sdev), in bytes.
read times (msec): Read response-time statistics (avg/min/max/sdev), in milliseconds.
writes: Number of write calls against the file.
write sizes (bytes): Write transfer-size statistics.
write times (msec): Write response-time statistics.
lseeks: Number of lseek() subroutine calls.

The read sizes and write sizes will give you an idea of how efficiently your application is reading and writing information. Use a multiple of 4 KB pages for best results.
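The following Python sketch shows how the size statistics can be read against the 4 KB guideline, using the two busiest files from the sample report. The function names are hypothetical. Checking only the minimum and maximum sizes is a heuristic: it is conclusive when the two are equal, and otherwise only indicates whether a mix of sizes was seen.

```python
PAGE_BYTES = 4096

def is_page_multiple(size_bytes):
    """True when a transfer size is a whole number of 4 KB pages."""
    return size_bytes > 0 and size_bytes % PAGE_BYTES == 0

def estimated_total_bytes(calls, avg_size):
    """Number of calls times average size approximates total bytes xfrd."""
    return round(calls * avg_size)

# (calls, avg size, min size, max size) from the sample Detailed File Stats
samples = {
    "/smit.log reads": (14, 4096.0, 4096, 4096),
    "/dev/null writes": (13, 3892.3, 1448, 4096),
}
for label, (calls, avg, smallest, largest) in samples.items():
    aligned = is_page_multiple(smallest) and is_page_multiple(largest)
    verdict = "4 KB multiples" if aligned else "mixed sizes"
    print(f"{label}: about {estimated_total_bytes(calls, avg)} bytes, {verdict}")
```

The estimates match the total bytes xfrd values in the report (57344 and 50600), which is a quick way to confirm that the statistics are being read correctly.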

Detailed VM Segment Stats

Each element listed in the Most Active Segments report has a corresponding stanza that shows detailed information about real I/O to and from memory.

SEGMENT: Internal operating system's segment ID.
segtype: Type of segment contents.
segment flags: Various segment attributes.
volume: For persistent segments, the name of the logical volume containing the corresponding file.
inode: For persistent segments, the i-node number for the corresponding file.
reads: Number of 4096-byte pages read into the segment (that is, paged in).
read times (msec): Read response-time statistics (avg/min/max/sdev), in milliseconds.
read sequences: Number of read sequences. A sequence is a string of pages that are read (paged in) consecutively. The number of read sequences is an indicator of the amount of sequential access.
read seq. lengths: Statistics describing the lengths of the read sequences, in pages.
writes: Number of pages written from the segment to disk (that is, paged out).
write times (msec): Write response-time statistics.
write sequences: Number of write sequences. A sequence is a string of pages that are written (paged out) consecutively.
write seq. lengths: Statistics describing the lengths of the write sequences, in pages.

By examining the reads and read-sequence counts, you can determine if the access is sequential or random. For example, if the read-sequence count approaches the reads count, the file access is more random. On the other hand, if the read-sequence count is significantly smaller than the read count and the read-sequence length is a high value, the file access is more sequential. The same logic applies for the writes and write sequence.
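The comparison described above can be sketched in Python as a ratio of sequences to requests, applied here to two segments from the sample report. The function names and the 0.5 cutoff are illustrative assumptions; filemon does not define such a threshold.

```python
def sequence_ratio(requests, sequences):
    """Sequences per request: near 1.0 suggests random access, near 0.0 sequential."""
    if requests <= 0:
        raise ValueError("need at least one request")
    return sequences / requests

def describe_access(requests, sequences, random_cutoff=0.5):
    # random_cutoff is an arbitrary illustrative threshold, not a filemon value
    ratio = sequence_ratio(requests, sequences)
    return "mostly random" if ratio >= random_cutoff else "mostly sequential"

# segid -> (reads, read sequences) from the sample Detailed VM Segment Stats
segments = {"5e93": (13, 1), "22ed": (2, 2)}
for segid, (reads, seqs) in segments.items():
    print(f"{segid}: {reads} reads in {seqs} sequence(s), "
          f"avg length {reads / seqs:.1f} pages, {describe_access(reads, seqs)}")
```

With only a handful of reads, as for segment 22ed, the classification is not meaningful; the ratio is more useful on reports from real workloads.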

Detailed Logical/Physical Volume Stats

Each element listed in the Most Active Logical Volumes / Most Active Physical Volumes reports will have a corresponding stanza that shows detailed information about the logical/physical volume. In addition to the number of reads and writes, the user can also determine read and write times and sizes, as well as the initial and average seek distances for the logical / physical volume.

VOLUME: Name of the volume.
description: Description of the volume. (Describes contents, if dealing with a logical volume; describes type, if dealing with a physical volume.)
reads: Number of read requests made against the volume.
read sizes (blks): Read transfer-size statistics (avg/min/max/sdev), in units of 512-byte blocks.
read times (msec): Read response-time statistics (avg/min/max/sdev), in milliseconds.
read sequences: Number of read sequences. A sequence is a string of 512-byte blocks that are read consecutively. It indicates the amount of sequential access.
read seq. lengths: Statistics describing the lengths of the read sequences, in blocks.
writes: Number of write requests made against the volume.
write sizes (blks): Write transfer-size statistics.
write times (msec): Write-response time statistics.
write sequences: Number of write sequences. A sequence is a string of 512-byte blocks that are written consecutively.
write seq. lengths: Statistics describing the lengths of the write sequences, in blocks.
seeks: Number of seeks that preceded a read or write request; also expressed as a percentage of the total reads and writes that required seeks.
seek dist (blks): Seek-distance statistics in units of 512-byte blocks. In addition to the usual statistics (avg/min/max/sdev), the distance of the initial seek operation (assuming block 0 was the starting position) is reported separately. This seek distance is sometimes very large; it is reported separately to avoid skewing the other statistics.
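seek dist (cyls): (Physical volume only) Seek-distance statistics in units of disk cylinders. In the sample report above, this field is instead labeled seek dist (%tot blks), and the values appear to express each seek distance as a percentage of the total blocks on the volume.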
time to next req: Statistics (avg/min/max/sdev) describing the length of time, in milliseconds, between consecutive read or write requests to the volume. This column indicates the rate at which the volume is being accessed.
throughput: Total volume throughput in KB per second.
utilization: Fraction of time the volume was busy. The entries in this report are sorted by this field in decreasing order.

A long seek time can increase I/O response time and result in decreased application performance. By examining the reads and read sequence counts, you can determine if the access is sequential or random. The same logic applies to the writes and write sequence.
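The seek figures for /dev/hdisk0 in the sample report can be checked with a short Python sketch. It assumes that the seeks percentage is seeks divided by total reads and writes, and that seek dist (%tot blks) is the seek distance divided by the total number of blocks on the volume; the second interpretation is inferred from the sample numbers rather than stated in the field descriptions. The function names are hypothetical.

```python
def seek_percent(seeks, reads, writes=0):
    """Percentage of read and write requests that required a seek."""
    return 100.0 * seeks / (reads + writes)

def total_blocks_from_init_seek(init_blocks, init_percent):
    """Volume size implied by the two 'init' seek figures."""
    return init_blocks / (init_percent / 100.0)

def seek_percent_of_volume(distance_blocks, total_blocks):
    """Seek distance expressed as a percentage of the volume size."""
    return 100.0 * distance_blocks / total_blocks

# Values from the sample Detailed Physical Volume Stats for /dev/hdisk0
print(f"{seek_percent(5, 7):.1f} percent of requests required a seek")
total = total_blocks_from_init_seek(4233888, 48.03665)
print(f"implied volume size: {total:.0f} blocks")
print(f"avg seek: {seek_percent_of_volume(171086.0, total):.5f} percent of volume")
print(f"max seek: {seek_percent_of_volume(684248, total):.5f} percent of volume")
```

Under these assumptions the recomputed average and maximum agree with the reported 1.94110 and 7.76331, which supports reading the field as a percentage of total blocks.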

Guidelines for Using the filemon Command

Following are some guidelines for using the filemon command:

The /etc/inittab file is always very active. Daemons specified in /etc/inittab are checked regularly to determine whether they are required to be respawned.
The /etc/passwd file is also always very active, because file and directory access permissions are checked.
A long seek time increases I/O response time and decreases performance.
If the majority of the reads and writes require seeks, you might have fragmented files and overly active file systems on the same physical disk. However, for online transaction processing (OLTP) or database systems this behavior might be normal.
If the number of sequences approaches the number of reads and writes, physical disk access is more random than sequential. Sequences are strings of pages (for segments) or 512-byte blocks (for volumes) that are read or written consecutively. The seq. lengths value is the length of the sequences in those units. A random file access can also involve many seeks. In this case, you cannot distinguish from the filemon output whether the file access is random or the file is fragmented. Use the fileplace command to investigate further.
Remote files are listed in the volume:inode column with the remote system name.

Because the filemon command can potentially consume some CPU power, use this tool with discretion, and analyze the system performance while taking into consideration the overhead involved in running the tool. Tests have shown that in a CPU-saturated environment:

With little I/O, the filemon command slowed a large compile by about one percent.
With a high disk-output rate, the filemon command slowed the writing program by about five percent.

Summary for Monitoring Disk I/O

In general, a high % iowait indicates that the system has an application problem, a memory shortage, or an inefficient I/O subsystem configuration. For example, the application problem might be due to requesting a lot of I/O, but not doing much with the data. Understanding the I/O bottleneck and improving the efficiency of the I/O subsystem is the key to solving this bottleneck. Disk sensitivity can come in a number of forms, with different resolutions. Some typical solutions might include:

Limiting the number of active logical volumes and file systems placed on a particular physical disk. The idea is to balance file I/O evenly across all physical disk drives.
Spreading a logical volume across multiple physical disks. This is particularly useful when a number of different files are being accessed.
Creating multiple Journaled File System (JFS) logs for a volume group and assigning them to specific file systems (preferably on fast write cache devices). This is beneficial for applications that create, delete, or modify a large number of files, particularly temporary files.
If the iostat output indicates that your workload I/O activity is not evenly distributed among the system disk drives, and the utilization of one or more disk drives is often 70-80 percent or more, consider reorganizing file systems, such as backing up and restoring file systems to reduce fragmentation. Fragmentation causes the drive to seek excessively and can be a large portion of overall response time.
If large, I/O-intensive background jobs are interfering with interactive response time, you may want to activate I/O pacing.
If it appears that a small number of files are being read over and over again, consider whether additional real memory would allow those files to be buffered more effectively.
If the workload's access pattern is predominantly random, you might consider adding disks and distributing the randomly accessed files across more drives.
If the workload's access pattern is predominantly sequential and involves multiple disk drives, you might consider adding one or more disk adapters. It may also be appropriate to consider building a striped logical volume to accommodate large, performance-critical sequential files.
Using fast write cache devices.
Using asynchronous I/O.

Each technique is discussed later in this chapter.