Date: July 6, 2000
The following files pertain to RS/6000 performance monitoring. The diag_perf_bottleneck file discusses how to identify AIX performance bottlenecks, the iostat and vmstat files describe how to interpret their respective output, and the vmm file describes how AIX manages memory.
These files, and other useful technical information, can be found at:
http://techsupport.services.ibm.com/rs6k/techbrowse/
------------------------------------------------------------------------
Contents

  About this document
  Related information
  Memory bottlenecks
  CPU bottlenecks
  I/O bottlenecks
  SMP performance tuning
  Tuning methodology
  Additional information

------------------------------------------------------------------------
About this document

This document describes how to check for resource bottlenecks and identify the processes that tax them. Resources on a system include memory, CPU, and Input/Output (I/O). This document covers bottlenecks across an entire system. It does not address the bottlenecks of a particular application or general network problems.

The following commands are described:

  * vmstat
  * svmon
  * ps
  * tprof
  * iostat
  * netpmon
  * filemon

NOTE: PAIDE/6000 must be installed in order to use tprof, svmon, netpmon, and filemon. To check if it is installed, run the following command:

  lslpp -l perfagent.tools

If you are at AIX Version 4.3.0 or higher, PAIDE/6000 can be found on the AIX Base Operating System media. Otherwise, to order PAIDE/6000, contact your local IBM representative.

This fax also makes reference to the vmtune and schedtune commands. These commands and their source are found in the /usr/samples/kernel directory. They are installed with the bos.adt.samples fileset.

Related information

Consult Line Performance Analysis - The AIX Support Family offers a system analysis with tuning recommendations. For more information, call the IBM AIX Support Center.

Performance Tuning Guide (SC23-2365) - This IBM publication covers performance monitoring and tuning of AIX systems. Contact your local IBM representative to order.

For detailed system usage on a per-process basis, a free utility called UTLD can be obtained by anonymous ftp from ftp.software.ibm.com in the /aix/tools/perftools/utld directory. For more information, see the README file in /usr/lpp/utld after installation of the utld.obj fileset.
------------------------------------------------------------------------
Memory bottlenecks

The following section describes memory bottleneck solutions with the following commands: vmstat, svmon, ps.

1. vmstat

Run the following command:

  vmstat 1

NOTE: The system may slow down when pi and po are consistently non-zero.

  pi  number of pages per second paged in from paging space
  po  number of pages per second paged out to paging space

When processes on the system require more pages of memory than are available in RAM, working pages may be paged out to paging space and then paged in when they are needed again. Accessing a page from paging space is considerably slower than accessing a page directly from RAM. For this reason, constant paging activity can cause system performance degradation.

NOTE: Memory is over-committed when the fr:sr ratio is high.

  fr  number of pages that must be freed to replenish the free list or to accommodate an active process
  sr  number of pages that must be examined in order to free fr number of pages

An fr:sr ratio of 1:4 means for every one page freed, four pages must be examined. It is difficult to determine a memory constraint based on this ratio alone, and what constitutes a high ratio is workload/application dependent.

The system considers itself to be thrashing when po*SYS > fr, where SYS is a system parameter viewed with the schedtune command. The default value is 0 if a system has 128MB or more, which means that memory load control is disabled. Otherwise, the default is 6. Thrashing occurs when the system spends more time paging than performing work. When this occurs, selected processes may be suspended temporarily, and the system may be noticeably slower.
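As a rough post-processing sketch of the checks above, an awk filter can flag non-zero pi/po values and report the fr:sr ratio. It assumes the standard AIX 4.x vmstat column order (r b avm fre re pi po fr sr cy ...); the function name and thresholds are illustrative, not part of AIX.

```shell
# Sketch: flag paging activity in `vmstat 1` output. Assumes the AIX 4.x
# column order (pi and po are fields 6 and 7; fr and sr are 8 and 9).
# Header lines are skipped by requiring a numeric first field.
check_paging() {
  awk '$1 ~ /^[0-9]+$/ {
    pi = $6; po = $7; fr = $8; sr = $9
    if (pi > 0 || po > 0)
      printf "paging: pi=%d po=%d\n", pi, po
    if (fr > 0)
      printf "fr:sr ratio 1:%.1f\n", sr / fr
  }'
}

# Example with canned vmstat-style output:
check_paging <<'EOF'
kthr     memory              page
 r  b  avm  fre  re  pi  po  fr  sr  cy  in  sy  cs us sy id wa
 1  0 6747 1253   0   4   9  20  80   0 114  10  22  0  1 26  0
EOF
```

In practice this would be fed live data, for example `vmstat 1 10 | check_paging`.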
2. svmon

As root, run the following command:

  # svmon -Pau 10 | more

Sample output:

    Pid Command      Inuse   Pin  Pgspace
  13794 dtwm          1603     1      449

  Pid: 13794
  Command: dtwm

  Segid Type Description           Inuse  Pin Pgspace Address Range
    b23 pers /dev/hd2:24849            2    0       0 0..1
   14a5 pers /dev/hd2:24842            0    0       0 0..2
   6179 work lib data                131    0      98 0..891
   280a work shared library text    1101    0      10 0..65535
    181 work private                287    1     341 0..310:65277..65535
   57d5 pers code,/dev/hd2:61722      82    0       0 0..135

This command lists the top ten memory-using processes and gives a report about each one. In each process report, look where Type = work and Description = private. Check how many 4K (4096-byte) pages are used under the Pgspace column. This is the minimum number of working pages this segment is using in all of virtual memory. A Pgspace number that grows, but never decreases, may indicate a memory leak. Memory leaks occur when an application fails to deallocate memory.

  341 * 4096 = 1,396,736 or 1.4MB of virtual memory

3. ps

Run the following command:

  ps gv | head -n 1; ps gv | egrep -v "RSS" | sort +6b -7 -n -r

  size  amount of memory in KB allocated from page space for the memory segment of Type = work and Description = private for the process, as would be indicated by svmon
  RSS   amount of memory in KB currently in use (in RAM) for the memory segment of Type = work and Description = private, plus the memory segment(s) of Type = pers and Description = code for the process, as would be indicated by svmon
  trs   amount of memory, in KB, currently in use (in RAM) for the memory segment(s) of Type = pers and Description = code for the process, as would be indicated by svmon
  %mem  RSS value divided by the total amount of system RAM in KB, multiplied by 100

------------------------------------------------------------------------
CPU bottlenecks

The following section describes CPU bottleneck solutions using the following commands: vmstat, tprof, ps.
1. vmstat

Run the following command:

  vmstat 1

NOTE: The system may slow down when processes wait on the run queue.

  id  percentage of time the CPU is idle
  r   number of threads on the run queue

If the id value is consistently 0%, the CPU is being used 100% of the time. Look next at the r column to see how many threads are placed on the run queue per second. The higher the number of threads forced to wait on the run queue, the more system performance will suffer.

2. tprof

To find out how much CPU time a process is using, run the following command as root:

  # tprof -x sleep 30

This returns in 30 seconds and creates a file in the current directory called __prof.all. In 30 seconds, the CPU is checked approximately 3000 times. The Total column is the number of times a process was found in the CPU. If one process has 1500 in the Total column, this process has taken 1500/3000, or half, of the CPU time. The tprof output explains exactly what processes the CPU has been running. The wait process runs when no other processes require the CPU and accounts for the amount of idle time on the system.

3. netpmon

To find out how much CPU time a process is using, and how much of that time is spent executing network-related code, run the following command as root:

  # netpmon -o /tmp/netpmon.out -O cpu -v; sleep 30; trcstop

This returns in 30 seconds and creates a file in the /tmp directory called netpmon.out. The CPUTime column indicates the total amount of CPU time for the process. %CPU is the percentage of CPU usage for the process, and Network CPU% is the percentage of total time that the process spent executing network-related code.

4. ps

Run the following commands:

  ps -ef | head -n 1
  ps -ef | egrep -v "UID|0:00|\ 0\ " | sort +3b -4 -n -r

Check the C column for a process penalty for recent CPU usage. The maximum value for this column is 120.

  ps -e | head -n 1
  ps -e | egrep -v "TIME|0:" | sort +2b -3 -n -r

Check the TIME column for process accumulated CPU time.
  ps gu
  ps gu | egrep -v "CPU|kproc" | sort +2b -3 -n -r

Check the %CPU column for process CPU dependency. The percent CPU is the total CPU time divided by the total elapsed time since the process was started.

------------------------------------------------------------------------
I/O bottlenecks

This section describes bottleneck solutions using the following commands: iostat, filemon.

1. iostat

NOTE: High iowait will cause slower performance.

Run the following command:

  iostat 5

  %iowait  percentage of time the CPU is idle while waiting on local I/O
  %idle    percentage of time the CPU is idle while not waiting on local I/O

The time is attributed to iowait when no processes are ready for the CPU but at least one process is waiting on I/O. A high percentage of iowait time indicates that disk I/O is a major contributor to the delay in execution of processes. In general, if system slowness occurs and %iowait is 20% to 25% or higher, investigation of a disk bottleneck is in order.

  %tm_act  percentage of time the disk is busy

NOTE: A high tm_act percentage can indicate a disk bottleneck.

When %tm_act, or time active, for a disk is high, noticeable performance degradation can occur. On some systems, a %tm_act of 35% or higher for one disk can cause noticeably slow performance.

  o Look for busy vs. idle drives. Moving data from more busy to less busy drives may help alleviate a disk bottleneck.
  o Check for paging activity by following the instructions in the "Memory bottlenecks" section. Paging to and from disk will contribute to the I/O load.

2. filemon

To find out what files, logical volumes, and disks are most active, run the following command as root:

  # filemon -u -O all -o /tmp/fmon.out; sleep 30; trcstop

In 30 seconds, a report is created in /tmp/fmon.out.

  o Check for the most active segments, logical volumes, and physical volumes in this report.
  o Check for reads and writes to paging space to determine if the disk activity is true application I/O or is due to paging activity.
  o Check for files and logical volumes that are particularly active. If these are on a busy physical volume, moving some data to a less busy disk may improve performance.

The Most Active Segments report lists the most active files by file system and inode. The mount point of the file system and the inode of the file can be used with the ncheck command to identify unknown files:

  # ncheck -i <inode> <mount point>

This report is useful in determining if the activity is to a file system (segtype = persistent), the JFS log (segtype = log), or to paging space (segtype = working). By examining the reads and read sequences counts, you can determine if the access is sequential or random. As the read sequences count approaches the reads count, file access is more random. The same applies to the writes and write sequences.

------------------------------------------------------------------------
SMP performance tuning

Performance tools

1. SMP only (some SMPs do not support this command)

  cpu_state -l

This displays the current state of each processor (enabled, disabled, or unavailable).

2. AIX tools that have been adapted in order to display more meaningful information on SMP systems

  ps -m -o THREAD
    The BND column will indicate the processor number to which a process/thread is bound, if it is bound.
  pstat -A
    The CPUID column will indicate the processor number to which a process/thread is bound.
  sar -P ALL
    Shows the load on all the processors.
  vmstat
    Displays kthr (kernel threads).
  netpmon -t
    Prints CPU reports on a per-thread basis.

3. Other AIX tools that did not change

  filemon
  iostat
  svmon
  tprof

------------------------------------------------------------------------
Tuning methodology

1. Check availability of processors:  cpu_state -l
2. Check balance between processors:  sar -P ALL
3. Identify bound processes/threads:  ps -m -o THREAD, pstat -A
4. Unbind any bound processes/threads that can and should be unbound.
5. Continue as with a uniprocessor system.
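Returning to the filemon heuristic above (read sequences approaching the reads count indicates random access), a small helper can classify a segment's access pattern. The function name and the factor-of-2 cutoff are illustrative assumptions, not AIX rules; the counts would come from a filemon report.

```shell
# Sketch: classify access as random or sequential from filemon's
# reads vs. read sequences counts. The cutoff of 2 is an assumption.
access_pattern() {
  awk -v reads="$1" -v seqs="$2" 'BEGIN {
    if (seqs > 0 && reads / seqs > 2) print "mostly sequential"
    else print "mostly random"
  }'
}

access_pattern 1000 950   # sequences ~= reads: random access
access_pattern 1000 20    # few long sequences: sequential access
```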
------------------------------------------------------------------------
Additional information

1. KBUFFERS vs. VMM

The Block I/O Buffer Cache (KBUFFERS) is only used when directly accessing a block device, such as /dev/hdisk0. Normal access through the Journaled File System (JFS) is managed by the Virtual Memory Manager (VMM) and does not use the traditional method for caching the data blocks. I/O operations to raw logical volumes or physical volumes do not use the Block I/O Buffer Cache.

2. I/O pacing

Users of AIX occasionally encounter long interactive-application response times when other applications in the system are running large writes to disk. Because most writes are asynchronous, FIFO I/O queues of several megabytes may build up and take several seconds to complete. The performance of an interactive process is severely impacted if every disk read spends several seconds working through the queue.

I/O pacing limits the number of outstanding I/O requests against a file. When a process tries to write to a file whose queue is at the high-water mark (which should be a multiple of 4 plus 1), the process is suspended until enough I/Os have completed to bring the queue for that file to the low-water mark. The delta between the high and low water marks should be kept small. To configure I/O pacing on the system via SMIT, enter the following at the command line as root:

  # smitty chgsys

3. Async I/O

Async I/O is performed in the background and does not block the user process. This improves performance because I/O operations and application processing can run concurrently. However, applications must be specifically written to take advantage of Async I/O, which is managed by the aio daemons running on the system. To configure Async I/O for the system via SMIT, enter the following at the command line as root:

  # smitty aio

[ TechDocs Ref: 90605198014824  Publish Date: Dec. 15, 1999  4FAX Ref: 2445 ]
------------------------------------------------------------------------
Contents

  About this document
  Related documentation
  About vmstat
  Summary statistics

------------------------------------------------------------------------
About this document

This document provides an overview of the output of the vmstat command. This information applies to AIX Versions 4.x.

Related documentation

The fields produced by the -s and -f flags and the [Drives] parameter of vmstat are fully documented in the AIX Performance Tuning Guide, publication number SC23-2365, and in the online product documentation. The AIX and RS/6000 product documentation library is also available at:

  www.rs6000.ibm.com/library/

------------------------------------------------------------------------
About vmstat

Although a system may have sufficient real resources, it may perform below expectations if logical resources are not allocated properly. Use vmstat to determine real and logical resource utilization. It samples kernel tables and counters, normalizes the results, and presents them in an appropriate format. By default, vmstat sends its report to standard output, but it can be run with the output redirected.

vmstat is normally invoked with an interval and a count specified. The interval is the length of time in seconds over which vmstat is to gather and report data. The count is the number of intervals to run. If no parameters are specified, vmstat reports a single record of statistics for the time since the system was booted. There may have been inactivity or fluctuations in the workload, so the results may not represent current activity. Be aware that the first record in the output presents statistics since the last boot (except when invoked with the -f or -s option). In many instances, this data can be ignored.

vmstat reports statistics about processes, virtual memory, paging activity, faults, CPU activity, and disk transfers.
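Since the first data record reports since-boot averages that can usually be ignored, a small filter can drop it before further processing. This is a sketch; the function name is illustrative, and it assumes header lines begin with a non-numeric field, as in standard vmstat output.

```shell
# Sketch: drop the first data record (since-boot statistics) from vmstat
# output, keeping headers and all subsequent interval records.
drop_first_record() {
  awk '$1 !~ /^[0-9]+$/ { print; next }   # pass headers through
       { if (seen++) print }'             # skip the first data line only
}

# Example with canned output (the since-boot record is the 6747 line):
drop_first_record <<'EOF'
 r  b  avm   fre re pi po fr sr cy  in  sy cs us sy id wa
 0  0 6747  1253  0  0  0  0  0  0 114  10 22  0  1 26  0
 1  0 6748  1200  0  0  0  0  0  0 113 118 43 17  4 79  0
EOF
```

Typical usage would be `vmstat 5 10 | drop_first_record`.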
Options and parameters recognized by this tool are indicated by the usage prompt:

  vmstat [-fs] [Drives] [Interval] [Count]

The following figure lists output where the smallest work unit is called a kernel thread (kthr). The r and b under this column represent the number of "threads", not processes, placed on these queues.

  --------------------------------------------------------------------
  kthr     memory              page               faults        cpu
  ----- ----------- ----------------------- ------------- -----------
   r  b   avm   fre re pi po fr sr cy        in  sy  cs   us sy id wa
   0  0  6747  1253  0  0  0  0  0  0       114  10  22    0  1 26  0
   1  0  6747  1253  0  0  0  0  0  0       113 118  43   17  4 79  0
   0  0  6747  1253  0  0  0  0  0  0       118  99  33    8  3 89  0
  --------------------------------------------------------------------
  Figure: Sample output from vmstat 1 3

kthr

The columns under the kthr heading in the output provide information about the average number of threads on various queues.

r

The r column indicates the average number of kernel threads on the run queue at one-second intervals. This field indicates the number of "run-able" threads. The system counts the number of ready-to-run threads once per second and adds that number to an internal counter. vmstat then subtracts the initial value of this counter from the end value and divides the result by the number of seconds in the measurement interval. This value is typically less than five with a stable workload. If this value increases rapidly, look for an application problem. If there are many threads (especially CPU-intensive ones) competing for the CPU resource, it is quite possible they will be scheduled in round-robin fashion. If each one executes for a complete or partial time slice, the number of "run-able" threads could easily exceed 100.

b

The b column shows the average number of kernel threads on the wait queue at one-second intervals (awaiting resource, awaiting input/output).
Kernel threads are placed on the wait queue when they are scheduled for execution but are waiting for one of their process pages to be paged in. Once a second, the system counts the threads waiting and adds that number to an internal counter. vmstat then subtracts the initial value from the end value and divides the result by the number of seconds in the measurement interval. This value is usually near zero. Do not confuse this with wa -- waiting on input/output (I/O).

NOTE: On an SMP system, there will be an additional blocked process shown in the b column. This is for the lrud kproc that is part of the Virtual Memory Manager's (VMM) page-replacement algorithm. Also, on a system with a compressed journaled file system (JFS) mounted, there will be an additional blocked process: the jfsc kproc.

memory

The information under the memory heading provides information about real and virtual memory.

avm

The avm column gives the average number of pages allocated to paging space. (In AIX, a page contains 4096 bytes of data.) When a process executes, space for working storage is allocated on the paging devices (backing store). This can be used to calculate the amount of paging space assigned to executing processes. The number in the avm field divided by 256 yields the number of megabytes (MB), systemwide, allocated to paging space. The lsps -a command also provides information on individual paging spaces.

It is recommended that enough paging space be configured on the system so that the paging space used does not approach 100 percent. When fewer than 128 unallocated pages remain on the paging devices, the system will begin to kill processes to free some paging space.

Versions of AIX before 4.3.2 allocated paging space blocks for pages of memory as the pages were accessed. On a large-memory machine, where the application set is such that paging is never or rarely required, these paging space blocks were allocated but never needed.
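The avm-to-megabytes rule above (4096-byte pages, so 256 pages per MB) can be sketched as a one-line helper; the function name is illustrative.

```shell
# Sketch: convert a vmstat avm page count into systemwide MB of
# allocated paging space, per the avm/256 rule (4KB pages).
avm_to_mb() {
  awk -v avm="$1" 'BEGIN { printf "%.1f MB\n", avm / 256 }'
}

avm_to_mb 6747   # the avm value from the sample figure above
```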
AIX Version 4.3.2 implements deferred paging space allocation, in which paging space blocks are not allocated until paging is necessary, thus helping reduce the paging space requirements of the system. The avm value in vmstat indicates the number of virtual memory (working storage) pages that have been accessed but not necessarily paged out. With the previous policy of "late page space allocation", avm had the same definition. However, since the VMM would allocate a paging space disk block for each working page that was accessed, the number of paging space blocks was equal to avm. The reason for allocating paging space blocks at the time the working pages are accessed is so that if the pages had to be paged out of memory, there would be disk blocks available on the page space logical volumes for the in-memory pages to go to. On systems that never page out to paging space, it is a waste of disk space to have as many page space disk blocks as there is memory. With the deferred policy, page space disk blocks are only allocated for the pages that do need to be paged out.

The avm number will grow as more processes get started and/or existing processes use more working storage. Likewise, the number will shrink as processes exit and/or free working storage.

fre

The fre column shows the average number of free memory frames. A frame is a 4096-byte area of real memory. The system maintains a buffer of memory frames, called the free list, that will be readily accessible when the VMM needs space. The nominal size of the free list varies depending on the amount of real memory installed. On systems with 64MB of memory or more, the minimum value (MINFREE) is 120 frames. For systems with less than 64MB, the value is two times the number of MB of real memory, minus 8. For example, a system with 32MB would have a MINFREE value of 56 free frames.
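The MINFREE sizing rule above can be expressed directly; this is a sketch of the rule as stated in the text, with an illustrative function name.

```shell
# Sketch of the MINFREE rule: 120 frames at 64MB of RAM or more,
# otherwise 2 * (MB of RAM) - 8 frames.
minfree() {
  mb="$1"
  if [ "$mb" -ge 64 ]; then
    echo 120
  else
    echo $(( 2 * mb - 8 ))
  fi
}

minfree 32    # the 32MB example from the text
minfree 128
```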
If the fre value is substantially above the MAXFREE value (which is defined as MINFREE plus 8), it is unlikely that the system is thrashing (continuously paging in and out). However, if the system is thrashing, be assured that the fre value will be small. Most UNIX and AIX operating systems use nearly all available memory for disk caching, so you need not be alarmed if the fre value oscillates between MINFREE and MAXFREE.

page

The information under the page heading includes information about page faults and paging activity.

re

The re column shows the number (rate) of pages reclaimed. Reclaimed pages can satisfy an address translation fault without initiating a new I/O request (the page is still in memory). This includes pages that have been put on the free list but are accessed again before they are reassigned. It also includes pages previously requested by the VMM for which I/O has not yet completed, and those pre-fetched by the VMM's read-ahead mechanism but hidden from the faulting segment.

pi

The pi column details the number (rate) of pages paged in from paging space. Paging space is the part of virtual memory that resides on disk. It is used as an overflow when memory is overcommitted. Paging space consists of logical volumes dedicated to the storage of working set pages that have been stolen from real memory. When a stolen page is referenced by the process, a page fault occurs and the page must be read into memory from paging space.

There is no "good" number for this due to the variety of configurations of hardware, software, and applications. One theory is that five page-ins per second should be the upper limit. Use this theoretical maximum as a reference but do not adhere to it rigidly. This field is important as a key indicator of paging space activity. If a page-in occurs, there must have been a previous page-out for that page.
It is also likely in a memory-constrained environment that each page-in will force a different page to be stolen and, therefore, paged out.

po

The po column shows the number (rate) of pages paged out to paging space. Whenever a page of working storage is stolen, it is written to paging space. If it is not referenced again, it remains on the paging device until the process terminates or disclaims the space. Subsequent references to addresses contained within the faulted-out pages result in page faults, and the pages are paged in individually by the system. When a process terminates normally, any paging space allocated to that process is freed. If the system is reading in a significant number of persistent pages, you may see an increase in po without corresponding increases in pi. This situation does not necessarily indicate thrashing but may warrant investigation into the data access patterns of the applications.

fr

The fr column details the number (rate) of pages freed. As the VMM page-replacement routine scans the Page Frame Table (PFT), it uses criteria to select which pages are to be stolen to replenish the free list of available memory frames. The total pages stolen by the VMM -- both working (computational) and file (persistent) pages -- are reported as a rate per second. Just because a page has been freed does not mean that any I/O has taken place. For example, if a persistent storage (file) page has not been modified, it will not be written back to the disk. If I/O is not necessary, minimal system resources are required to free a page.

sr

The sr column details the number (rate) of pages scanned by the page-replacement algorithm. The VMM page-replacement code scans the PFT and steals pages until the number of frames on the free list is at least the MAXFREE value. The page-replacement code may have to scan many entries in the Page Frame Table before it can steal enough to satisfy the free list requirements.
With stable, unfragmented memory, the scan rate and free rate may be nearly equal. On systems with multiple processes using many different pages, the pages are more volatile and disjointed. In this scenario, the scan rate may greatly exceed the free rate.

cy

The cy column provides the rate of complete scans of the Page Frame Table. It shows how many times (per second) the page-replacement code has scanned the PFT. Since the free list can be replenished without a complete scan of the PFT, and because all of the vmstat fields are reported as integers, this field is usually zero.

faults

The information under the faults heading in the vmstat output provides information about process control.

in

The in column shows the number (rate) of device interrupts. This column shows the number of hardware or device interrupts (per second) observed over the measurement interval. Examples of interrupts are disk request completions and the 10-millisecond clock interrupt. Since the latter occurs 100 times per second, the in field is always greater than 100.

sy

The sy column details the number (rate) of system calls. Resources are available to user processes through well-defined system calls. These calls instruct the kernel to perform operations for the calling process and exchange data between the kernel and the process. Since workloads and applications vary and different calls perform different functions, it is impossible to say how many system calls per second are too many.

cs

The cs column shows the number (rate) of context switches. The physical CPU resource is subdivided into logical time slices of 10 milliseconds each. Assuming a process is scheduled for execution, it will run until its time slice expires, it is preempted, or it voluntarily gives up control of the CPU. When another process is given control of the CPU, the context, or working environment, of the previous process must be saved and the context of the current process must be loaded.
AIX has a very efficient context-switching procedure, so each switch is inexpensive in terms of resources. Any significant increase in context switches is cause for further investigation.

cpu

The information under the cpu heading in the vmstat output provides a breakdown of CPU usage.

us

The us column shows the percent of CPU time spent in user mode. Processes execute in user mode or system (kernel) mode. When in user mode, a process executes within its code and does not require kernel resources to perform computations, manage memory, set variables, and so on.

sy

The sy column details the percent of CPU time spent in system mode. If a process needs kernel resources, it must execute a call and go into system mode to make that resource available. I/O to a drive, for example, requires a call to open the device, seek, and read and write data. This field shows the percent of time the CPU was in system mode.

Optimum use would have the CPU working 100 percent of the time. This holds true in the case of a single-user system with no need to share the CPU. Generally, if us+sy time is below 90 percent, a single-user system is not considered CPU constrained. However, if us+sy time on a multi-user system exceeds 80 percent, the processes may spend time waiting in the run queue. Response time and throughput might suffer.

id

The id column shows the percent of time the CPU was idle with no pending local disk I/O. If there are no processes available for execution (the run queue is empty), the system dispatches a process called wait. The ps report (with the -k or g option) identifies this as kproc with a process ID (PID) of 516. Do not worry if your ps report shows a high aggregate time for this process. It means you have had significant periods of time when no other processes could run. If there are no I/Os pending to a local disk, all time charged to the wait process is classified as idle time.

wa

The wa column details CPU idle time (percent) with pending local disk I/O.
If there is at least one outstanding I/O to a local disk when the wait process is running, the time is classified as "waiting on I/O". A wa value over 40 percent could indicate that the disk subsystem may not be balanced properly, or it may be the result of a disk-intensive workload. If there is only one process available for execution -- often the case on a technical workstation -- there may be no way to avoid waiting on I/O.

NOTE: The wa column on SMP machines running AIX Version 4.3.2 or earlier is somewhat exaggerated. This is due to the method used in calculating wio.

Method used in AIX 4.3.2 and earlier AIX versions: At each clock interrupt on each processor (100 times a second in AIX), a determination is made as to which of four categories (usr/sys/wio/idle) to place the last 10 milliseconds of time. If the CPU was busy in usr mode at the time of the clock interrupt, the usr category gets the clock tick. If the CPU was busy in kernel mode at the time of the clock interrupt, the sys category gets the tick. If the CPU was not busy, a check is made to see if any I/O to disk is in progress. If any disk I/O is in progress, the wio category is incremented. If no disk I/O is in progress and the CPU is not busy, the idle category gets the tick.

------------------------------------------------------------------------
Summary statistics

vmstat with the -s option reports absolute counts of various events since the system was booted. There are 23 separate events reported in the vmstat -s output; the following 4 have proven most helpful. The 19 remaining fields contain a variety of activities, from address translation faults to lock misses to system calls. The information in those 19 fields is also valuable but is less frequently used.

page ins

The page ins field shows the number of systemwide page-ins. When a page is read from disk to memory, this count is incremented.
It is a count of VMM-initiated read operations and, together with the page outs field, represents the real I/O (disk reads and writes) initiated by the VMM.

page outs

The page outs field shows the number of systemwide page-outs. When a page is written from memory to disk, this count is incremented. The page outs field value is a total count of VMM-initiated write operations and, together with the page ins field, represents the total amount of real I/O initiated by the VMM.

paging space page ins

The paging space page ins field is the count of ONLY pages read from paging space.

paging space page outs

The paging space page outs field is the count of ONLY pages written to paging space.

Using the summary statistics

The four preceding fields can be used to indicate how much of the system's I/O is for persistent storage. If the value for paging space page ins is subtracted from the (systemwide) value for page ins, the result is the number of pages that were read from persistent storage (files). Likewise, if the value for paging space page outs is subtracted from the (systemwide) value for page outs, the result is the number of persistent pages (files) that were written to disk.

Remember that these counts apply to the time since system initialization. If you need counts for a given time interval, execute vmstat -s at the start of the interval and again at the end. The deltas between like fields of the two reports give the counts for the interval. It is easiest to redirect the output of the reports to a file and then perform the math.

[ TechDocs Ref: 90605226914708  Publish Date: Nov. 04, 1999  4FAX Ref: 6220 ]
Contents

About this document
Syntax
TTY statistics in the iostat output
CPU statistics in the iostat output
Disk statistics in the iostat output
Analyzing the data
Conclusions

About this document

This document is based on "Performance Tuning: A Continuing Series -- The iostat Tool", by Barry Saad, from the January/February 1994 issue of AIXTRA: IBM'S MAGAZINE FOR AIX PROFESSIONALS. This article discusses iostat and how it can identify I/O-subsystem and CPU bottlenecks. iostat works by sampling the kernel's address space and extracting data from various counters that are updated every clock tick (1 clock tick = 10 milliseconds [ms]). The results -- covering TTY, CPU, and I/O subsystem activity -- are reported as per-second rates or as absolute values for the specified interval. This document applies to AIX Versions 4.x.

------------------------------------------------------------------------

Syntax

Normally, iostat is issued with both an interval and a count specified, with the report sent to standard output or redirected to a file. The command syntax appears below:

iostat [-t] [-d] [Drives] [Interval [Count]]

The -d flag causes iostat to provide only disk statistics for all drives. The -t flag causes iostat to provide only system-wide TTY and CPU statistics.

NOTE: The -t and -d options are mutually exclusive.

If you specify one or more drives, the output is limited to those drives. Multiple drives can be specified; separate them in the syntax with spaces.

You can specify a time in seconds for the interval between records to be included in the reports. The initial record contains statistics for the time since system boot. Succeeding records contain data for the preceding interval. If no interval is specified, a single record is generated. If you specify an interval, the count of the number of records to be included in the report can also be specified. If you specify an interval without a count, iostat will continue running until it is killed.
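The flags and operands above combine as in the following sketch. The drive name hdisk0 is a hypothetical example, and each command is wrapped in an echo so the sketch is harmless on systems without AIX iostat; on a real system you would run the commands directly.

```shell
#!/bin/sh
# Illustrative iostat command lines built from the syntax above.
# run() only echoes the command; on AIX, drop the wrapper.
run() { echo "would run: $*"; }

run iostat 5 12          # 12 records, one every 5 seconds (first is since boot)
run iostat -d hdisk0 2   # disk-only statistics for hdisk0 until killed
run iostat -t 2 10       # system-wide TTY and CPU statistics only
```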
tty:      tin     tout    avg-cpu:  % user   % sys   % idle   % iowait
          2.2     3.3               0.4      1.3     97.7     0.6

Disks:    % tm_act    Kbps    tps    Kb_read    Kb_wrtn
hdisk0    0.4         1.1     0.3    117675     1087266
hdisk1    0.3         1.0     0.2    59230      1017734
hdisk2    0.0         0.2     0.0    180189     46832
cd0       0.0         0.0     0.0    0          0

tty:      tin     tout    avg-cpu:  % user   % sys   % idle   % iowait
          2.2     3.3               0.4      1.3     97.7     0.6

Disks:    % tm_act    Kbps    tps    Kb_read    Kb_wrtn
hdisk0    0.4         1.1     0.3    117675     1087266
hdisk1    0.3         1.0     0.2    59230      1017734
hdisk2    0.0         0.2     0.0    180189     46832
cd0       0.0         0.0     0.0    0          0

Figure 1. Sample Output from iostat 2 2

The following sections explain the output.

------------------------------------------------------------------------

TTY statistics in the iostat output

The two columns of TTY information (tin and tout) in the iostat output show the number of characters read and written by all TTY devices, including both real and pseudo TTY devices. Real TTY devices are those connected to an asynchronous port. Some "pseudo TTY devices" are shells, telnet sessions, and aixterms.

* tin shows the total characters per second read by all TTY devices.
* tout indicates the total characters per second written to all TTY devices.

Generally there are fewer input characters than output characters. For example, assume you run the following:

iostat -t 1 30
cd /usr/sbin
ls -l

You will see few input characters and many output characters. On the other hand, applications such as vi result in a smaller difference between the number of input and output characters. Analysts using modems for asynchronous file transfer may notice the number of input characters exceeding the number of output characters. Naturally, this depends on whether the files are being sent or received relative to the measured system.
Since the processing of input and output characters consumes CPU resource, look for a correlation between increased TTY activity and CPU utilization. If such a relationship exists, evaluate ways to improve the performance of the TTY subsystem. Steps that could be taken include changing the application program, modifying TTY port parameters during file transfer, or perhaps upgrading to a faster or more efficient asynchronous communications adapter.

------------------------------------------------------------------------

CPU statistics in the iostat output

The CPU statistics columns (% user, % sys, % idle, and % iowait) provide a breakdown of CPU usage. This information is also reported in the output of the vmstat command in the columns labeled us, sy, id, and wa.

% user
The % user column shows the percentage of CPU resource spent in user mode. A UNIX process can execute in user or system mode. When in user mode, a process executes within its own code and does not require kernel resources.

% sys
The % sys column shows the percentage of CPU resource spent in system mode. This includes CPU resource consumed by kernel processes (kprocs) and others that need access to kernel resources. For example, the reading or writing of a file requires kernel resources to open the file, seek to a specific location, and read or write data. A UNIX process accesses kernel resources by issuing system calls.

Typically, the system is CPU bound (the CPU is the pacing resource) if the sum of user and system time exceeds 90 percent of CPU resource on a single-user system or 80 percent on a multi-user system. This condition means that the CPU is the limiting factor in system performance. The ratio of user to system mode is determined by workload and is more important when tuning an application than when evaluating performance.

A key factor when evaluating CPU performance is the size of the run queue (provided by the vmstat command).
In general, as the run queue increases, users will notice degradation (an increase) in response time.

% idle
The % idle column shows the percentage of CPU time spent idle, or waiting, without pending local disk I/O. If there are no processes on the run queue, the system dispatches a special kernel process called wait. On most AIX systems, the wait process ID (PID) is 516.

% iowait
The % iowait column shows the percentage of time the CPU was idle with pending local disk I/O. The iowait state is different from the idle state in that at least one process is waiting for local disk I/O requests to complete. Unless the process is using asynchronous I/O, an I/O request to disk causes the calling process to block (or sleep) until the request is completed. Once a process's I/O request completes, it is placed on the run queue.

In general, a high iowait percentage indicates the system has a memory shortage or an inefficient I/O subsystem configuration. Understanding the I/O bottleneck and improving the efficiency of the I/O subsystem require more data than iostat can provide. However, typical solutions might include:

* limiting the number of active logical volumes and file systems placed on a particular physical disk (The idea is to balance file I/O evenly across all physical disk drives.)
* spreading a logical volume across multiple physical disks (This is useful when a number of different files are being accessed.)
* creating multiple JFS logs for a volume group and assigning them to specific file systems (This is beneficial for applications that create, delete, or modify a large number of files, particularly temporary files.)
* backing up and restoring file systems to reduce fragmentation (Fragmentation causes the drive to seek excessively and can be a large portion of overall response time.)
* adding additional drives and rebalancing the existing I/O subsystem

On systems running a primary application, a high I/O wait percentage may be related to workload.
In this case, there may be no way to overcome the problem. On systems with many processes, some will be running while others wait for I/O. In this case, the iowait can be small or zero because running processes "hide" the wait time. Even though iowait is low, a bottleneck may still limit application performance. To understand the I/O subsystem thoroughly, examine the statistics in the next section.

NOTE: The %iowait column on SMP machines running AIX Version 4.3.2 or earlier is somewhat exaggerated. This is due to the method used in calculating wio.

Method used in AIX 4.3.2 and earlier AIX versions

At each clock interrupt on each processor (100 times a second in AIX), a determination is made as to which of four categories (usr/sys/wio/idle) should receive the last 10 ms of time. If the CPU was busy in usr mode at the time of the clock interrupt, then usr gets the clock tick added into its category. If the CPU was busy in kernel mode at the time of the clock interrupt, then the sys category gets the tick. If the CPU was NOT busy, a check is made to see if any I/O to disk is in progress. If any disk I/O is in progress, the wio category is incremented. If NO disk I/O is in progress and the CPU is not busy, the idle category gets the tick.

------------------------------------------------------------------------

Disk statistics in the iostat output

The disk statistics portion of the iostat output provides a breakdown of I/O usage. This information is useful in determining whether a physical disk is limiting performance.

Disk I/O history

The system maintains a history of disk activity by default. Note that history is disabled if you see the message:

Disk history since boot not available.

This message displays only in the first output record from iostat. Disk I/O history should be enabled, since the CPU resource used in maintaining it is insignificant.
History-keeping can be disabled or enabled in SMIT under the following path:

-> System Environments
   -> Change/Show Characteristics of Operating System
      -> Continuously maintain DISK I/O history -> true | false

Choose true to enable history-keeping or false to disable it.

Disks
The Disks: column shows the names of the physical volumes. They are either hdisk or cd followed by a number. (hdisk0 and cd0 refer to the first physical disk drive and the first CD drive, respectively.)

% tm_act
The % tm_act column shows the percentage of time the volume was active. This is the primary indicator of a bottleneck. A drive is active during data transfer and command processing, such as seeking to a new location. The disk-use percentage is directly proportional to resource contention and inversely proportional to performance. As disk use increases, performance decreases and response time increases. In general, when a disk's use exceeds 70 percent, processes are waiting longer than necessary for I/O to complete because most UNIX processes block (or sleep) while waiting for their I/O requests to complete.

Kbps
Kbps shows the amount of data read from and written to the drive in KBs per second. This is the sum of Kb_read plus Kb_wrtn, divided by the number of seconds in the reporting interval.

tps
tps reports the number of transfers per second. A transfer is an I/O request at the device driver level.

Kb_read
Kb_read reports the total data (in KBs) read from the physical volume during the measured interval.

Kb_wrtn
Kb_wrtn shows the amount of data (in KBs) written to the physical volume during the measured interval.

------------------------------------------------------------------------

Analyzing the data

Taken alone, none of the preceding fields has an unacceptable value, because the statistics are too closely related to application characteristics, system configuration, and types of physical disk drives and adapters.
Therefore, when evaluating data, you must look for patterns and relationships. The most common relationship is between disk utilization and data transfer rate. To draw any valid conclusions from this data, you must understand the application's disk data access patterns -- sequential, random, or a combination -- and the type of physical disk drives and adapters on the system. For example, if an application reads and writes sequentially, you should expect a high disk-transfer rate when you have a high disk-busy rate. (NOTE: Kb_read and Kb_wrtn can confirm an understanding of an application's read and write behavior, but they provide no information on the data access patterns.)

Generally you do not need to be concerned about a high disk-busy rate as long as the disk-transfer rate is also high. However, if you get a high disk-busy rate and a low data-transfer rate, you may have a fragmented logical volume, file system, or individual file.

What is a high data-transfer rate? That depends on the disk drive and the effective data-transfer rate for that drive. You should expect numbers between the effective sequential and effective random disk-transfer rates. Below is a chart of effective transfer rates for several common SCSI-1 and SCSI-2 disk drives.

Table 1. Effective Transfer Rate (KB/sec), Part 1 of 2

TYPE OF ACCESS      400 MB DRIVE   670 MB DRIVE   857 MB DRIVE
Read-Sequential         1589           1525           2142
Read-Random              241            172            262
Write-Sequential        1185           1108           1588
Write-Random             327            275            367

Table 2. Effective Transfer Rate (KB/sec), Part 2 of 2

TYPE OF ACCESS      1.2GB DRIVE   1.37GB DRIVE   1.2GB DRIVE   1.37GB S-2 DRIVE
Read-Sequential         2169           2667          2180            3123
Read-Random              292            299           385             288
Write-Sequential        1464           2189          2156            2357
Write-Random             362            491           405             549

The transfer rates were determined during performance testing and give more accurate expectations of disk performance than the media-transfer rate, which reflects the hardware capability and does not account for operating system and application overhead.
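The high-busy/low-transfer pattern described above can be checked mechanically. In this sketch the report lines and the 500 KB-per-second cutoff are hypothetical; the fields mimic the iostat disk columns (% tm_act, Kbps, tps, Kb_read, Kb_wrtn).

```shell
#!/bin/sh
# Flag drives that are busy more than 70 percent of the time yet move
# little data -- a possible sign of fragmentation. Sample data only.
cat > disks.txt <<'EOF'
hdisk0   85.0    120.5    60.2    602500         0
hdisk1   20.0    800.0    90.1   4000000         0
hdisk2   92.0   1500.0   180.3   7500000         0
EOF
suspects=$(awk '$2 > 70 && $3 < 500 { print $1 }' disks.txt)
echo "high busy, low transfer: $suspects"
rm -f disks.txt
```

Here hdisk2 is busy but also moving a lot of data, so only hdisk0 is flagged for further investigation.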
Another use of the data is to answer the question: "Do I need another SCSI adapter?" If you've ever been asked this question, you probably provided a generic answer or just plain guessed. You can use data captured by iostat to answer the question accurately by tracking transfer rates, finding the maximum data-transfer rate for each disk.

Assume that the maximum rate occurs simultaneously for all drives (the worst case). For maximum aggregate performance, the measured transfer rates for drives attached to a given adapter must be below the effective SCSI adapter throughput rating. For planning purposes, you should use 70 percent of the adapter's rated throughput (for example, 2.8 MB per second for a SCSI-1 adapter). This percentage should provide a sufficient buffer for occasional peak rates that may occur. When adding a drive, you must assume the data-transfer rate. At least you will have the collected data and the effective transfer rates to use as a basis.

Keep in mind that the SCSI adapter may be saturated if the data-transfer rates over multiple intervals approach the effective SCSI adapter throughput rating. In that case, the preceding analysis is invalid.

------------------------------------------------------------------------

Conclusions

The primary purpose of the iostat tool is to detect I/O bottlenecks by monitoring disk utilization (the %tm_act field). iostat can also be used to identify CPU problems, assist in capacity planning, and provide insight into solving I/O problems. Armed with both vmstat and iostat, you can capture the data required to identify performance problems related to CPU, memory, and I/O subsystems.

[ TechDocs Ref: 90605205314704 Publish Date: Oct. 15, 1999 4FAX Ref: 9779 ]
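The adapter-sizing arithmetic described in the preceding section can be sketched as follows. The per-drive maxima are hypothetical; the 4000 KB-per-second rating corresponds to the SCSI-1 example above, 70 percent of which is the 2.8 MB-per-second planning figure.

```shell
#!/bin/sh
# Compare the sum of each drive's maximum measured transfer rate
# against 70 percent of the adapter's rated throughput.
rated_kbps=4000                        # SCSI-1 adapter rating
ceiling=$(( rated_kbps * 70 / 100 ))   # 2800 KB/sec planning ceiling
total=$(( 900 + 750 + 1300 ))          # hypothetical per-drive maxima
if [ "$total" -le "$ceiling" ]; then
  echo "aggregate ${total} KB/sec fits under the ${ceiling} KB/sec ceiling"
else
  echo "aggregate ${total} KB/sec exceeds the ${ceiling} KB/sec ceiling; consider another adapter"
fi
```

In this made-up case the 2950 KB/sec aggregate exceeds the 2800 KB/sec ceiling, so the worst-case load argues for spreading the drives across another adapter.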
------------------------------------------------------------------------

Contents

About this document
VMM overview
Real-memory management
Free list
Persistent vs. working memory segments
Paging space and virtual memory
VMM memory load control facility
VMSTAT's avm field
VMSTAT's fre field
How the system is using memory
Explanation of svmon output

------------------------------------------------------------------------

About this document

This document addresses how RAM and paging space are used. This information applies to AIX Version 4.x.

------------------------------------------------------------------------

VMM overview

The Virtual Memory Manager (VMM) services memory requests from the system and its applications. Virtual-memory segments are partitioned into units called pages; each page is either located in physical memory (RAM) or stored on disk until it is needed. AIX uses virtual memory in order to address more memory than is physically available in the system. The management of memory pages in RAM or on disk is handled by the VMM.

------------------------------------------------------------------------

Real-memory management

In AIX, virtual-memory segments are partitioned into 4096-byte units called pages. Real memory is divided into 4096-byte page frames. The VMM has two major functions: 1) manage the allocation of page frames, and 2) resolve references to virtual-memory pages that are not currently in RAM (stored in paging space) or do not yet exist. In order to accomplish its task, the VMM maintains a free list of available page frames. The VMM also uses a page-replacement algorithm to determine which virtual-memory pages currently in RAM will have their page frames reassigned to the free list. The page-replacement algorithm takes into account the existence of persistent vs. working segments, repaging, and VMM thresholds.
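The 4096-byte partitioning above implies a simple relationship between RAM size and the number of page frames. A minimal sketch, using a hypothetical 256 MB machine:

```shell
#!/bin/sh
# Number of 4096-byte page frames provided by a given amount of RAM.
ram_mb=256
page_size=4096
frames=$(( ram_mb * 1024 * 1024 / page_size ))
echo "${ram_mb} MB of RAM = ${frames} page frames"
```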
------------------------------------------------------------------------

Free list

The VMM maintains a list of free page frames that it uses to satisfy page faults. The free list is made up of unallocated page frames. AIX tries to use all of RAM all of the time, except for a small amount which it maintains on the free list. To maintain this small amount of unallocated pages, the VMM uses page-outs and page steals to free up space and reassign those page frames to the free list. The virtual-memory pages whose page frames are to be reassigned are selected via the VMM's page-replacement algorithm.

------------------------------------------------------------------------

Persistent vs. working memory segments

AIX distinguishes between different types of memory segments, and to understand the Virtual Memory Manager, it is important to understand the difference between working and persistent segments.

A persistent segment has a permanent storage location on disk. Files containing data or executable programs are mapped to persistent segments. When a Journaled File System (JFS) file is opened and accessed, the file data is copied into RAM. VMM parameters control when physical memory frames allocated to persistent pages should be overwritten and used to store other data.

Working segments are transitory, exist only during their use by a process, and have no permanent disk storage location. Process stack and data regions are mapped to working segments, as are shared-library text segments. Pages of working segments must also have disk storage locations to occupy when they cannot be kept in real memory. The disk paging space is used for this purpose. When a program exits, all of its working pages are placed back on the free list immediately.

------------------------------------------------------------------------

Paging space and virtual memory

Working pages in RAM that can be modified and paged out are assigned a corresponding slot in paging space.
The allocated paging space is only used if the page needs to be paged out. However, an allocated page in paging space cannot be used by another page. It remains reserved for a particular page for as long as that page exists in virtual memory. Since persistent pages are paged out to the same disk locations from which they came, paging space does not need to be allocated for persistent pages residing in RAM.

The VMM has three modes for allocating paging space: early paging space allocation, late paging space allocation, and deferred paging space allocation. The early allocation policy reserves paging space whenever a memory request for a working page is made. The late allocation policy only assigns paging space when the working page is referenced. AIX Versions 4.3.2 and later implement a deferred paging space allocation policy: paging space blocks are not allocated until the working pages are actually paged out of memory. This significantly reduces the paging space requirements of the system.

------------------------------------------------------------------------

VMM memory load control facility

When a process references a virtual-memory page that is on disk, because it either has been paged out or has never been read, the referenced page must be paged in, and this may cause one or more pages to be paged out if the number of available free page frames is low. The VMM attempts to steal page frames that have not been recently referenced, and thus are unlikely to be referenced in the near future, via the page-replacement algorithm.

A successful page-replacement strategy keeps the memory pages of all currently active processes in RAM, while the memory pages of inactive processes are paged out. However, when RAM is over-committed, it becomes difficult to choose pages for page-out, because almost any page is likely to be referenced in the near future by the currently running processes.
The result is that pages that will soon be referenced still get paged out and then paged in again later. When this happens, continuous paging in and paging out may occur if RAM is over-committed. This condition is called thrashing. The system spends most of its time paging in and paging out instead of executing useful instructions, and none of the active processes make any significant progress. The VMM has a memory load control algorithm that detects when the system is thrashing and then attempts to correct the condition.

------------------------------------------------------------------------

VMSTAT's avm field

avm stands for "Active Virtual Memory", not "Available Memory". The avm value in vmstat indicates the number of virtual-memory pages that have been accessed but not necessarily paged out. Under the earlier late paging space allocation policy, avm had the same definition, but because the VMM allocated a paging space disk block for each working page that was accessed, the number of allocated paging space blocks was equal to avm. With the deferred policy, paging space disk blocks are only allocated for the pages that need to be paged out. The avm number will grow as more processes get started and/or existing processes use more working storage. Likewise, the number will shrink as processes exit and/or free working storage.

------------------------------------------------------------------------

VMSTAT's fre field

fre is the number of 4K pages that are currently on the free list. When an application terminates, all of its working pages are immediately returned to the free list. Its persistent pages, however, remain in RAM and are not added back to the free list until they are stolen by the VMM for other programs. Persistent pages are also freed if the corresponding file is deleted. For this reason, the fre value may not reflect all the real memory that can be made readily available for use by processes.
If a page frame is needed, persistent pages related to terminated applications are among the first to be handed over to another program.

The minimum number of pages that the Virtual Memory Manager keeps on the free list is determined by the minfree parameter of vmtune. If the number of pages on the free list drops below minfree, the Virtual Memory Manager steals pages until the free list has been restored to the maxfree value.

------------------------------------------------------------------------

How the system is using memory

The svmon command can be used to determine roughly how much memory the system is using.

NOTE: PAIDE/6000 must be installed in order to use svmon. To check whether it is installed, execute the following command: $ lslpp -l perfagent.tools. If you are at AIX Version 4.3.0 or higher, this fileset can be found on the AIX Base Operating System media. Otherwise, to order PAIDE/6000, call IBM DIRECT at 1-800-426-2255 or contact your local IBM representative.

As root, type svmon. Under the pgspace heading, the inuse field is the number of working pages that are in use in all of virtual memory.

------------------------------------------------------------------------

Explanation of svmon output

memory:
SIZE   total size of memory in 4K pages
INUSE  number of pages in RAM that are in use by a process, plus the number of persistent pages that belonged to a terminated process and are still resident in RAM. This value is the total size of memory minus the number of pages on the free list.
FREE   number of pages on the free list.
PIN    number of pages pinned in RAM (a pinned page is a page that is always resident in RAM and cannot be paged out)

in use:
WORK   number of working pages in RAM
PERS   number of persistent pages in RAM
CLNT   number of client pages in RAM (a client page is a remote file page)

pin:
WORK   number of working pages pinned in RAM
PERS   number of persistent pages pinned in RAM
CLNT   number of client pages pinned in RAM

pgspace:
SIZE   total size of paging space in 4K pages
INUSE  total number of allocated slots. (See the explanation above on allocation of paging space.)

To find out how much memory a process is using, type:

$ svmon -P PID       (for one process)
$ svmon -Pau | more  (for all processes)

To see the number of working pages unique to a process's private stack and data regions in all of virtual memory, look at the entry with type work and description private. The svmon output may also list several shared segments. For a complete picture, determine which segments are unique to an individual process and which are shared with other programs. Multiply the values by 4096 to get the number of bytes of memory the process is using. The number 4096 comes from the fact that each page is 4 KB in size. You can also divide the number of pages by 256 in order to get megabytes.

[ TechDocs Ref: 90605226714706 Publish Date: May. 05, 2000 4FAX Ref: 2449 ]
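The page-count arithmetic above can be sketched in a few lines. The 5120-page figure is a hypothetical svmon value; each page is 4096 bytes, so dividing pages by 256 yields megabytes.

```shell
#!/bin/sh
# Convert an svmon page count to bytes and megabytes (4 KB pages).
pages=5120
bytes=$(( pages * 4096 ))
mb=$(( pages / 256 ))
echo "${pages} pages = ${bytes} bytes = ${mb} MB"
```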
Bruce Spencer,
baspence@us.ibm.com
July 6, 2000