Date: July 6, 2000
The following files pertain to RS/6000 performance monitoring. The diag_perf_bottleneck file discusses how to identify AIX performance bottlenecks, the iostat and vmstat files describe how to interpret their respective output, and the vmm file describes how AIX manages memory.
These files, and other useful technical information, can be found at:
http://techsupport.services.ibm.com/rs6k/techbrowse/
------------------------------------------------------------------------
Contents

  About this document
  Related information
  Memory bottlenecks
  CPU bottlenecks
  I/O bottlenecks
  SMP performance tuning
  Tuning methodology
  Additional information

------------------------------------------------------------------------
About this document

This document describes how to check for resource bottlenecks and identify the processes that tax them. Resources on a system include memory, CPU, and Input/Output (I/O). This document covers bottlenecks across an entire system. It does not address the bottlenecks of a particular application or general network problems.

The following commands are described:

  * vmstat
  * svmon
  * ps
  * tprof
  * iostat
  * netpmon
  * filemon

NOTE: PAIDE/6000 must be installed in order to use tprof, svmon, netpmon, and filemon. To check if it is installed, run the following command:

  lslpp -l perfagent.tools

If you are at AIX Version 4.3.0 or higher, PAIDE/6000 can be found on the AIX Base Operating System media. Otherwise, to order PAIDE/6000, contact your local IBM representative.

This fax also makes reference to the vmtune and schedtune commands. These commands and their source are found in the /usr/samples/kernel directory. They are installed with the bos.adt.samples fileset.

Related information

Consult Line Performance Analysis - The AIX Support Family offers a system analysis with tuning recommendations. For more information, call the IBM AIX Support Center.

Performance Tuning Guide (SC23-2365) - This IBM publication covers performance monitoring and tuning of AIX systems. Contact your local IBM representative to order.

For detailed system usage on a per-process basis, a free utility called UTLD can be obtained by anonymous ftp from ftp.software.ibm.com in the /aix/tools/perftools/utld directory. For more information, see the README file in /usr/lpp/utld after installation of the utld.obj fileset.
------------------------------------------------------------------------
Memory bottlenecks

The following section describes memory bottleneck solutions with the following commands: vmstat, svmon, ps.

1. vmstat

Run the following command:

  vmstat 1

NOTE: The system may slow down when pi and po are consistently non-zero.

  pi  number of pages per second paged in from paging space
  po  number of pages per second paged out to paging space

When processes on the system require more pages of memory than are available in RAM, working pages may be paged out to paging space and then paged in when they are needed again. Accessing a page from paging space is considerably slower than accessing a page directly from RAM. For this reason, constant paging activity can cause system performance degradation.

NOTE: Memory is over-committed when the fr:sr ratio is high.

  fr  number of pages that must be freed to replenish the free list or to accommodate an active process
  sr  number of pages that must be examined in order to free fr number of pages

An fr:sr ratio of 1:4 means for every one page freed, four pages must be examined. It is difficult to determine a memory constraint based on this ratio alone, and what constitutes a high ratio is workload/application dependent.

The system considers itself to be thrashing when po*SYS > fr, where SYS is a system parameter viewed with the schedtune command. The default value is 0 if a system has 128MB or more, which means that memory load control is disabled. Otherwise, the default is 6. Thrashing occurs when the system spends more time paging than performing work. When this occurs, selected processes may be suspended temporarily, and the system may be noticeably slower.
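As a rough post-processing sketch of the checks above, an awk filter can flag non-zero pi/po values and report the fr:sr ratio. It assumes the standard AIX 4.x vmstat column order (r b avm fre re pi po fr sr cy ...); the function name and thresholds are illustrative, not part of AIX.

```shell
# Sketch: flag paging activity in `vmstat 1` output. Assumes the AIX 4.x
# column order (pi and po are fields 6 and 7; fr and sr are 8 and 9).
# Header lines are skipped by requiring a numeric first field.
check_paging() {
  awk '$1 ~ /^[0-9]+$/ {
    pi = $6; po = $7; fr = $8; sr = $9
    if (pi > 0 || po > 0)
      printf "paging: pi=%d po=%d\n", pi, po
    if (fr > 0)
      printf "fr:sr ratio 1:%.1f\n", sr / fr
  }'
}

# Example with canned vmstat-style output:
check_paging <<'EOF'
kthr     memory              page
 r  b  avm  fre  re  pi  po  fr  sr  cy  in  sy  cs us sy id wa
 1  0 6747 1253   0   4   9  20  80   0 114  10  22  0  1 26  0
EOF
```

In practice this would be fed live data, for example `vmstat 1 10 | check_paging`.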
2. svmon

As root, run the following command:

  # svmon -Pau 10 | more

Sample output:

    Pid Command      Inuse   Pin  Pgspace
  13794 dtwm          1603     1      449

  Pid: 13794
  Command: dtwm

  Segid Type Description           Inuse  Pin Pgspace Address Range
    b23 pers /dev/hd2:24849            2    0       0 0..1
   14a5 pers /dev/hd2:24842            0    0       0 0..2
   6179 work lib data                131    0      98 0..891
   280a work shared library text    1101    0      10 0..65535
    181 work private                287    1     341 0..310:65277..65535
   57d5 pers code,/dev/hd2:61722      82    0       0 0..135

This command lists the top ten memory-using processes and gives a report about each one. In each process report, look where Type = work and Description = private. Check how many 4K (4096-byte) pages are used under the Pgspace column. This is the minimum number of working pages this segment is using in all of virtual memory. A Pgspace number that grows, but never decreases, may indicate a memory leak. Memory leaks occur when an application fails to deallocate memory.

  341 * 4096 = 1,396,736 or 1.4MB of virtual memory

3. ps

Run the following command:

  ps gv | head -n 1; ps gv | egrep -v "RSS" | sort +6b -7 -n -r

  size  amount of memory in KB allocated from page space for the memory segment of Type = work and Description = private for the process, as would be indicated by svmon
  RSS   amount of memory in KB currently in use (in RAM) for the memory segment of Type = work and Description = private, plus the memory segment(s) of Type = pers and Description = code for the process, as would be indicated by svmon
  trs   amount of memory, in KB, currently in use (in RAM) for the memory segment(s) of Type = pers and Description = code for the process, as would be indicated by svmon
  %mem  RSS value divided by the total amount of system RAM in KB, multiplied by 100

------------------------------------------------------------------------
CPU bottlenecks

The following section describes CPU bottleneck solutions using the following commands: vmstat, tprof, ps.
1. vmstat

Run the following command:

  vmstat 1

NOTE: The system may slow down when processes wait on the run queue.

  id  percentage of time the CPU is idle
  r   number of threads on the run queue

If the id value is consistently 0%, the CPU is being used 100% of the time. Look next at the r column to see how many threads are placed on the run queue per second. The higher the number of threads forced to wait on the run queue, the more system performance will suffer.

2. tprof

To find out how much CPU time a process is using, run the following command as root:

  # tprof -x sleep 30

This returns in 30 seconds and creates a file in the current directory called __prof.all. In 30 seconds, the CPU is checked approximately 3000 times. The Total column is the number of times a process was found in the CPU. If one process has 1500 in the Total column, this process has taken 1500/3000, or half, of the CPU time. The tprof output explains exactly what processes the CPU has been running. The wait process runs when no other processes require the CPU and accounts for the amount of idle time on the system.

3. netpmon

To find out how much CPU time a process is using, and how much of that time is spent executing network-related code, run the following command as root:

  # netpmon -o /tmp/netpmon.out -O cpu -v; sleep 30; trcstop

This returns in 30 seconds and creates a file in the /tmp directory called netpmon.out. The CPUTime column indicates the total amount of CPU time for the process. %CPU is the percentage of CPU usage for the process, and Network CPU% is the percentage of total time that the process spent executing network-related code.

4. ps

Run the following commands:

  ps -ef | head -n 1
  ps -ef | egrep -v "UID|0:00|\ 0\ " | sort +3b -4 -n -r

Check the C column for a process penalty for recent CPU usage. The maximum value for this column is 120.

  ps -e | head -n 1
  ps -e | egrep -v "TIME|0:" | sort +2b -3 -n -r

Check the TIME column for process accumulated CPU time.
  ps gu
  ps gu | egrep -v "CPU|kproc" | sort +2b -3 -n -r

Check the %CPU column for process CPU dependency. The percent CPU is the total CPU time divided by the total elapsed time since the process was started.

------------------------------------------------------------------------
I/O bottlenecks

This section describes bottleneck solutions using the following commands: iostat, filemon.

1. iostat

NOTE: High iowait will cause slower performance.

Run the following command:

  iostat 5

  %iowait  percentage of time the CPU is idle while waiting on local I/O
  %idle    percentage of time the CPU is idle while not waiting on local I/O

The time is attributed to iowait when no processes are ready for the CPU but at least one process is waiting on I/O. A high percentage of iowait time indicates that disk I/O is a major contributor to the delay in execution of processes. In general, if system slowness occurs and %iowait is 20% to 25% or higher, investigation of a disk bottleneck is in order.

  %tm_act  percentage of time the disk is busy

NOTE: A high tm_act percentage can indicate a disk bottleneck.

When %tm_act, or time active, for a disk is high, noticeable performance degradation can occur. On some systems, a %tm_act of 35% or higher for one disk can cause noticeably slow performance.

  o Look for busy vs. idle drives. Moving data from more busy to less busy drives may help alleviate a disk bottleneck.
  o Check for paging activity by following the instructions in the "Memory bottlenecks" section. Paging to and from disk will contribute to the I/O load.

2. filemon

To find out what files, logical volumes, and disks are most active, run the following command as root:

  # filemon -u -O all -o /tmp/fmon.out; sleep 30; trcstop

In 30 seconds, a report is created in /tmp/fmon.out.

  o Check for the most active segments, logical volumes, and physical volumes in this report.
  o Check for reads and writes to paging space to determine if the disk activity is true application I/O or is due to paging activity.
  o Check for files and logical volumes that are particularly active. If these are on a busy physical volume, moving some data to a less busy disk may improve performance.

The Most Active Segments report lists the most active files by file system and inode. The mount point of the file system and the inode of the file can be used with the ncheck command to identify unknown files:

  # ncheck -i <inode> <mount point>

This report is useful in determining if the activity is to a file system (segtype = persistent), the JFS log (segtype = log), or to paging space (segtype = working). By examining the reads and read sequences counts, you can determine if the access is sequential or random. As the read sequences count approaches the reads count, file access is more random. The same applies to the writes and write sequences.

------------------------------------------------------------------------
SMP performance tuning

Performance tools

1. SMP only (some SMPs do not support this command)

  cpu_state -l

This displays the current state of each processor (enabled, disabled, or unavailable).

2. AIX tools that have been adapted in order to display more meaningful information on SMP systems

  ps -m -o THREAD
    The BND column will indicate the processor number to which a process/thread is bound, if it is bound.
  pstat -A
    The CPUID column will indicate the processor number to which a process/thread is bound.
  sar -P ALL
    Shows the load on all the processors.
  vmstat
    Displays kthr (kernel threads).
  netpmon -t
    Prints CPU reports on a per-thread basis.

3. Other AIX tools that did not change

  filemon
  iostat
  svmon
  tprof

------------------------------------------------------------------------
Tuning methodology

1. Check availability of processors:  cpu_state -l
2. Check balance between processors:  sar -P ALL
3. Identify bound processes/threads:  ps -m -o THREAD, pstat -A
4. Unbind any bound processes/threads that can and should be unbound.
5. Continue as with a uniprocessor system.
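Returning to the filemon heuristic above (read sequences approaching the reads count indicates random access), a small helper can classify a segment's access pattern. The function name and the factor-of-2 cutoff are illustrative assumptions, not AIX rules; the counts would come from a filemon report.

```shell
# Sketch: classify access as random or sequential from filemon's
# reads vs. read sequences counts. The cutoff of 2 is an assumption.
access_pattern() {
  awk -v reads="$1" -v seqs="$2" 'BEGIN {
    if (seqs > 0 && reads / seqs > 2) print "mostly sequential"
    else print "mostly random"
  }'
}

access_pattern 1000 950   # sequences ~= reads: random access
access_pattern 1000 20    # few long sequences: sequential access
```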
------------------------------------------------------------------------
Additional information

1. KBUFFERS vs. VMM

The Block I/O Buffer Cache (KBUFFERS) is only used when directly accessing a block device, such as /dev/hdisk0. Normal access through the Journaled File System (JFS) is managed by the Virtual Memory Manager (VMM) and does not use the traditional method for caching the data blocks. I/O operations to raw logical volumes or physical volumes do not use the Block I/O Buffer Cache.

2. I/O pacing

Users of AIX occasionally encounter long interactive-application response times when other applications in the system are running large writes to disk. Because most writes are asynchronous, FIFO I/O queues of several megabytes may build up and take several seconds to complete. The performance of an interactive process is severely impacted if every disk read spends several seconds working through the queue.

I/O pacing limits the number of outstanding I/O requests against a file. When a process tries to write to a file whose queue is at the high-water mark (which should be a multiple of 4 plus 1), the process is suspended until enough I/Os have completed to bring the queue for that file to the low-water mark. The delta between the high and low water marks should be kept small. To configure I/O pacing on the system via SMIT, enter the following at the command line as root:

  # smitty chgsys

3. Async I/O

Async I/O is performed in the background and does not block the user process. This improves performance because I/O operations and application processing can run concurrently. However, applications must be specifically written to take advantage of Async I/O, which is managed by the aio daemons running on the system. To configure Async I/O for the system via SMIT, enter the following at the command line as root:

  # smitty aio

[ TechDocs Ref: 90605198014824  Publish Date: Dec. 15, 1999  4FAX Ref: 2445 ]
------------------------------------------------------------------------
Contents

  About this document
  Related documentation
  About vmstat
  Summary statistics

------------------------------------------------------------------------
About this document

This document provides an overview of the output of the vmstat command. This information applies to AIX Versions 4.x.

Related documentation

The fields produced by the -s and -f flags and the [Drives] parameter of vmstat are fully documented in the AIX Performance Tuning Guide, publication number SC23-2365, and in the online product documentation. The AIX and RS/6000 product documentation library is also available at:

  www.rs6000.ibm.com/library/

------------------------------------------------------------------------
About vmstat

Although a system may have sufficient real resources, it may perform below expectations if logical resources are not allocated properly. Use vmstat to determine real and logical resource utilization. It samples kernel tables and counters, normalizes the results, and presents them in an appropriate format. By default, vmstat sends its report to standard output, but it can be run with the output redirected.

vmstat is normally invoked with an interval and a count specified. The interval is the length of time in seconds over which vmstat is to gather and report data. The count is the number of intervals to run. If no parameters are specified, vmstat reports a single record of statistics for the time since the system was booted. There may have been inactivity or fluctuations in the workload, so the results may not represent current activity. Be aware that the first record in the output presents statistics since the last boot (except when invoked with the -f or -s option). In many instances, this data can be ignored.

vmstat reports statistics about processes, virtual memory, paging activity, faults, CPU activity, and disk transfers.
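Since the first data record reports since-boot averages that can usually be ignored, a small filter can drop it before further processing. This is a sketch; the function name is illustrative, and it assumes header lines begin with a non-numeric field, as in standard vmstat output.

```shell
# Sketch: drop the first data record (since-boot statistics) from vmstat
# output, keeping headers and all subsequent interval records.
drop_first_record() {
  awk '$1 !~ /^[0-9]+$/ { print; next }   # pass headers through
       { if (seen++) print }'             # skip the first data line only
}

# Example with canned output (the since-boot record is the 6747 line):
drop_first_record <<'EOF'
 r  b  avm   fre re pi po fr sr cy  in  sy cs us sy id wa
 0  0 6747  1253  0  0  0  0  0  0 114  10 22  0  1 26  0
 1  0 6748  1200  0  0  0  0  0  0 113 118 43 17  4 79  0
EOF
```

Typical usage would be `vmstat 5 10 | drop_first_record`.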
Options and parameters recognized by this tool are indicated by the usage prompt:

  vmstat [-fs] [Drives] [Interval] [Count]

The following figure lists output where the smallest work unit is called a kernel thread (kthr). The r and b under this column represent the number of "threads", not processes, placed on these queues.

  --------------------------------------------------------------------
  kthr     memory              page               faults        cpu
  ----- ----------- ----------------------- ------------- -----------
   r  b   avm   fre re pi po fr sr cy        in  sy  cs   us sy id wa
   0  0  6747  1253  0  0  0  0  0  0       114  10  22    0  1 26  0
   1  0  6747  1253  0  0  0  0  0  0       113 118  43   17  4 79  0
   0  0  6747  1253  0  0  0  0  0  0       118  99  33    8  3 89  0
  --------------------------------------------------------------------
  Figure: Sample output from vmstat 1 3

kthr

The columns under the kthr heading in the output provide information about the average number of threads on various queues.

r

The r column indicates the average number of kernel threads on the run queue at one-second intervals. This field indicates the number of "run-able" threads. The system counts the number of ready-to-run threads once per second and adds that number to an internal counter. vmstat then subtracts the initial value of this counter from the end value and divides the result by the number of seconds in the measurement interval. This value is typically less than five with a stable workload. If this value increases rapidly, look for an application problem. If there are many threads (especially CPU-intensive ones) competing for the CPU resource, it is quite possible they will be scheduled in round-robin fashion. If each one executes for a complete or partial time slice, the number of "run-able" threads could easily exceed 100.

b

The b column shows the average number of kernel threads on the wait queue at one-second intervals (awaiting resource, awaiting input/output).
Kernel threads are placed on the wait queue when they are scheduled for execution but are waiting for one of their process pages to be paged in. Once a second, the system counts the threads waiting and adds that number to an internal counter. vmstat then subtracts the initial value from the end value and divides the result by the number of seconds in the measurement interval. This value is usually near zero. Do not confuse this with wa -- waiting on input/output (I/O).

NOTE: On an SMP system, there will be an additional blocked process shown in the b column. This is for the lrud kproc that is part of the Virtual Memory Manager's (VMM) page-replacement algorithm. Also, on a system with a compressed journaled file system (JFS) mounted, there will be an additional blocked process: the jfsc kproc.

memory

The information under the memory heading provides information about real and virtual memory.

avm

The avm column gives the average number of pages allocated to paging space. (In AIX, a page contains 4096 bytes of data.) When a process executes, space for working storage is allocated on the paging devices (backing store). This can be used to calculate the amount of paging space assigned to executing processes. The number in the avm field divided by 256 yields the number of megabytes (MB), systemwide, allocated to paging space. The lsps -a command also provides information on individual paging spaces.

It is recommended that enough paging space be configured on the system so that the paging space used does not approach 100 percent. When fewer than 128 unallocated pages remain on the paging devices, the system will begin to kill processes to free some paging space.

Versions of AIX before 4.3.2 allocated paging space blocks for pages of memory as the pages were accessed. On a large-memory machine, where the application set is such that paging is never or rarely required, these paging space blocks were allocated but never needed.
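The avm-to-megabytes rule above (4096-byte pages, so 256 pages per MB) can be sketched as a one-line helper; the function name is illustrative.

```shell
# Sketch: convert a vmstat avm page count into systemwide MB of
# allocated paging space, per the avm/256 rule (4KB pages).
avm_to_mb() {
  awk -v avm="$1" 'BEGIN { printf "%.1f MB\n", avm / 256 }'
}

avm_to_mb 6747   # the avm value from the sample figure above
```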
AIX Version 4.3.2 implements deferred paging space allocation, in which paging space blocks are not allocated until paging is necessary, thus helping reduce the paging space requirements of the system. The avm value in vmstat indicates the number of virtual memory (working storage) pages that have been accessed but not necessarily paged out. With the previous policy of "late page space allocation", avm had the same definition. However, since the VMM would allocate a paging space disk block for each working page that was accessed, the number of paging space blocks was equal to avm. The reason for allocating paging space blocks at the time the working pages are accessed is so that if the pages had to be paged out of memory, there would be disk blocks available on the page space logical volumes for the in-memory pages to go to. On systems that never page out to paging space, it is a waste of disk space to have as many page space disk blocks as there is memory. With the deferred policy, page space disk blocks are only allocated for the pages that do need to be paged out.

The avm number will grow as more processes get started and/or existing processes use more working storage. Likewise, the number will shrink as processes exit and/or free working storage.

fre

The fre column shows the average number of free memory frames. A frame is a 4096-byte area of real memory. The system maintains a buffer of memory frames, called the free list, that will be readily accessible when the VMM needs space. The nominal size of the free list varies depending on the amount of real memory installed. On systems with 64MB of memory or more, the minimum value (MINFREE) is 120 frames. For systems with less than 64MB, the value is two times the number of MB of real memory, minus 8. For example, a system with 32MB would have a MINFREE value of 56 free frames.
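The MINFREE sizing rule above can be expressed directly; this is a sketch of the rule as stated in the text, with an illustrative function name.

```shell
# Sketch of the MINFREE rule: 120 frames at 64MB of RAM or more,
# otherwise 2 * (MB of RAM) - 8 frames.
minfree() {
  mb="$1"
  if [ "$mb" -ge 64 ]; then
    echo 120
  else
    echo $(( 2 * mb - 8 ))
  fi
}

minfree 32    # the 32MB example from the text
minfree 128
```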
If the fre value is substantially above the MAXFREE value (which is defined as MINFREE plus 8), it is unlikely that the system is thrashing (continuously paging in and out). However, if the system is thrashing, be assured that the fre value will be small. Most UNIX and AIX operating systems use nearly all available memory for disk caching, so you need not be alarmed if the fre value oscillates between MINFREE and MAXFREE.

page

The information under the page heading includes information about page faults and paging activity.

re

The re column shows the number (rate) of pages reclaimed. Reclaimed pages can satisfy an address translation fault without initiating a new I/O request (the page is still in memory). This includes pages that have been put on the free list but are accessed again before they are reassigned. It also includes pages previously requested by the VMM for which I/O has not yet completed, and those pre-fetched by the VMM's read-ahead mechanism but hidden from the faulting segment.

pi

The pi column details the number (rate) of pages paged in from paging space. Paging space is the part of virtual memory that resides on disk. It is used as an overflow when memory is overcommitted. Paging space consists of logical volumes dedicated to the storage of working set pages that have been stolen from real memory. When a stolen page is referenced by the process, a page fault occurs and the page must be read into memory from paging space.

There is no "good" number for this due to the variety of configurations of hardware, software, and applications. One theory is that five page-ins per second should be the upper limit. Use this theoretical maximum as a reference but do not adhere to it rigidly. This field is important as a key indicator of paging space activity. If a page-in occurs, there must have been a previous page-out for that page.
It is also likely in a memory-constrained environment that each page-in will force a different page to be stolen and, therefore, paged out.

po

The po column shows the number (rate) of pages paged out to paging space. Whenever a page of working storage is stolen, it is written to paging space. If it is not referenced again, it remains on the paging device until the process terminates or disclaims the space. Subsequent references to addresses contained within the faulted-out pages result in page faults, and the pages are paged in individually by the system. When a process terminates normally, any paging space allocated to that process is freed. If the system is reading in a significant number of persistent pages, you may see an increase in po without corresponding increases in pi. This situation does not necessarily indicate thrashing but may warrant investigation into the data access patterns of the applications.

fr

The fr column details the number (rate) of pages freed. As the VMM page-replacement routine scans the Page Frame Table (PFT), it uses criteria to select which pages are to be stolen to replenish the free list of available memory frames. The total pages stolen by the VMM -- both working (computational) and file (persistent) pages -- are reported as a rate per second. Just because a page has been freed does not mean that any I/O has taken place. For example, if a persistent storage (file) page has not been modified, it will not be written back to the disk. If I/O is not necessary, minimal system resources are required to free a page.

sr

The sr column details the number (rate) of pages scanned by the page-replacement algorithm. The VMM page-replacement code scans the PFT and steals pages until the number of frames on the free list is at least the MAXFREE value. The page-replacement code may have to scan many entries in the Page Frame Table before it can steal enough to satisfy the free list requirements.
With stable, unfragmented memory, the scan rate and free rate may be nearly equal. On systems with multiple processes using many different pages, the pages are more volatile and disjointed. In this scenario, the scan rate may greatly exceed the free rate.

cy

The cy column provides the rate of complete scans of the Page Frame Table. It shows how many times (per second) the page-replacement code has scanned the PFT. Since the free list can be replenished without a complete scan of the PFT, and because all of the vmstat fields are reported as integers, this field is usually zero.

faults

The information under the faults heading in the vmstat output provides information about process control.

in

The in column shows the number (rate) of device interrupts. This column shows the number of hardware or device interrupts (per second) observed over the measurement interval. Examples of interrupts are disk request completions and the 10-millisecond clock interrupt. Since the latter occurs 100 times per second, the in field is always greater than 100.

sy

The sy column details the number (rate) of system calls. Resources are available to user processes through well-defined system calls. These calls instruct the kernel to perform operations for the calling process and exchange data between the kernel and the process. Since workloads and applications vary and different calls perform different functions, it is impossible to say how many system calls per second are too many.

cs

The cs column shows the number (rate) of context switches. The physical CPU resource is subdivided into logical time slices of 10 milliseconds each. Assuming a process is scheduled for execution, it will run until its time slice expires, it is preempted, or it voluntarily gives up control of the CPU. When another process is given control of the CPU, the context, or working environment, of the previous process must be saved and the context of the current process must be loaded.
AIX has a very efficient context-switching procedure, so each switch is inexpensive in terms of resources. Any significant increase in context switches is cause for further investigation.

cpu

The information under the cpu heading in the vmstat output provides a breakdown of CPU usage.

us

The us column shows the percent of CPU time spent in user mode. Processes execute in user mode or system (kernel) mode. When in user mode, a process executes within its code and does not require kernel resources to perform computations, manage memory, set variables, and so on.

sy

The sy column details the percent of CPU time spent in system mode. If a process needs kernel resources, it must execute a call and go into system mode to make that resource available. I/O to a drive, for example, requires a call to open the device, seek, and read and write data. This field shows the percent of time the CPU was in system mode.

Optimum use would have the CPU working 100 percent of the time. This holds true in the case of a single-user system with no need to share the CPU. Generally, if us+sy time is below 90 percent, a single-user system is not considered CPU constrained. However, if us+sy time on a multi-user system exceeds 80 percent, the processes may spend time waiting in the run queue. Response time and throughput might suffer.

id

The id column shows the percent of time the CPU was idle with no pending local disk I/O. If there are no processes available for execution (the run queue is empty), the system dispatches a process called wait. The ps report (with the -k or g option) identifies this as kproc with a process ID (PID) of 516. Do not worry if your ps report shows a high aggregate time for this process. It means you have had significant periods of time when no other processes could run. If there are no I/Os pending to a local disk, all time charged to the wait process is classified as idle time.

wa

The wa column details CPU idle time (percent) with pending local disk I/O.
If there is at least one outstanding I/O to a local disk when the wait process is running, the time is classified as "waiting on I/O". A wa value over 40 percent could indicate that the disk subsystem may not be balanced properly, or it may be the result of a disk-intensive workload. If there is only one process available for execution -- often the case on a technical workstation -- there may be no way to avoid waiting on I/O.

NOTE: The wa column on SMP machines running AIX Version 4.3.2 or earlier is somewhat exaggerated. This is due to the method used in calculating wio.

Method used in AIX 4.3.2 and earlier AIX versions: At each clock interrupt on each processor (100 times a second in AIX), a determination is made as to which of four categories (usr/sys/wio/idle) to place the last 10 milliseconds of time. If the CPU was busy in usr mode at the time of the clock interrupt, the usr category gets the clock tick. If the CPU was busy in kernel mode at the time of the clock interrupt, the sys category gets the tick. If the CPU was not busy, a check is made to see if any I/O to disk is in progress. If any disk I/O is in progress, the wio category is incremented. If no disk I/O is in progress and the CPU is not busy, the idle category gets the tick.

------------------------------------------------------------------------
Summary statistics

vmstat with the -s option reports absolute counts of various events since the system was booted. There are 23 separate events reported in the vmstat -s output; the following 4 have proven most helpful. The 19 remaining fields contain a variety of activities, from address translation faults to lock misses to system calls. The information in those 19 fields is also valuable but is less frequently used.

page ins

The page ins field shows the number of systemwide page-ins. When a page is read from disk to memory, this count is incremented.
It is a count of VMM-initiated read operations and, together with the page outs field, represents the real I/O (disk reads and writes) initiated by the VMM.

page outs

The page outs field shows the number of systemwide page-outs. When a page is written from memory to disk, this count is incremented. The page outs field value is a total count of VMM-initiated write operations and, together with the page ins field, represents the total amount of real I/O initiated by the VMM.

paging space page ins

The paging space page ins field is the count of ONLY pages read from paging space.

paging space page outs

The paging space page outs field is the count of ONLY pages written to paging space.

Using the summary statistics

The four preceding fields can be used to indicate how much of the system's I/O is for persistent storage. If the value for paging space page ins is subtracted from the (systemwide) value for page ins, the result is the number of pages that were read from persistent storage (files). Likewise, if the value for paging space page outs is subtracted from the (systemwide) value for page outs, the result is the number of persistent pages (files) that were written to disk.

Remember that these counts apply to the time since system initialization. If you need counts for a given time interval, execute vmstat -s at the start of the interval and again at the end. The deltas between like fields of the two reports give the counts for the interval. It is easiest to redirect the output of the reports to a file and then perform the math.

[ TechDocs Ref: 90605226914708  Publish Date: Nov. 04, 1999  4FAX Ref: 6220 ]
Contents

About this document
Syntax
TTY statistics in the iostat output
CPU statistics in the iostat output
Disk statistics in the iostat output
Analyzing the data
Conclusions

About this document

This document is based on "Performance Tuning: A Continuing Series -- The iostat Tool", by Barry Saad, from the January/February 1994 issue of AIXTRA: IBM'S MAGAZINE FOR AIX PROFESSIONALS. This article discusses iostat and how it can identify I/O-subsystem and CPU bottlenecks. iostat works by sampling the kernel's address space and extracting data from various counters that are updated every clock tick (1 clock tick = 10 milliseconds [ms]). The results -- covering TTY, CPU, and I/O subsystem activity -- are reported as per-second rates or as absolute values for the specified interval. This document applies to AIX Versions 4.x.

------------------------------------------------------------------------

Syntax

Normally, iostat is issued with both an interval and a count specified, with the report sent to standard output or redirected to a file. The command syntax appears below:

iostat [-t] [-d] [Drives] [Interval [Count]]

The -d flag causes iostat to provide only disk statistics for all drives. The -t flag causes iostat to provide only system-wide TTY and CPU statistics.

NOTE: The -t and -d options are mutually exclusive.

If you specify one or more drives, the output is limited to those drives. Multiple drives can be specified; separate them in the syntax with spaces.

You can specify a time in seconds for the interval between records to be included in the reports. The initial record contains statistics for the time since system boot. Succeeding records contain data for the preceding interval. If no interval is specified, a single record is generated. If you specify an interval, the count of the number of records to be included in the report can also be specified. If you specify an interval without a count, iostat will continue running until it is killed.
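The flags and operands above combine as in the following sketch. The drive name hdisk0 is a hypothetical example, and each command is wrapped in an echo so the sketch is harmless on systems without AIX iostat; on a real system you would run the commands directly.

```shell
#!/bin/sh
# Illustrative iostat command lines built from the syntax above.
# run() only echoes the command; on AIX, drop the wrapper.
run() { echo "would run: $*"; }

run iostat 5 12          # 12 records, one every 5 seconds (first is since boot)
run iostat -d hdisk0 2   # disk-only statistics for hdisk0 until killed
run iostat -t 2 10       # system-wide TTY and CPU statistics only
```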
tty:      tin     tout    avg-cpu:  % user   % sys   % idle   % iowait
          2.2     3.3               0.4      1.3     97.7     0.6

Disks:    % tm_act    Kbps    tps    Kb_read    Kb_wrtn
hdisk0    0.4         1.1     0.3    117675     1087266
hdisk1    0.3         1.0     0.2    59230      1017734
hdisk2    0.0         0.2     0.0    180189     46832
cd0       0.0         0.0     0.0    0          0

tty:      tin     tout    avg-cpu:  % user   % sys   % idle   % iowait
          2.2     3.3               0.4      1.3     97.7     0.6

Disks:    % tm_act    Kbps    tps    Kb_read    Kb_wrtn
hdisk0    0.4         1.1     0.3    117675     1087266
hdisk1    0.3         1.0     0.2    59230      1017734
hdisk2    0.0         0.2     0.0    180189     46832
cd0       0.0         0.0     0.0    0          0

Figure 1. Sample Output from iostat 2 2

The following sections explain the output.

------------------------------------------------------------------------

TTY statistics in the iostat output

The two columns of TTY information (tin and tout) in the iostat output show the number of characters read and written by all TTY devices, including both real and pseudo TTY devices. Real TTY devices are those connected to an asynchronous port. Some "pseudo TTY devices" are shells, telnet sessions, and aixterms.

* tin shows the total characters per second read by all TTY devices.
* tout indicates the total characters per second written to all TTY devices.

Generally there are fewer input characters than output characters. For example, assume you run the following:

iostat -t 1 30
cd /usr/sbin
ls -l

You will see few input characters and many output characters. On the other hand, applications such as vi result in a smaller difference between the number of input and output characters. Analysts using modems for asynchronous file transfer may notice the number of input characters exceeding the number of output characters. Naturally, this depends on whether the files are being sent or received relative to the measured system.
Since the processing of input and output characters consumes CPU resource, look for a correlation between increased TTY activity and CPU utilization. If such a relationship exists, evaluate ways to improve the performance of the TTY subsystem. Steps that could be taken include changing the application program, modifying TTY port parameters during file transfer, or perhaps upgrading to a faster or more efficient asynchronous communications adapter.

------------------------------------------------------------------------

CPU statistics in the iostat output

The CPU statistics columns (% user, % sys, % idle, and % iowait) provide a breakdown of CPU usage. This information is also reported in the output of the vmstat command in the columns labeled us, sy, id, and wa.

% user
The % user column shows the percentage of CPU resource spent in user mode. A UNIX process can execute in user or system mode. When in user mode, a process executes within its own code and does not require kernel resources.

% sys
The % sys column shows the percentage of CPU resource spent in system mode. This includes CPU resource consumed by kernel processes (kprocs) and others that need access to kernel resources. For example, the reading or writing of a file requires kernel resources to open the file, seek to a specific location, and read or write data. A UNIX process accesses kernel resources by issuing system calls.

Typically, the system is CPU bound (the CPU is the pacing resource) if the sum of user and system time exceeds 90 percent of CPU resource on a single-user system or 80 percent on a multi-user system. This condition means that the CPU is the limiting factor in system performance. The ratio of user to system mode is determined by workload and is more important when tuning an application than when evaluating performance.

A key factor when evaluating CPU performance is the size of the run queue (provided by the vmstat command).
In general, as the run queue increases, users will notice degradation (an increase) in response time.

% idle
The % idle column shows the percentage of CPU time spent idle, or waiting, without pending local disk I/O. If there are no processes on the run queue, the system dispatches a special kernel process called wait. On most AIX systems, the wait process ID (PID) is 516.

% iowait
The % iowait column shows the percentage of time the CPU was idle with pending local disk I/O. The iowait state is different from the idle state in that at least one process is waiting for local disk I/O requests to complete. Unless the process is using asynchronous I/O, an I/O request to disk causes the calling process to block (or sleep) until the request is completed. Once a process's I/O request completes, it is placed on the run queue.

In general, a high iowait percentage indicates the system has a memory shortage or an inefficient I/O subsystem configuration. Understanding the I/O bottleneck and improving the efficiency of the I/O subsystem require more data than iostat can provide. However, typical solutions might include:

* limiting the number of active logical volumes and file systems placed on a particular physical disk (The idea is to balance file I/O evenly across all physical disk drives.)
* spreading a logical volume across multiple physical disks (This is useful when a number of different files are being accessed.)
* creating multiple JFS logs for a volume group and assigning them to specific file systems (This is beneficial for applications that create, delete, or modify a large number of files, particularly temporary files.)
* backing up and restoring file systems to reduce fragmentation (Fragmentation causes the drive to seek excessively and can be a large portion of overall response time.)
* adding additional drives and rebalancing the existing I/O subsystem

On systems running a primary application, a high I/O wait percentage may be related to workload.
In this case, there may be no way to overcome the problem. On systems with many processes, some will be running while others wait for I/O. In this case, the iowait can be small or zero because running processes "hide" the wait time. Even though iowait is low, a bottleneck may still limit application performance. To understand the I/O subsystem thoroughly, examine the statistics in the next section.

NOTE: The %iowait column on SMP machines running AIX Version 4.3.2 or earlier is somewhat exaggerated. This is due to the method used in calculating wio.

Method used in AIX 4.3.2 and earlier AIX versions

At each clock interrupt on each processor (100 times a second in AIX), a determination is made as to which of four categories (usr/sys/wio/idle) should receive the last 10 ms of time. If the CPU was busy in usr mode at the time of the clock interrupt, then usr gets the clock tick added into its category. If the CPU was busy in kernel mode at the time of the clock interrupt, then the sys category gets the tick. If the CPU was NOT busy, a check is made to see if any I/O to disk is in progress. If any disk I/O is in progress, the wio category is incremented. If NO disk I/O is in progress and the CPU is not busy, the idle category gets the tick.

------------------------------------------------------------------------

Disk statistics in the iostat output

The disk statistics portion of the iostat output provides a breakdown of I/O usage. This information is useful in determining whether a physical disk is limiting performance.

Disk I/O history

The system maintains a history of disk activity by default. Note that history is disabled if you see the message:

Disk history since boot not available.

This message displays only in the first output record from iostat. Disk I/O history should be enabled, since the CPU resource used in maintaining it is insignificant.
History-keeping can be disabled or enabled in SMIT under the following path:

-> System Environments
   -> Change/Show Characteristics of Operating System
      -> Continuously maintain DISK I/O history -> true | false

Choose true to enable history-keeping or false to disable it.

Disks
The Disks: column shows the names of the physical volumes. They are either hdisk or cd followed by a number. (hdisk0 and cd0 refer to the first physical disk drive and the first CD drive, respectively.)

% tm_act
The % tm_act column shows the percentage of time the volume was active. This is the primary indicator of a bottleneck. A drive is active during data transfer and command processing, such as seeking to a new location. The disk-use percentage is directly proportional to resource contention and inversely proportional to performance. As disk use increases, performance decreases and response time increases. In general, when a disk's use exceeds 70 percent, processes are waiting longer than necessary for I/O to complete because most UNIX processes block (or sleep) while waiting for their I/O requests to complete.

Kbps
Kbps shows the amount of data read from and written to the drive in KBs per second. This is the sum of Kb_read plus Kb_wrtn, divided by the number of seconds in the reporting interval.

tps
tps reports the number of transfers per second. A transfer is an I/O request at the device driver level.

Kb_read
Kb_read reports the total data (in KBs) read from the physical volume during the measured interval.

Kb_wrtn
Kb_wrtn shows the amount of data (in KBs) written to the physical volume during the measured interval.

------------------------------------------------------------------------

Analyzing the data

Taken alone, none of the preceding fields has an unacceptable value, because the statistics are too closely related to application characteristics, system configuration, and types of physical disk drives and adapters.
Therefore, when evaluating data, you must look for patterns and relationships. The most common relationship is between disk utilization and data transfer rate. To draw any valid conclusions from this data, you must understand the application's disk data access patterns -- sequential, random, or a combination -- and the type of physical disk drives and adapters on the system. For example, if an application reads and writes sequentially, you should expect a high disk-transfer rate when you have a high disk-busy rate. (NOTE: Kb_read and Kb_wrtn can confirm an understanding of an application's read and write behavior, but they provide no information on the data access patterns.)

Generally you do not need to be concerned about a high disk-busy rate as long as the disk-transfer rate is also high. However, if you get a high disk-busy rate and a low data-transfer rate, you may have a fragmented logical volume, file system, or individual file.

What is a high data-transfer rate? That depends on the disk drive and the effective data-transfer rate for that drive. You should expect numbers between the effective sequential and effective random disk-transfer rates. Below is a chart of effective transfer rates for several common SCSI-1 and SCSI-2 disk drives.

Table 1. Effective Transfer Rate (KB/sec), Part 1 of 2

TYPE OF ACCESS      400 MB DRIVE   670 MB DRIVE   857 MB DRIVE
Read-Sequential         1589           1525           2142
Read-Random              241            172            262
Write-Sequential        1185           1108           1588
Write-Random             327            275            367

Table 2. Effective Transfer Rate (KB/sec), Part 2 of 2

TYPE OF ACCESS      1.2GB DRIVE   1.37GB DRIVE   1.2GB DRIVE   1.37GB S-2 DRIVE
Read-Sequential         2169           2667          2180            3123
Read-Random              292            299           385             288
Write-Sequential        1464           2189          2156            2357
Write-Random             362            491           405             549

The transfer rates were determined during performance testing and give more accurate expectations of disk performance than the media-transfer rate, which reflects the hardware capability and does not account for operating system and application overhead.
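The high-busy/low-transfer pattern described above can be checked mechanically. In this sketch the report lines and the 500 KB-per-second cutoff are hypothetical; the fields mimic the iostat disk columns (% tm_act, Kbps, tps, Kb_read, Kb_wrtn).

```shell
#!/bin/sh
# Flag drives that are busy more than 70 percent of the time yet move
# little data -- a possible sign of fragmentation. Sample data only.
cat > disks.txt <<'EOF'
hdisk0   85.0    120.5    60.2    602500         0
hdisk1   20.0    800.0    90.1   4000000         0
hdisk2   92.0   1500.0   180.3   7500000         0
EOF
suspects=$(awk '$2 > 70 && $3 < 500 { print $1 }' disks.txt)
echo "high busy, low transfer: $suspects"
rm -f disks.txt
```

Here hdisk2 is busy but also moving a lot of data, so only hdisk0 is flagged for further investigation.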
Another use of the data is to answer the question: "Do I need another SCSI adapter?" If you've ever been asked this question, you probably provided a generic answer or just plain guessed. You can use data captured by iostat to answer the question accurately by tracking transfer rates, finding the maximum data-transfer rate for each disk.

Assume that the maximum rate occurs simultaneously for all drives (the worst case). For maximum aggregate performance, the measured transfer rates for drives attached to a given adapter must be below the effective SCSI adapter throughput rating. For planning purposes, you should use 70 percent of the adapter's rated throughput (for example, 2.8 MB per second for a SCSI-1 adapter). This percentage should provide a sufficient buffer for occasional peak rates that may occur. When adding a drive, you must assume the data-transfer rate. At least you will have the collected data and the effective transfer rates to use as a basis.

Keep in mind that the SCSI adapter may be saturated if the data-transfer rates over multiple intervals approach the effective SCSI adapter throughput rating. In that case, the preceding analysis is invalid.

------------------------------------------------------------------------

Conclusions

The primary purpose of the iostat tool is to detect I/O bottlenecks by monitoring disk utilization (the %tm_act field). iostat can also be used to identify CPU problems, assist in capacity planning, and provide insight into solving I/O problems. Armed with both vmstat and iostat, you can capture the data required to identify performance problems related to CPU, memory, and I/O subsystems.

[ TechDocs Ref: 90605205314704 Publish Date: Oct. 15, 1999 4FAX Ref: 9779 ]
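The adapter-sizing arithmetic described in the preceding section can be sketched as follows. The per-drive maxima are hypothetical; the 4000 KB-per-second rating corresponds to the SCSI-1 example above, 70 percent of which is the 2.8 MB-per-second planning figure.

```shell
#!/bin/sh
# Compare the sum of each drive's maximum measured transfer rate
# against 70 percent of the adapter's rated throughput.
rated_kbps=4000                        # SCSI-1 adapter rating
ceiling=$(( rated_kbps * 70 / 100 ))   # 2800 KB/sec planning ceiling
total=$(( 900 + 750 + 1300 ))          # hypothetical per-drive maxima
if [ "$total" -le "$ceiling" ]; then
  echo "aggregate ${total} KB/sec fits under the ${ceiling} KB/sec ceiling"
else
  echo "aggregate ${total} KB/sec exceeds the ${ceiling} KB/sec ceiling; consider another adapter"
fi
```

In this made-up case the 2950 KB/sec aggregate exceeds the 2800 KB/sec ceiling, so the worst-case load argues for spreading the drives across another adapter.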
------------------------------------------------------------------------

Contents

About this document
VMM overview
Real-memory management
Free list
Persistent vs. working memory segments
Paging space and virtual memory
VMM memory load control facility
VMSTAT's avm field
VMSTAT's fre field
How the system is using memory
Explanation of svmon output

------------------------------------------------------------------------

About this document

This document addresses how RAM and paging space are used. This information applies to AIX Version 4.x.

------------------------------------------------------------------------

VMM overview

The Virtual Memory Manager (VMM) services memory requests from the system and its applications. Virtual-memory segments are partitioned into units called pages; each page is either located in physical memory (RAM) or stored on disk until it is needed. AIX uses virtual memory in order to address more memory than is physically available in the system. The management of memory pages in RAM or on disk is handled by the VMM.

------------------------------------------------------------------------

Real-memory management

In AIX, virtual-memory segments are partitioned into 4096-byte units called pages. Real memory is divided into 4096-byte page frames. The VMM has two major functions: 1) manage the allocation of page frames, and 2) resolve references to virtual-memory pages that are not currently in RAM (stored in paging space) or do not yet exist. In order to accomplish its task, the VMM maintains a free list of available page frames. The VMM also uses a page-replacement algorithm to determine which virtual-memory pages currently in RAM will have their page frames reassigned to the free list. The page-replacement algorithm takes into account the existence of persistent vs. working segments, repaging, and VMM thresholds.
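The 4096-byte partitioning above implies a simple relationship between RAM size and the number of page frames. A minimal sketch, using a hypothetical 256 MB machine:

```shell
#!/bin/sh
# Number of 4096-byte page frames provided by a given amount of RAM.
ram_mb=256
page_size=4096
frames=$(( ram_mb * 1024 * 1024 / page_size ))
echo "${ram_mb} MB of RAM = ${frames} page frames"
```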
------------------------------------------------------------------------

Free list

The VMM maintains a list of free page frames that it uses to satisfy page faults. The free list is made up of unallocated page frames. AIX tries to use all of RAM all of the time, except for a small amount which it maintains on the free list. To maintain this small amount of unallocated pages, the VMM uses page-outs and page steals to free up space and reassign those page frames to the free list. The virtual-memory pages whose page frames are to be reassigned are selected via the VMM's page-replacement algorithm.

------------------------------------------------------------------------

Persistent vs. working memory segments

AIX distinguishes between different types of memory segments, and to understand the Virtual Memory Manager, it is important to understand the difference between working and persistent segments.

A persistent segment has a permanent storage location on disk. Files containing data or executable programs are mapped to persistent segments. When a Journaled File System (JFS) file is opened and accessed, the file data is copied into RAM. VMM parameters control when physical memory frames allocated to persistent pages should be overwritten and used to store other data.

Working segments are transitory, exist only during their use by a process, and have no permanent disk storage location. Process stack and data regions are mapped to working segments, as are shared-library text segments. Pages of working segments must also have disk storage locations to occupy when they cannot be kept in real memory. The disk paging space is used for this purpose. When a program exits, all of its working pages are placed back on the free list immediately.

------------------------------------------------------------------------

Paging space and virtual memory

Working pages in RAM that can be modified and paged out are assigned a corresponding slot in paging space.
The allocated paging space is only used if the page needs to be paged out. However, an allocated page in paging space cannot be used by another page. It remains reserved for a particular page for as long as that page exists in virtual memory. Since persistent pages are paged out to the same disk locations from which they came, paging space does not need to be allocated for persistent pages residing in RAM.

The VMM has three modes for allocating paging space: early paging space allocation, late paging space allocation, and deferred paging space allocation. The early allocation policy reserves paging space whenever a memory request for a working page is made. The late allocation policy only assigns paging space when the working page is referenced. AIX Versions 4.3.2 and later implement a deferred paging space allocation policy: paging space blocks are not allocated until the working pages are actually paged out of memory. This significantly reduces the paging space requirements of the system.

------------------------------------------------------------------------

VMM memory load control facility

When a process references a virtual-memory page that is on disk, because it either has been paged out or has never been read, the referenced page must be paged in, and this may cause one or more pages to be paged out if the number of available free page frames is low. The VMM attempts to steal page frames that have not been recently referenced, and thus are unlikely to be referenced in the near future, via the page-replacement algorithm.

A successful page-replacement strategy keeps the memory pages of all currently active processes in RAM, while the memory pages of inactive processes are paged out. However, when RAM is over-committed, it becomes difficult to choose pages for page-out, because almost any page is likely to be referenced in the near future by the currently running processes.
The result is that pages that will soon be referenced still get paged out and then paged in again later. When this happens, continuous paging in and paging out may occur if RAM is over-committed. This condition is called thrashing. The system spends most of its time paging in and paging out instead of executing useful instructions, and none of the active processes make any significant progress. The VMM has a memory load control algorithm that detects when the system is thrashing and then attempts to correct the condition.

------------------------------------------------------------------------

VMSTAT's avm field

avm stands for "Active Virtual Memory", not "Available Memory". The avm value in vmstat indicates the number of virtual-memory pages that have been accessed but not necessarily paged out. Under the earlier late paging space allocation policy, avm had the same definition, but because the VMM allocated a paging space disk block for each working page that was accessed, the number of allocated paging space blocks was equal to avm. With the deferred policy, paging space disk blocks are only allocated for the pages that need to be paged out. The avm number will grow as more processes get started and/or existing processes use more working storage. Likewise, the number will shrink as processes exit and/or free working storage.

------------------------------------------------------------------------

VMSTAT's fre field

fre is the number of 4K pages that are currently on the free list. When an application terminates, all of its working pages are immediately returned to the free list. Its persistent pages, however, remain in RAM and are not added back to the free list until they are stolen by the VMM for other programs. Persistent pages are also freed if the corresponding file is deleted. For this reason, the fre value may not reflect all the real memory that can be made readily available for use by processes.
If a page frame is needed, persistent pages related to terminated applications are among the first to be handed over to another program.

The minimum number of pages that the Virtual Memory Manager keeps on the free list is determined by the minfree parameter of vmtune. If the number of pages on the free list drops below minfree, the Virtual Memory Manager steals pages until the free list has been restored to the maxfree value.

------------------------------------------------------------------------

How the system is using memory

The svmon command can be used to determine roughly how much memory the system is using.

NOTE: PAIDE/6000 must be installed in order to use svmon. To check whether it is installed, execute the following command: $ lslpp -l perfagent.tools. If you are at AIX Version 4.3.0 or higher, this fileset can be found on the AIX Base Operating System media. Otherwise, to order PAIDE/6000, call IBM DIRECT at 1-800-426-2255 or contact your local IBM representative.

As root, type svmon. Under the pgspace heading, the inuse field is the number of working pages that are in use in all of virtual memory.

------------------------------------------------------------------------

Explanation of svmon output

memory:
SIZE   total size of memory in 4K pages
INUSE  number of pages in RAM that are in use by a process, plus the number of persistent pages that belonged to a terminated process and are still resident in RAM. This value is the total size of memory minus the number of pages on the free list.
FREE   number of pages on the free list.
PIN    number of pages pinned in RAM (a pinned page is a page that is always resident in RAM and cannot be paged out)

in use:
WORK   number of working pages in RAM
PERS   number of persistent pages in RAM
CLNT   number of client pages in RAM (a client page is a remote file page)

pin:
WORK   number of working pages pinned in RAM
PERS   number of persistent pages pinned in RAM
CLNT   number of client pages pinned in RAM

pgspace:
SIZE   total size of paging space in 4K pages
INUSE  total number of allocated slots. (See the explanation above on allocation of paging space.)

To find out how much memory a process is using, type:

$ svmon -P PID       (for one process)
$ svmon -Pau | more  (for all processes)

To see the number of working pages unique to a process's private stack and data regions in all of virtual memory, look at the entry with type work and description private. The svmon output may also list several shared segments. For a complete picture, determine which segments are unique to an individual process and which are shared with other programs. Multiply the values by 4096 to get the number of bytes of memory the process is using. The number 4096 comes from the fact that each page is 4 KB in size. You can also divide the number of pages by 256 in order to get megabytes.

[ TechDocs Ref: 90605226714706 Publish Date: May. 05, 2000 4FAX Ref: 2449 ]
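The page-count arithmetic above can be sketched in a few lines. The 5120-page figure is a hypothetical svmon value; each page is 4096 bytes, so dividing pages by 256 yields megabytes.

```shell
#!/bin/sh
# Convert an svmon page count to bytes and megabytes (4 KB pages).
pages=5120
bytes=$(( pages * 4096 ))
mb=$(( pages / 256 ))
echo "${pages} pages = ${bytes} bytes = ${mb} MB"
```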
Bruce Spencer,
baspence@us.ibm.com
July 6, 2000