Performance Management Guide
Unless you are purchasing a software package that comes with detailed
resource-requirement documentation, estimating resources can be the most
difficult task in the performance-planning process. The difficulty has
several causes, as follows:
- There are several ways to do any task. You can write a C (or other
high-level language) program, a shell script, a perl script, an
awk script, a sed script, an AIXwindows dialog, and so
on. Some techniques that may seem particularly suitable for the
algorithm and for programmer productivity are extraordinarily expensive from
the performance perspective.
A useful guideline is that, the higher the level of abstraction, the more
caution is needed to ensure that one does not receive a performance
surprise. Consider carefully the data volumes and number of iterations
implied by some apparently harmless constructs.
- The precise cost of a single process is difficult to define. This
difficulty is not merely technical; it is philosophical. If
multiple instances of a given program run by multiple users are sharing pages
of program text, which process should be charged with those pages of memory?
The operating system leaves recently used file pages in memory to provide a
caching effect for programs that reaccess that data. Should programs
that reaccess data be charged for the space that was used to retain the data?
The granularity of some measurements, such as the system clock, can cause
variations in the CPU time attributed to successive instances of the same
program.
Two approaches deal with resource-report ambiguity and variability.
The first is to ignore the ambiguity and to keep eliminating sources of
variability until the measurements become acceptably consistent. The
second approach is to try to make the measurements as realistic as possible
and describe the results statistically. Note that the latter yields
results that have some correlation with production situations.
- Systems are rarely dedicated to running a single instance of a single
program. There are almost always daemons running, there is frequently
communications activity, and there is often workload from multiple users. These
activities seldom add up linearly. For example, increasing the number
of instances of a given program may result in few new program text pages being
used, because most of the program was already in memory. However, the
additional processes may result in more contention for the processor's
caches, so that not only do the other processes have to share processor time
with the newcomer, but all processes may experience more cycles per
instruction. This is, in effect, a slowdown of the processor, as a
result of more frequent cache misses.
Make your estimate as realistic as the specific situation allows, using the
following guidelines:
- If the program exists, measure the existing installation that most closely
resembles your own requirements. The best method is to use a capacity
planning tool such as BEST/1.
- If no suitable installation is available, do a trial installation and
measure a synthetic workload.
- If it is impractical to generate a synthetic workload that matches the
requirements, measure individual interactions and use the results as input to
a simulation.
- If the program does not exist yet, find a comparable program that uses the
same language and general structure, and measure it. Again, the more
abstract the language, the more care is needed in determining
comparability.
- If no comparable program exists, develop a prototype of the main
algorithms in the planned language, measure the prototype, and model the
workload.
- Only if measurement of any kind is impossible or infeasible should you
make an educated guess. If it is necessary to guess at resource
requirements during the planning stage, it is critical that the actual program
be measured at the earliest possible stage of its development.
Keep in mind that independent software vendors (ISVs) often have sizing
guidelines for their applications.
In estimating resources, we are primarily interested in four dimensions (in
no particular order):
- CPU time: Processor cost of the workload
- Disk accesses: Rate at which the workload generates disk reads or writes
- LAN traffic: Number of packets the workload generates and the number of
bytes of data exchanged
- Real memory: Amount of RAM the workload requires
The following sections discuss how to determine these values in various
situations.
If the real program, a comparable program, or a prototype is available for
measurement, the choice of technique depends on the following:
- Whether the system is processing other work in addition to the workload we
want to measure.
- Whether we have permission to use tools that may degrade performance (for
example, is this system in production or is it dedicated to our use for the
duration of the measurement?).
- The degree to which we can simulate or observe an authentic
workload.
Using a dedicated system is the ideal situation because we can use
measurements that include system overhead as well as the cost of individual
processes.
To measure overall system performance for most of the system activity, use
the vmstat command:
# vmstat 5 >vmstat.output
This gives us a picture of the state of the system every 5 seconds during
the measurement run. The first set of vmstat output contains
the cumulative data from the last boot to the start of the vmstat
command. The remaining sets are the results for the preceding interval,
in this case 5 seconds. A typical set of vmstat output on a
system looks similar to the following:
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
0 1 75186 192 0 0 0 0 1 0 344 1998 403 6 2 92 0
To measure CPU and disk activity, use the iostat command:
# iostat 5 >iostat.output
This gives us a picture of the state of the system every 5 seconds during
the measurement run. The first set of iostat output contains
the cumulative data from the last boot to the start of the iostat
command. The remaining sets are the results for the preceding interval,
in this case 5 seconds. A typical set of iostat output on a
system looks similar to the following:
tty: tin tout avg-cpu: % user % sys % idle % iowait
0.0 0.0 19.4 5.7 70.8 4.1
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 8.0 34.5 8.2 12 164
hdisk1 0.0 0.0 0.0 0 0
cd0 0.0 0.0 0.0 0 0
To measure memory, use the svmon command. The svmon
-G command gives a picture of overall memory use. The statistics
are in terms of 4 KB pages (example from AIX 4.3.3):
# svmon -G
size inuse free pin virtual
memory 65527 65406 121 5963 74711
pg space 131072 37218
work pers clnt
pin 5972 0 0
in use 54177 9023 2206
In this example, the machine's 256 MB of memory (65527 4 KB frames) is fully
used: only 121 frames are free. About 83 percent of RAM is in use for
working segments, the read/write memory of running programs (54177 of the
65406 frames in use); the rest is for caching files. If there are
long-running processes in which we are interested, we can review their memory
requirements in detail. The following example determines the memory
used by a process owned by user hoetzel.
# ps -fu hoetzel
UID PID PPID C STIME TTY TIME CMD
hoetzel 24896 33604 0 09:27:35 pts/3 0:00 /usr/bin/ksh
hoetzel 32496 25350 6 15:16:34 pts/5 0:00 ps -fu hoetzel
# svmon -P 24896
------------------------------------------------------------------------------
Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd
24896 ksh 4592 1456 2711 5615 N N
Vsid Esid Type Description Inuse Pin Pgsp Virtual Addr Range
4411 d work shared library text 2619 0 1187 1315 0..65535
0 0 work kernel seg 1908 1455 1399 4171 0..32767 :
65475..65535
703c 1 pers code,/dev/hd2:4188 58 0 - - 0..58
5176 2 work process private 4 1 104 108 0..129 :
65309..65535
1465 - pers /dev/hd2:16866 2 0 - - 0..1
4858 - pers /dev/hd2:8254 1 0 - - 0..0
18c5 - pers /dev/andy:207 0 0 - - 0..0
783d f work shared library data 0 0 21 21 0..3492
The working segment (5176), with 4 pages in use, is the cost of this
instance of the ksh program. The 2619-page cost of the
shared library and the 58-page cost of the ksh program are spread
across all of the running programs and all instances of the ksh
program, respectively.
If we believe that our 256 MB system is larger than necessary, use the
rmss command to reduce the effective size of the machine and
remeasure the workload. If paging increases significantly or response
time deteriorates, we have reduced memory too much. This technique can
be continued until we find a size that runs our workload without
degradation. See Assessing Memory Requirements Through
the rmss Command for more information on this technique.
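A minimal sketch of this technique follows; the 192 MB and 128 MB sizes are
illustrative, and the workload must be rerun and remeasured after each
change:
# rmss -p
# rmss -c 192
# rmss -c 128
# rmss -r
The rmss -p command reports the current effective memory size, rmss -c
changes it (the size is given in megabytes), and rmss -r restores the real
memory size when the experiment is complete.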
The primary command for measuring network usage is the netstat
program. The following example shows the activity of a specific
Token-Ring interface:
# netstat -I tr0 5
input (tr0) output input (Total) output
packets errs packets errs colls packets errs packets errs colls
35552822 213488 30283693 0 0 35608011 213488 30338882 0 0
300 0 426 0 0 300 0 426 0 0
272 2 190 0 0 272 2 190 0 0
231 0 192 0 0 231 0 192 0 0
143 0 113 0 0 143 0 113 0 0
408 1 176 0 0 408 1 176 0 0
The first line of the report shows the cumulative network traffic since the
last boot. Each subsequent line shows the activity for the preceding
5-second interval.
The techniques of measurement on production systems are similar to those on
dedicated systems, but we must be careful to avoid degrading system
performance.
Probably the most cost-effective tool is the vmstat command,
which supplies data on memory, I/O, and CPU usage in a single report.
If the vmstat intervals are kept reasonably long, for example, 10
seconds, the average cost is relatively low. See Identifying the Performance-Limiting Resource for more
information on using the vmstat command.
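For example, a low-overhead background measurement on a production system
might look like the following; the 10-second interval, the sample count, and
the output file name are illustrative:
# nohup vmstat 10 360 >vmstat.production.out &
This collects an hour of 10-second samples (360 of them) into a file that can
be analyzed offline, so the only load placed on the production system is that
of taking the samples themselves.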
By partial workload, we mean measuring a part of the production
system's workload for possible transfer to or duplication on a different
system. Because this is a production system, we must be as unobtrusive
as possible. At the same time, we must analyze the workload in more
detail to distinguish between the parts we are interested in and those we are
not. To do a partial measurement, we must discover what the workload
elements of interest have in common. Are they:
- The same program or a small set of related programs?
- Work performed by one or more specific users of the system?
- Work that comes from one or more specific terminals?
Depending on the commonality, we could use one of the following:
# ps -ef | grep pgmname
# ps -fu username, . . .
# ps -ft ttyname, . . .
to identify the processes of interest and report the cumulative CPU time
consumption of those processes. We can then use the svmon
command (judiciously) to assess the memory use of the processes.
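For example, if the common element is a program name (pgmname and the
process ID 12345 below are placeholders), a partial measurement might be
sketched as follows:
# ps -ef | grep pgmname
# svmon -P 12345
Repeating the ps command at intervals shows how the cumulative CPU time (the
TIME column) of those processes grows; the svmon -P command, run sparingly,
reports their memory use.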
Many tools are available for measuring the resource consumption of
individual programs. Some of these programs are capable of more
comprehensive workload measurements as well, but are too intrusive for use on
production systems. Most of these tools are discussed in depth in the
chapters that discuss tuning for minimum consumption of specific
resources. Some of the more prominent are listed below; example
invocations are sketched after the list:
- svmon: Measures the real memory used by a process. Discussed in
Determining How Much Memory Is Being Used.
- time: Measures the elapsed execution time and CPU consumption of an
individual program. Discussed in Using the time Command to Measure CPU Use.
- tprof: Measures the relative CPU consumption of programs, subroutine
libraries, and the operating system's kernel. Discussed in Using the tprof
Program to Analyze Programs for CPU Use.
- vmstat -s: Measures the I/O load generated by a program. Discussed in
Assessing Overall Disk I/O with the vmstat Command.
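The following sketch shows typical invocations of the last three tools;
testpgm is a placeholder for the program being measured, and the exact
tprof options vary somewhat by operating system level:
# time testpgm
# tprof -x testpgm
# vmstat -s >statistics.before
# testpgm
# vmstat -s >statistics.after
The time command reports the real, user, and system time for the run;
tprof -x runs the program and collects a CPU profile; and the difference
between the two vmstat -s reports approximates the I/O activity that the
program generated.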
It is impossible to make precise estimates of unwritten programs.
The invention and redesign that take place during the coding phase defy
prediction, but the following guidelines can help you to get a general sense
of the requirements. As a starting point, a minimal program would need
the following:
- About 50 milliseconds of CPU time, mostly system time.
- Real Memory
- One page for program text
- About 15 pages (of which 2 are pinned) for the working (data) segment
- Access to libc.a. Normally this is shared with
all other programs and is considered part of the base cost of the operating
system.
- About 12 page-in disk I/O operations, if the program has not been
compiled, copied, or used recently. Otherwise, none required.
To the above, add the basic cost allowances for demands implied by the
design (the units given are for example purposes only; a worked example
follows the list):
- CPU time
- The CPU consumption of an ordinary program that does not contain high
levels of iteration or costly subroutine calls is almost unmeasurably
small.
- If the proposed program contains a computationally expensive algorithm,
develop a prototype and measure the algorithm.
- If the proposed program uses computationally expensive library
subroutines, such as X or Motif constructs or the printf()
subroutine, measure their CPU consumption with otherwise trivial
programs.
- Real Memory
- Allow approximately 350 lines of code per page of program text, which is
about 12 bytes per line. Keep in mind that coding style and compiler
options can make a difference of a factor of two in either direction.
This allowance is for pages that are touched in your typical scenario.
If your design places infrequently executed subroutines at the end of the
executable program, those pages do not normally consume real memory.
- References to shared libraries other than libc.a
increase the memory requirement only to the extent that those libraries are
not shared with other programs or instances of the program being
estimated. To measure the size of these libraries, write a trivial,
long-running program that refers to them and use the svmon -P
command against the process.
- Estimate the amount of storage that will be required by the data
structures identified in the design. Round up to the nearest
page.
- In the short run, each disk I/O operation will use one page of
memory. Assume that the page has to be available already. Do not
assume that the program will wait for another program's page to be
freed.
- Disk I/O
- For sequential I/O, each 4096 bytes read or written causes one I/O
operation, unless the file has been accessed recently enough that some of its
pages are still in memory.
- For random I/O, each access, however small, to a different 4096-byte page
causes one I/O operation, unless the file has been accessed recently enough
that some of its pages are still in memory.
- Each sequential read or write of a 4 KB page in a large file takes about
100 units. Each random read or write of a 4 KB page takes about 300
units. Remember that real files are not necessarily stored sequentially
on disk, even though they are written and read sequentially by the
program. Consequently, the typical CPU cost of an actual disk access
will be closer to the random-access cost than to the sequential-access
cost.
- Communications I/O
- If disk I/O is actually to Network File System (NFS) remote-mounted file
systems, the disk I/O is performed on the server, but the client experiences
higher CPU and memory demands.
- RPCs of any kind contribute substantially to the CPU load. The
proposed RPCs in the design should be minimized, batched, prototyped, and
measured in advance.
- Each sequential NFS read or write of a 4 KB page takes about 600 units on
the client. Each random NFS read or write of a 4 KB page takes about
1000 units on the client.
- Web browsing and Web serving imply considerable network I/O, with TCP
connections opening and closing quite frequently.
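As an illustration of how these allowances combine (the 400 KB file size is
purely hypothetical): reading such a file sequentially from a local disk
touches about 100 4 KB pages, or roughly 100 x 100 = 10,000 units; reading
the same pages in random order costs about 100 x 300 = 30,000 units; and
reading them sequentially from an NFS-mounted file system costs about
100 x 600 = 60,000 units on the client.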
The best method for estimating peak and typical resource requirements is to
use a queuing model such as BEST/1. Static models can be used, but you
run the risk of overestimating or underestimating the peak resource. In
either case, you need to understand how multiple programs in a workload
interact from the standpoint of resource requirements.
If you are building a static model, use a time interval that is the
specified worst-acceptable response time for the most frequent or demanding
program (usually they are the same). Determine which programs will
typically be running during each interval, based on your projected number of
users, their think time, their key entry rate, and the anticipated mix of
operations.
Use the following guidelines (a worked memory example follows the list):
- CPU time
- Add together the CPU requirements for all of the programs that are
running during the interval. Include the CPU requirements of the disk
and communications I/O the programs will be doing.
- If this number is greater than 75 percent of the available CPU time during
the interval, consider fewer users or more CPUs.
- Real Memory
- The operating system memory requirement scales with the amount of physical
memory. Start with 6 to 8 MB for the operating system itself.
The lower figure is for a standalone system. The higher figure is for a
system that is LAN-connected and uses TCP/IP and NFS.
- Add together the working segment requirements of all of the instances of
the programs that will be running during the interval, including the space
estimated for the program's data structures.
- Add to that total the memory requirement of the text segment of each
distinct program that will be running (one copy of the program text serves all
instances of that program). Remember that any (and only) subroutines
that are from unshared libraries will be part of the executable program, but
the libraries themselves will not be in memory.
- Add to the total the amount of space consumed by each of the shared
libraries that will be used by any program in the workload. Again, one
copy serves all.
- To allow adequate space for some file caching and the free list, your
total memory projection should not exceed 80 percent of the size of the
machine to be used.
- Disk I/O
- Add the number of I/Os implied by each instance of each program.
Keep separate totals for I/Os to small files (or randomly to large files)
versus purely sequential reading or writing of large files (more than 32
KB).
- Subtract those I/Os that you believe will be satisfied from memory.
Any record that was read or written in the previous interval is probably still
available in the current interval. Beyond that, examine the size of the
proposed machine versus the total RAM requirements of the machine's
workload. Any space remaining after the operating system's
requirement and the workload's requirements probably contains the most
recently read or written file pages. If your application's design
is such that there is a high probability that you will reuse recently accessed
data, you can calculate an allowance for the caching effect. Remember
that the reuse is at the page level, not at the record level. If the
probability of reuse of a given record is low, but there are a lot of records
per page, it is likely that some of the records needed in any given interval
will fall in the same page as other, recently used, records.
- Compare the net I/O requirements (disk I/Os per second per disk) to the
approximate capabilities of current disk drives. If the random or
sequential requirement is greater than 75 percent of the total corresponding
capability of the disks that will hold application data, tuning (and possibly
expansion) will be needed when the application is in production.
- Communications I/O
- Calculate the bandwidth consumption of the workload. If the total
bandwidth consumption of all of the nodes on the LAN is greater than 70
percent of nominal bandwidth (50 percent for Ethernet), you might want to use
a network with higher bandwidth.
- Perform a similar analysis of CPU, memory, and I/O requirements of the
added load that will be placed on the server.
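As an illustration of the memory guideline (all of the figures below are
assumptions, not measurements): 8 MB for a LAN-connected operating system,
plus 50 users each requiring 2 MB of private working segments, plus 20 MB of
shared program text and shared libraries comes to 128 MB. Dividing by 0.80
to leave room for file caching and the free list gives 160 MB, so a 192 MB
or larger machine would be a comfortable choice for this hypothetical
workload.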
Note: Remember that these guidelines are intended for use
only when no extensive measurement is possible. Any
application-specific measurement that can be used in place of a guideline will
considerably improve the accuracy of the estimate.