Performance Management Guide

Using the tprof Program to Analyze Programs for CPU Use

The typical program execution is a variable mixture of application code, library subroutines, and kernel services. Frequently, a program that has not yet been tuned is found to expend most of its CPU cycles in a few statements or subroutines. Often these hot spots are a surprise to the programmer. They often can be considered performance problems. Use the tprof command to pinpoint any hot spots (for additional information see The tprof Command). The tprof command can profile any program produced by one of the compilers: C, C++, and FORTRAN.

To determine whether the tprof program is installed and available, run the following command:

# lslpp -lI bos.perf.tools

The raw data for the tprof program is obtained through the trace facility (see Analyzing Performance with the Trace Facility). When a program is profiled, the trace facility is activated and instructed to collect data from the trace hook (hook ID 234) that records the contents of the Instruction Address Register when a system-clock interrupt occurs (100 times a second per processor). Several other trace hooks are also activated to allow the tprof program to track process and dispatch activity. The trace records are not written to a disk file; they are written to a pipe that is read by a program that builds a table of the unique program addresses that have been encountered and the number of times each one occurred. When the workload being profiled is complete, the table of addresses and their occurrence counts is written to disk. The data-reduction component of the tprof program then correlates the instruction addresses that were encountered with the ranges of addresses occupied by the various programs and reports the distribution of address occurrences (ticks) across the programs involved in the workload.

The distribution of ticks is roughly proportional to the CPU time spent in each program (10 milliseconds per tick). After the high-use programs have been identified, the programmer can take action to restructure their hot spots or minimize their use.

A tprof Example for releases prior to AIX 5.2

The following C program initializes each byte of a large array of integers to 0x01, increments each integer by a random constant, and prints out a randomly selected integer. The program is representative of programs that process large arrays.

/*  Array Incrementer -- Version 1  */
#include <stdlib.h>
#define Asize 1024
#define RowDim InnerIndex
#define ColDim OuterIndex
main()
{
  int Increment;
  int OuterIndex;
  int InnerIndex;
  int big [Asize][Asize];
  /* initialize every byte of the array to 0x01 */
  for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
  {
    for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
      big[RowDim][ColDim] = 0x01010101;
  }
  Increment = rand();
  /* increment every element in the array */
  for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
  {
    for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    {
      big[RowDim][ColDim] += Increment;
      if (big[RowDim][ColDim] < 0)
       printf("Negative number. %d\n",big[RowDim][ColDim]);
    }
  }
  printf("Version 1 Check Num: %d\n",
        big[rand()%Asize][rand()%Asize]);
  return(0);
}

The program was compiled with the following command:

# xlc -g version1.c -o version1

The -g parameter causes the C compiler to generate the object module with symbolic debugging information for use by the tprof program. Although the tprof program can profile optimized modules, the -O parameter has been omitted to make the line numbers that the tprof program uses more precise. When the C compiler is optimizing, it often does enough rearrangement of code to make the output of the tprof program harder to interpret. On the test system, this program runs in about 5.97 seconds of elapsed time, of which more than 5.9 seconds is user CPU time. The program clearly meets its objective of being CPU-limited.

We can profile the program with the following command (include the -m option on operating ystem versions later than AIX 4.3.3):

# tprof -p version1 -x version1

A file called __version1.all (shown below) is created. It reports how many CPU ticks each of the programs involved in the execution consumed.

          Process      PID      TID    Total   Kernel     User   Shared    Other
          =======      ===      ===    =====   ======     ====   ======    =====
         version1    30480    30481      793       30      763        0        0
              ksh    32582    32583        8        8        0        0        0
        /etc/init        1      459        6        0        6        0        0
       /etc/syncd     3854     4631        5        5        0        0        0
            tprof     5038     5019        4        2        2        0        0
          rlogind    11344    15115        2        2        0        0        0
          PID.771      770      771        1        1        0        0        0
            tprof    11940    11941        1        1        0        0        0
            tprof    11950    11951        1        1        0        0        0
            tprof    13986    15115        1        1        0        0        0
              ksh    16048     7181        1        1        0        0        0
          =======      ===      ===    =====   ======     ====   ======    =====
            Total                        823       52      771        0        0
          Process     FREQ    Total   Kernel     User   Shared    Other
          =======      ===    =====   ======     ====   ======    =====
         version1        1      793       30      763        0        0
              ksh        2        9        9        0        0        0
        /etc/init        1        6        0        6        0        0
       /etc/syncd        1        5        5        0        0        0
            tprof        4        7        5        2        0        0
          rlogind        1        2        2        0        0        0
          PID.771        1        1        1        0        0        0
          =======      ===    =====   ======     ====   ======    =====
            Total       11      823       52      771        0        0
   Total Ticks For version1(    USER) =    763
           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
                .main     763    92.7      version1.c       632    560

The first section of the tprof report shows the number of ticks consumed by, or on behalf of, each process. The program version1 used 763 ticks itself, and 30 ticks occurred in the kernel on behalf of version1's process. Two processes running the Bourne shell were involved in the execution of version1. Four processes were running tprof-related code. The init process, the sync daemon, an rlogin process, and one other process accounted for 14 ticks.

Remember that the program associated with a given numerical process ID changes with each exec() subroutine call. If one application program uses the exec() subroutine to execute another, both program names will appear in the tprof output associated with the same process ID.

The second section of the report summarizes the results by program, regardless of process ID. It shows the number (FREQ) of different processes that ran each program at some point.

Up to this point, none of the tprof processing has required access to the specially compiled version of the program. You could have done the preceding analysis on a program for which you did not have access to the source code.

It is clear from this report that the main CPU consumption (92.7 percent) is in the program itself, not in the kernel nor in library subroutines that the program uses. You must examine the program itself more closely.

Because you compiled version1.c with the -g option, the object file contains information that relates offsets in the program text to lines of source code. Consequently, the tprof program created an annotated version of the source file version1.c, called __t.version1.c, based on the offsets and line number information in the object module. The first column is the line number. The second column is the number of times the trace hook reported that the timer interrupt occurred while the system was executing one of the instructions associated with that line.

Ticks Profile for main in version1.c

   Line   Ticks   Source

    14     34       for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
    15      -       {
    16     40         for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    17    261           big[RowDim][ColDim] = 0x01010101;
    18      -       }
    19      -       Increment = rand();
    20      -
    21      -       /* increment every element in the array */
    22     70       for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
    23      -       {
    24      -         for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    25      -         {
    26     69           big[RowDim][ColDim] += Increment;
    27     50           if (big[RowDim][ColDim] < 0)
    28    239            printf("Negative number.%d\n",
                                 big[RowDim][ColDim]);
    29      -         }
    30      -       }
    31      -       printf("Version 1 Check Num: %d\n",
    32      -             big[rand()%Asize][rand()%Asize]);
    33      -       return(0);
    34      -     }

 763 Total Ticks for main in version1.c

This file shows that the largest numbers of ticks are associated with accessing elements of the array big, so you should be able to enhance performance significantly by concentrating on the inner for loops. The first (initialization) for loop is a case of inefficient programming, because it initializes the array one element at a time. If you were setting the array to 0, use the bzero() subroutine. Because you are setting each byte to a specific character, use the memset() subroutine to replace the first for loop. (The efficient bzero() and memset() functions, like the str*() functions, are written in assembler language and use hardware instructions that have no direct equivalent in the C language.)

You must access the array one element at a time to increment the values, but ensure that the pattern of memory reference is to consecutive addresses, to maximize cache use. In this case, you have the row dimension changing faster than the column dimension. Because C arrays are arranged in row-major order, you are skipping over a complete row with each successive memory reference. Because the rows are 1024 integers long (4096 bytes), you are changing pages on every reference. The size of the array greatly exceeds both the data cache and data Translation Lookaside Buffer (TLB) capacities, so you have written a program for maximum cache and TLB thrashing. To fix this problem, transpose the two #define statements to reverse the values of RowDim and ColDim.

The unoptimized form of the resulting program (version2.c) consumes about 2.7 CPU seconds, compared with 7.9 CPU seconds for program version1.

The following file, __t.version2.c, is the result of a tprof run against the unoptimized form:

Ticks Profile for main in version2.c

   Line   Ticks   Source

    15      -       memset(big,0x01,sizeof(big));
    16      -       Increment = rand();
    17      -
    18      -       /* increment in memory order */
    19     60       for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
    20      -       {
    21      -         for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    22      -         {
    23     67          big[RowDim][ColDim] += Increment;
    24     60          if (big[RowDim][ColDim] < 0)
    25     43          printf("Negative number. %d\n",big[RowDim][ColDim]);
    26      -         }
    27      -       }
    28      -       printf("Version 2 Check Num: %d\n",
    29      -                   big[rand()%Asize][rand()%Asize]);
    30      -       return(0);
    31      -     }

 230 Total Ticks for main in version2.c

By knowing its CPU use pattern, you have improved the CPU speed of this program by a factor of almost three, for the unoptimized case. When you compile version1.c and version2.c with optimization and compare their performance, the "before and after" improvement due to the changes is a factor of 7.

In many cases, most of a program's CPU use will occur in the library subroutines it uses rather than in the program itself. If you take version2.c and remove the conditional test on line 24 and the printf() entry on line 28, to create a version3.c that reads as follows:

#include <string.h>
#include <stdlib.h>
#define Asize 256
#define RowDim OuterIndex
#define ColDim InnerIndex

main()
{
  int Increment;
  int OuterIndex;
  int InnerIndex;
  int big [Asize][Asize];

  /* Initialize every byte to 0x01 */
  memset(big,0x01,sizeof(big));
  Increment = rand();
  /* increment in memory order */
  for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
  {
    for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    {
      big[RowDim][ColDim] += Increment;
      printf("RowDim=%d, ColDim=%d, Number=%d\n",
              RowDim, ColDim, big[RowDim][ColDim]);
    }
  }
  return(0);
}

the execution time becomes dominated by the printf() statement. The command:

# tprof -v -s -k -p version3 -x version3 >/dev/null

produces a __version3.all that includes profiling data for the kernel and the shared subroutine library libc.a (the only shared library this program uses):

          Process      PID      TID    Total   Kernel     User   Shared    Other
          =======      ===      ===    =====   ======     ====   ======    =====
         version3    28372    28373      818       30       19      769        0
              ksh    27348    27349        5        5        0        0        0
            tprof    15986    19785        3        1        2        0        0
            tprof     7784     8785        1        1        0        0        0
            tprof    12904    13657        1        1        0        0        0
              ksh    13940    13755        1        1        0        0        0
          =======      ===      ===    =====   ======     ====   ======    =====
            Total                        829       39       21      769        0

          Process     FREQ    Total   Kernel     User   Shared    Other
          =======      ===    =====   ======     ====   ======    =====
         version3        1      818       30       19      769        0
              ksh        2        6        6        0        0        0
            tprof        3        5        3        2        0        0
          =======      ===    =====   ======     ====   ======    =====
            Total        6      829       39       21      769        0

   Total Ticks For version3(    USER) =     19

           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
                .main      11     1.3      version3.c       632    320
              .printf       8     1.0         glink.s      1112     36

   Total Ticks For version3(  KERNEL) =     30

           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
             .sc_flih       7     0.8           low.s     13832   1244
            .i_enable       5     0.6           low.s     21760    256
            .vmcopyin       3     0.4        vmmove.c    414280    668
         .xix_setattr       2     0.2     xix_sattr.c    819368    672
          .isreadonly       2     0.2        disubs.c    689016     60
               .lockl       2     0.2         lockl.s     29300    208
            .v_pagein       1     0.1    v_getsubs1.c    372288   1044
             .curtime       1     0.1         clock.s     27656     76
             .trchook       1     0.1          noname     48168    856
               .vmvcs       1     0.1         vmvcs.s     29744   2304
           .spec_rdwr       1     0.1    spec_vnops.c    629596    240
                .rdwr       1     0.1          rdwr.c    658460    492
               .imark       1     0.1         isubs.c    672024    184
               .nodev       1     0.1     devsw_pin.c    135864     32
           .ld_findfp       1     0.1      ld_libld.c    736084    240

   Total Ticks For version3( SH-LIBs) =    769

        Shared Object   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
         libc.a/shr.o     769    92.0        /usr/lib    794624 724772

   Profile: /usr/lib/libc.a  shr.o

   Total Ticks For version3(/usr/lib/libc.a) =    769

           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
             ._doprnt     476    56.9        doprnt.c     36616   7052
              .fwrite     205    24.5        fwrite.c     50748    744
              .strchr      41     4.9        strchr.s     31896    196
              .printf      18     2.2        printf.c    313796    144
             ._moveeq      16     1.9        memcmp.s     36192    184
              .strlen      10     1.2      strerror.c     46800    124
              .isatty       1     0.1        isatty.c     62932    112
             ._xwrite       1     0.1        flsbuf.c      4240    280
             .__ioctl       1     0.1         ioctl.c     57576    240

This report confirms that most of the ticks are being used by the shared libraries (libc.a in this case). The profile of libc.a shows that most of those ticks are being consumed by the _doprnt() subroutine.

The _doprnt() subroutine is the processing module for the printf(), sprintf(), and other subroutines. With a simple change, you have increased the run time from 2.7 seconds to 8.6 seconds, and the formatted printing now consumes about 60 percent of the CPU time. This illustrates why formatting should be used judiciously. The performance of the _doprnt() subroutine is also affected by the locale. See Appendix E. National Language Support: Locale versus Speed. These tests were run in the C locale, which is the most efficient.

A tprof Example for releases starting with AIX 5.2

The tprof command has been completely rewritten for AIX 5.2 such that now it is much faster than before and provides more functionality. The syntax is different than the previous tprof, and you can view the complete documentation of tprof in AIX 5L Version 5.2 Commands Reference.

The following is an example of how to collect a CPU tick profile of the version1 program using the new tprof command, executed on a 4-way SMP system. Since this machine is a very fast running system, the command finished in less than a second. To make this program run longer, the array size was changed to 4096 instead of 1024, which is the value of the Asize variable in version1.c:

# tprof -z -u -p version1 -x version1

A file called version1.prof (shown below) is created in the current directory which reports how many CPU ticks for each of the programs running on the system while version1 was running.

          Process            Freq    Total   Kernel     User   Shared    Other
          =======            ====    =====   ======     ====   ======    =====
             wait               4     5810     5810        0        0        0
       ./version1               1     1672       35     1637        0        0
   /usr/bin/tprof               2       15       13        0        2        0
       /etc/syncd               1        2        2        0        0        0
      /usr/bin/sh               2        2        2        0        0        0
          swapper               1        1        1        0        0        0
 /usr/bin/trcstop               1        1        1        0        0        0
             rmcd               1        1        1        0        0        0
          =======             ===    =====   ======     ====   ======    =====
            Total              13     7504     5865     1637        2        0

           Process      PID      TID    Total   Kernel     User   Shared    Other
          =======      ===      ===    =====   ======     ====   ======    =====
             wait    16392    16393     1874     1874        0        0        0
             wait    12294    12295     1873     1873        0        0        0
             wait    20490    20491     1860     1860        0        0        0
       ./version1   245974   606263     1672       35     1637        0        0
             wait     8196     8197      203      203        0        0        0
   /usr/bin/tprof   291002   643291       13       13        0        0        0
   /usr/bin/tprof   274580   610467        2        0        0        2        0
       /etc/syncd    73824   110691        2        2        0        0        0
      /usr/bin/sh   245974   606263        1        1        0        0        0
      /usr/bin/sh   245976   606265        1        1        0        0        0
 /usr/bin/trcstop   245976   606263        1        1        0        0        0
          swapper        0        3        1        1        0        0        0
             rmcd   155876   348337        1        1        0        0        0
          =======      ===      ===    =====   ======     ====   ======    =====
            Total                       7504     5865     1637        2        0


      Total Samples = 7504       Total Elapsed Time = 18.76s

   Profile: ./version1
   Total Ticks For All Processes (./version1) = 1637

           Subroutine   Ticks     %        Source   Address  Bytes
        =============  ======  ======     =======   =======  =====
                .main    1637   21.82  version1.c       350    536


   Profile: ./version1
   Total Ticks For ./version1[245974] (./version1) = 1637

           Subroutine   Ticks     %        Source   Address  Bytes
        =============  ======  ======     =======   =======  =====
                .main    1637   21.82  version1.c       350    536

The first section of the report summarizes the results by program, regardless of the process ID. It shows the number of different processes (Freq) that ran each program at some point.

The second section of the tprof report shows the number of ticks consumed by, or on behalf of, each process. The program version1 used 1637 ticks itself and 35 ticks occurred in the kernel on behalf of theversion1 process.

The third section breaks down the user ticks associated with the executable program being profiled. It reports the number of ticks used by each function in the executable program, and the percentage of the total run's CPU ticks (7504) that each function's ticks represent. Since this system's CPUs were mostly idle, most of the 7504 ticks are idle ticks. To see what percentage of the busy time this program took, subtract the wait thread's CPU ticks (these are the idle CPU ticks) from the total and then subtract that from the total number of ticks. So, 7504-5810 gives us 1694. This is the total number of ticks for actual work done on the system. If we divide the version1 program's user ticks (1637) by 1694, we find that its percentage of system busy ticks is 1637/1694*100, which is 96.6%.

Offline Processing with the tprof Command prior to AIX 5.2

The -i Trace_File flag allows for offline processing by the tprof command of trace data files created by the system trace command. The -n flag allows you to specify a Gennames_File to be used when processing an offline file. These flags are useful when it is necessary to postprocess a trace file from a remote machine or perform the trace data collection at one time and postprocess it at another time. In this case -n with a Gennames_File must be used from the machine that the trace came from. The flags are also useful when system loading is high and trace hooks are being missed by the tprof command. The offline option relieves this problem.

Trace hooks relevant to the tprof command must be collected by the trace command and are specified by the trace -j flag. The gennames command is then executed to collect additional information for the tprof command. After the trace and gennames Gennames_File commands have executed, the trcrpt -r command must be executed on the trace logfile and redirected to another file. At this point an adjusted trace logfile and a Gennames_File is input to the tprof command.

For example:

# trace -af -T 1000000 -L 10000000 -o trace.out -j 000,001,002,003,005,006,234,106,10C,134,139,00A,465
# workload
# trcoff
# gennames > gennames.out
# trcstop
# trcrpt -r   trace.out > trace.rpt

Next run the tprof command with at least the -i and -n flags, as follows:

# tprof -i trace.rpt -n gennames.out -s -k -e

On systems with many CPUs, it is better to run the trace and trcrpt commands with the -C all flag (see Formatting a Report from trace -C Output).

Offline Processing with the tprof Command starting with AIX 5.2

The new tprof command can do offline processing of trace files, but it requires that filenames be specified with a rootstring name. This can be whatever you want to call it. Also, there are certain suffixes required for the input files that tprof will use. For example, the trace binary file should end in .trc. The trace binary file does not have to be post-processed with trcrpt -r any more. Also, instead of using gennames output, you need to collect gensyms output and put this in a file called rootstring.syms.

Let us call our rootstring trace1. Then to collect a trace, we can trace using all of the hooks or at least the following hooks:

# trace -af -T 1000000 -L 10000000 -o trace1.trc -j 000,001,002,003,005,006,234,106,10C,134,139,00A,465
# workload
# trcoff
# gensyms > trace1.syms
# trcstop
# trcrpt -r trace1 -k -u -s -z

This creates a trace1.prof file which gives you a CPU profile of the system while trace was running.