AIX Versions 3.2 and 4 Performance Tuning Guide

Appendix H. Accessing the Processor Timer

Attempts to measure very small time intervals in AIX are often frustrated by the intermittent background activity that is part of the operating system and by the processing time consumed by the system time routines. One approach to solving this problem is to access the processor timer directly to determine the beginning and ending times of measurement intervals, run the measurements repeatedly, and then filter the results to remove periods when an interrupt intervened.
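The filtering step could look something like the following minimal C sketch. It assumes the interval durations have already been collected into an array by whatever timer is in use. It then applies one simple, hypothetical rule: any sample more than `factor` times the fastest sample is treated as interrupted and discarded. The function name `filtered_mean` and the rule itself are illustrative assumptions, not part of AIX.

```c
#include <stddef.h>

/*
 * Hypothetical filter (not an AIX interface): keep only samples that are
 * no more than `factor` times the fastest sample, on the assumption that
 * much slower samples were inflated by an interrupt. Returns the mean of
 * the kept samples and stores how many were kept in *kept.
 * Use factor >= 1.0 so that the fastest sample is always kept.
 */
double filtered_mean(const double *samples, size_t n,
                     double factor, size_t *kept)
{
    double min, sum = 0.0;
    size_t i, count = 0;

    *kept = 0;
    if (n == 0)
        return 0.0;

    /* find the fastest (least disturbed) sample */
    min = samples[0];
    for (i = 1; i < n; i++)
        if (samples[i] < min)
            min = samples[i];

    /* average only the samples that pass the reasonableness test */
    for (i = 0; i < n; i++) {
        if (samples[i] <= min * factor) {
            sum += samples[i];
            count++;
        }
    }

    *kept = count;
    return (count > 0) ? sum / (double)count : 0.0;
}
```

With samples of 100, 102, 98, 5000 and 101 ns and a factor of 1.5, the 5000 ns sample is discarded and the mean of the remaining four is 100.25 ns. The choice of threshold is a judgment call that depends on the operation being measured.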

The POWER and POWER2 architectures implement the processor timer as a pair of special-purpose registers. The PowerPC architecture defines a 64-bit register called the Time Base. These registers can only be accessed by assembler-language programs.

Attention: The time measured by the processor timer is absolute wall-clock time. If an interrupt occurs between accesses to the timer, the calculated duration includes the processing of the interrupt, and possibly the execution of other processes that are dispatched before control returns to the code being timed. Because the processor timer reports raw time, its readings should never be used without first being subjected to a reasonableness test.

In AIX Version 4, a pair of library subroutines, read_real_time and time_base_to_time, has been added to the system to make accessing these registers easier and architecture-independent. The read_real_time subroutine obtains the current time from the appropriate source and stores it as two 32-bit values. The time_base_to_time subroutine ensures that the time values are in seconds and nanoseconds, performing any necessary conversion from the Time Base format. The time-acquisition and time-conversion functions are separate so that acquiring a time stamp, which happens inside the measured interval, costs as little as possible.

The following example shows how these new subroutines could be used to measure the elapsed time for a specific piece of code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(void) {
   timebasestruct_t start, finish;
   int val = 3;
   int w1, w2;
   double elapsed;

   /* get the time before the operation begins */
   read_real_time(&start, TIMEBASE_SZ);

   /* begin code to be timed */
   printf("This is a sample line %d \n", val);
   /* end code to be timed   */

   /* get the time after the operation is complete */
   read_real_time(&finish, TIMEBASE_SZ);

   /* call the conversion routines unconditionally, to ensure    */
   /* that both values are in seconds and nanoseconds regardless */
   /* of the hardware platform.                                  */
   time_base_to_time(&start, TIMEBASE_SZ);
   time_base_to_time(&finish, TIMEBASE_SZ);

   /* subtract the starting time from the ending time */
   w1 = finish.tb_high - start.tb_high; /* probably zero */
   w2 = finish.tb_low - start.tb_low;

   /* if there was a carry from low-order to high-order during  */
   /* the measurement, we may have to undo it.                  */
   if (w2 < 0) {
      w1--;
      w2 += 1000000000;
   }

   /* convert the net elapsed time to floating point microseconds */
   elapsed = ((double) w2)/1000.0;
   if (w1 > 0)
      elapsed += ((double) w1)*1000000.0;

   printf("Time was %9.3f microseconds \n", elapsed);
   exit(0);
}

To minimize the overhead of calling and returning from the timer routines, the analyst may want to experiment with binding the benchmark nonshared (see Appendix G).

If this were a real performance benchmark, we would execute the code to be measured repeatedly. Timing a number of consecutive repetitions collectively yields an average time for the operation, but that average might include interrupt handling or other extraneous activity. Timing the repetitions individually lets us inspect each time for reasonableness, but then the overhead of the timing routines is included in every measurement. It may be desirable to use both techniques and compare the results. In any case, the analyst should consider the purpose of the measurements when choosing the method.
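The two techniques might be compared with a sketch like the following. Because read_real_time exists only on AIX, this version substitutes the POSIX clock_gettime call with CLOCK_MONOTONIC so that it can be tried on other systems; that substitution, the placeholder operation `work`, and the helper names are all assumptions made for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for clock_gettime */
#endif
#include <stdint.h>
#include <time.h>

/* nanoseconds from *a to *b */
int64_t ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000
           + (b->tv_nsec - a->tv_nsec);
}

static volatile unsigned long sink;   /* keeps the loop from being optimized away */

/* hypothetical stand-in for the code being measured */
void work(void)
{
    unsigned long i;
    for (i = 0; i < 1000; i++)
        sink += i;
}

/* Collective: one interval around all repetitions; returns mean ns per  */
/* repetition. Low timer overhead, but interrupts are averaged in.       */
double time_collective(int reps)
{
    struct timespec t0, t1;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < reps; i++)
        work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)ts_diff_ns(&t0, &t1) / reps;
}

/* Individual: one interval per repetition, so each sample can be checked */
/* for reasonableness, but each one also includes the timer overhead.     */
void time_individual(int64_t *samples, int reps)
{
    struct timespec t0, t1;
    int i;

    for (i = 0; i < reps; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        samples[i] = ts_diff_ns(&t0, &t1);
    }
}
```

On AIX Version 4 the clock_gettime calls could instead be replaced by read_real_time and time_base_to_time, as in the example earlier in this appendix. Comparing the collective mean with, say, the smallest individual sample gives a rough idea of how much extraneous activity the collective figure contains.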

POWER-Architecture-Unique Timer Access

Attention: The following discussion applies only to the POWER and POWER2 architectures (and the IBM 601 processor chip). The code examples will function correctly on a PowerPC system (that is, they will not fail), but some of the instructions will be simulated. Because the purpose of accessing the processor timer is to obtain high-precision times with low overhead, simulation makes the results much less useful.

The POWER and POWER2 processor architectures include two special-purpose registers (an upper register and a lower register) that contain a high-resolution timer. The upper register contains time in seconds, and the lower register contains a count of fractional seconds in nanoseconds. The actual precision of the time in the lower register depends on its update frequency, which is model-specific.
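Since the granularity of the lower register is model-specific, one rough way to estimate it is to take many consecutive readings and look for the smallest nonzero step between them. The sketch below assumes the readings (for example, from the rtc_lower routine shown later) have already been stored in an array; the helper names are hypothetical. It also shows that the lower register wraps from 999,999,999 back to 0 when the upper register advances, so differences must allow for that wrap.

```c
#include <stddef.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000u

/* Forward distance between two lower-register readings, allowing for */
/* one wrap from 999,999,999 back to 0 (a carry into the upper reg).  */
uint32_t rtcl_delta(uint32_t prev, uint32_t cur)
{
    return (cur >= prev) ? cur - prev : cur + NS_PER_SEC - prev;
}

/* Smallest nonzero step seen in a series of readings. This is only an  */
/* upper bound on the update granularity; returns 0 if nothing changed. */
uint32_t smallest_step(const uint32_t *readings, size_t n)
{
    uint32_t best = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        uint32_t d = rtcl_delta(readings[i - 1], readings[i]);
        if (d != 0 && (best == 0 || d < best))
            best = d;
    }
    return best;
}
```

For the hypothetical readings 999999744, 999999744, 0, 256 and 768, the steps are 0, 256, 256 and 512 ns, so the estimate is 256 ns.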

Assembler Routines to Access the POWER Timer Registers

The following assembler-language module (timer.s) provides routines (rtc_upper and rtc_lower) to access the upper and lower registers of the timer.

           .globl  .rtc_upper
.rtc_upper: mfspr   3,4         # copy RTCU to return register
            br
  
           .globl  .rtc_lower
.rtc_lower: mfspr   3,5         # copy RTCL to return register
            br

C Subroutine to Supply the Time in Seconds

The following module (second.c) contains a C routine that calls the timer.s routines to access the upper and lower register contents and returns a double-precision real value of time in seconds.

/* assembler routines from timer.s */
int rtc_upper(void);
int rtc_lower(void);

double second(void)
{
  int ts, tl, tu;

  ts = rtc_upper();     /* seconds                          */
  tl = rtc_lower();     /* nanoseconds                      */
  tu = rtc_upper();     /* Check for a carry from           */
  if (ts != tu)         /* the lower reg to the upper.      */
    tl = rtc_lower();   /* Recover from the race condition. */
  return ( tu + (double)tl/1000000000 );
}

The subroutine second can be called from either a C routine or a FORTRAN routine.
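The upper/lower/upper read sequence in second.c guards against the lower register wrapping into the upper register between the two reads. Because the real registers are not available everywhere, the sketch below replaces them with a hypothetical simulated timer (`sim_rtc_upper`, `sim_rtc_lower`) that advances a few nanoseconds on every read, so the race can be reproduced deliberately. Only the read sequence in `read_rtc_pair` mirrors second.c; everything else is an illustrative assumption.

```c
#include <stdint.h>

/* Hypothetical simulated timer standing in for RTCU/RTCL.         */
/* Every read advances simulated time by SIM_STEP_NS nanoseconds.  */
#define SIM_STEP_NS 5u

uint64_t sim_now_ns;   /* current simulated time, in nanoseconds */

int sim_rtc_upper(void)           /* stands in for rtc_upper() */
{
    int secs = (int)(sim_now_ns / 1000000000u);
    sim_now_ns += SIM_STEP_NS;
    return secs;
}

int sim_rtc_lower(void)           /* stands in for rtc_lower() */
{
    int nsecs = (int)(sim_now_ns % 1000000000u);
    sim_now_ns += SIM_STEP_NS;
    return nsecs;
}

/* Same upper/lower/upper sequence as second(), kept as integers. */
void read_rtc_pair(int *secs, int *nsecs)
{
    int ts = sim_rtc_upper();
    int tl = sim_rtc_lower();
    int tu = sim_rtc_upper();

    if (ts != tu)                 /* lower register wrapped between reads */
        tl = sim_rtc_lower();     /* re-read so tl belongs to second tu   */
    *secs = tu;
    *nsecs = tl;
}
```

Starting the simulated clock at 999,999,990 ns, the lower register wraps between the reads and the pair returned is (1 s, 5 ns). Without the re-read, tu would be combined with the stale 999,999,995 ns reading and the result would be almost a full second too large.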

Note: Depending on the length of time since the last system reset, second.c may yield a varying amount of precision. The longer the time since reset, the larger the number of bits of precision consumed by the (probably uninteresting) whole-seconds part of the number. The technique shown in the first part of this Appendix avoids this problem by performing the subtraction required to obtain an elapsed time before converting to floating point.
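The precision loss described in the note can be demonstrated with a short C sketch. The readings below are hypothetical: 100,000,000 seconds since reset (roughly three years) was chosen to make the effect obvious. At that magnitude adjacent double-precision values are about 15 ns apart, so converting each reading to floating point before subtracting erases a 1 ns difference, while subtracting in integer nanoseconds first preserves it. The function names are illustrative only.

```c
#include <stdint.h>

/* Convert each reading to floating-point seconds, then subtract     */
/* (the approach that using second() for both readings leads to).    */
double elapsed_convert_first(int s1, int ns1, int s2, int ns2)
{
    /* volatile forces each value to be rounded to double precision */
    volatile double t1 = s1 + (double)ns1 / 1.0e9;
    volatile double t2 = s2 + (double)ns2 / 1.0e9;
    return t2 - t1;
}

/* Subtract in exact integer nanoseconds, then convert the difference. */
double elapsed_subtract_first(int s1, int ns1, int s2, int ns2)
{
    int64_t dns = (int64_t)(s2 - s1) * 1000000000 + (ns2 - ns1);
    return (double)dns / 1.0e9;
}
```

For readings of (100000000 s, 1 ns) and (100000000 s, 2 ns), elapsed_convert_first returns 0.0, whereas elapsed_subtract_first returns about 1e-9 seconds.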

Accessing Timer Registers in PowerPC-Architecture Systems

The PowerPC processor architecture includes a 64-bit Time Base register, which is logically divided into 32-bit upper and lower fields (TBU and TBL). The Time Base register is incremented at a frequency that depends on the hardware and software implementation and may vary from time to time. Transforming Time Base values into seconds is therefore a more complex task than in the POWER architecture, whose registers already hold seconds and nanoseconds. We strongly recommend using the read_real_time and time_base_to_time interfaces to obtain time values in PowerPC systems.
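Purely to illustrate why the conversion is more involved, the following sketch combines TBU and TBL into one 64-bit tick count and converts it to seconds and nanoseconds. It assumes the tick frequency is known and constant over the interval, and that it is below about 18 GHz so the intermediate product cannot overflow; on a real system the frequency must come from the hardware or operating system, which is exactly what the library routines recommended above take care of. The names `tb_combine` and `tb_ticks_to_time` and the 25 MHz frequency used below are hypothetical.

```c
#include <stdint.h>

/* Combine the upper (TBU) and lower (TBL) halves into one 64-bit count. */
uint64_t tb_combine(uint32_t tbu, uint32_t tbl)
{
    return ((uint64_t)tbu << 32) | tbl;
}

/* Convert a tick count to seconds and nanoseconds at freq_hz ticks per  */
/* second. Assumes freq_hz is constant and below ~18 GHz, so that        */
/* rem * 1e9 (with rem < freq_hz) fits in 64 bits.                       */
void tb_ticks_to_time(uint64_t ticks, uint64_t freq_hz,
                      uint64_t *secs, uint64_t *nsecs)
{
    uint64_t rem;

    *secs = ticks / freq_hz;            /* whole seconds     */
    rem = ticks % freq_hz;              /* leftover ticks    */
    *nsecs = rem * 1000000000u / freq_hz;
}
```

At a hypothetical 25 MHz, a Time Base value of TBU = 0, TBL = 37500000 converts to 1 second and 500,000,000 nanoseconds.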

Example Use of the second Routine

An example (main.c) of a C program using the second subroutine is:

#include <stdio.h>
#include <stdlib.h>

double second(void);
void my_favorite_function(void);   /* the code to be timed */

int main(void)
{
      double t1,t2;

      t1 = second();
      my_favorite_function();
      t2 = second();

      printf("my_favorite_function time: %7.9f\n",t2 - t1);
      exit(0);
}

An example (main.f) of a FORTRAN program using the second subroutine is:

      double precision t1
      double precision t2
      double precision second
      external second

      t1 = second()
      call my_favorite_subroutine()
      t2 = second()
      write(6,11) (t2 - t1)
11    format(f20.12)
      end

To compile and link either main.c or main.f, use the following commands:

xlc -O3 -c second.c timer.s
xlf -O3 -o mainF main.f second.o timer.o
xlc -O3 -o mainC main.c second.o timer.o