
Performance Tools Guide and Reference

Performance Monitor API Programming

The libpmapi library contains a set of application programming interfaces (APIs) that are designed to provide access to some of the counting facilities of the Performance Monitor feature included in selected IBM microprocessors.

Note
The APIs and the events available on each of the supported processors have been completely separated by design. The events available, their descriptions, and their current testing status (which are different on each processor) are in separately installable tables, and are not described here because none of the API calls depend on the availability or status of any of the events.

The status of an event, as returned by the pm_init API initialization routine, can be verified, unverified, or works with some caveat (see Performance Monitor Accuracy for details on testing status and event accuracy).

An event filter (which is any combination of the status bits) must be passed to the pm_init routine to force the return of events with status matching the filter. If no filter is passed to the pm_init routine, no events will be returned.
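
For example, a filter can combine several status bits before the pm_init call. The following minimal sketch uses PM_VERIFIED, the constant that appears in the examples later in this section; PM_CAVEAT is an assumed name for the caveat status bit and should be checked against pmapi.h.

# include <pmapi.h>

main()
{
       pm_info_t pminfo;
       /* accept verified events and events verified with caveats;
          PM_CAVEAT is an assumed constant name (check pmapi.h) */
       int filter = PM_VERIFIED | PM_CAVEAT;

       /* a non-zero return indicates an error */
       if (pm_init(filter, &pminfo) != 0)
               ... handle error ...
}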

The following topics discuss programming the Performance Monitor API:

Performance Monitor Accuracy

Only events marked verified have gone through full verification. Events marked caveat have been verified within the limitations documented in the event description returned by the pm_init routine.

Events marked unverified have undefined accuracy. Use caution with unverified events. The Performance Monitor API is essentially providing a service to read hardware registers that may not have any meaningful content.

Users may experiment with unverified event counters and determine for themselves if they can be used for specific tuning situations.

Performance Monitor Context and State

To provide Performance Monitor data access at various levels, the AIX operating system supports optional performance monitoring contexts. These contexts are an extension to the regular processor and thread contexts and include one 64-bit counter per hardware counter and a set of control words. The control words define which events are counted and when counting is on or off.

System-Level Context and Accumulation

For the system-level APIs, optional Performance Monitor contexts can be associated with each of the processors. When installed, the Performance Monitor kernel extension automatically handles 32-bit Performance Monitor hardware counter overflows. It also maintains per-processor sets of 64-bit accumulation counters (one per 32-bit hardware Performance Monitor counter).

Thread Context

Optional Performance Monitor contexts can also be associated with each kernel thread. The AIX operating system and the Performance Monitor kernel extension automatically maintain sets of 64-bit counters for each of these contexts.

Thread Counting-Group and Process Context

The concept of thread counting-group is optionally supported by the thread-level APIs. All the threads within a group, in addition to their own Performance Monitor context, share a group accumulation context. A thread group is defined as all the threads created by a common ancestor thread. By definition, all the threads in a thread group count the same set of events, and, with one exception described below, the group must be created before any of the descendant threads are created. This restriction is due to the fact that, after descendant threads are created, it is impossible to determine a list of threads with a common ancestor.

One special case of a group is the collection of all the threads belonging to a process. Such a group can be created at any time regardless of when the descendant threads are created, because a list of threads belonging to a process can be generated. Multiple groups can coexist within a process, but each thread can be a member of only one Performance Monitor counting-group. Because all the threads within a group must be counting the same events, a process group creation will fail if any thread within the process already has a context.
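
A process-level group might be created as in the following minimal sketch. The mode bit name b.process is an assumption (the pm_set_program description later in this section states only that the mode selects the type of counting group), and pm_set_program_mygroup follows the naming convention described under Eight Basic API Calls.

       pm_prog_t prog;

       prog.mode.w          = 0;   /* start with clean mode */
       prog.mode.b.user     = 1;   /* count user mode only */
       prog.mode.b.process  = 1;   /* assumed bit: group spans all threads
                                      in the process */
       ... fill prog.events[] as in the examples below ...

       /* fails if any thread in the process already has a PM context */
       pm_set_program_mygroup(&prog);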

Performance Monitor State Inheritance

The PM state is defined as the combination of the Performance Monitor programming (the events being counted), the counting state (on or off), and the optional thread group membership. A counting state is associated with each group. When the group is created, its counting state is inherited from the initial thread in the group. For thread members of a group, the effective counting state is the result of AND-ing their own counting state with the group counting state. This provides a way to control the counting state for all threads in a group: simply manipulating the group counting state affects the effective counting state of all the threads in the group. Threads inherit their complete Performance Monitor state from their parents when the thread is created. A thread's Performance Monitor context data (the value of the 64-bit counters) is not inherited; that is, newly created threads start with their counters set to zero.
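
Because a member thread's effective counting state is the AND of its own state and the group state, turning the group state off pauses counting for every member thread at once, as in the following sketch (pm_stop_mygroup and pm_start_mygroup follow the group naming convention described later in this section):

       pm_stop_mygroup();     /* group state off: effective counting stops
                                 for every thread in the group */

       ... region that should not be counted ...

       pm_start_mygroup();    /* group state back on: threads whose own
                                 counting state is on resume counting */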

Thread Accumulation and Thread Group Accumulation

When a thread gets suspended (or redispatched), its 64-bit accumulation counters are updated. If the thread is a member of a group, the group accumulation counters are updated at the same time.

Similarly, when a thread stops counting or reads its Performance Monitor data, its 64-bit accumulation counters are also updated by adding the current value of the Performance Monitor hardware counters to them. Again, if the thread is a member of a group, the group accumulation counters are also updated, regardless of whether the counter read or stop was for the thread or for the thread group.

The group-level accumulation data is kept consistent with the individual Performance Monitor data for the thread members of the group, whenever possible. When a thread voluntarily leaves a group, that is, deletes its Performance Monitor context, its accumulated data is automatically subtracted from the group-level accumulated data. Similarly, when a thread member in a group resets its own data, the data in question is subtracted from the group-level accumulated data. When a thread dies, no action is taken on the group-accumulated data.

The only situation where the group-level accumulation is not consistent with the sum of the data for each of its members is when the group-level accumulated data has been reset, and the group has more than one member. This situation is detected and marked by a bit returned when the group data is read.

Security Considerations

The system-level API calls are available only to the root user, except when the process tree option is used. In that case, a locking mechanism prevents calls from being made by more than one process. This mechanism ensures ownership of the API and exclusive access by one process from the time that the system-level contexts are created until they are deleted.

Enabling the process tree option results in counting for only the calling process and its descendants; the default is to count all activities on each processor.

Because the system-level APIs would report bogus data if thread contexts were in use, system-level API calls are not allowed at the same time as thread-level API calls. The allocation of the first thread context takes the system-level API lock, which is not released until the last thread context has been deallocated.

When using first party calls, a thread is only allowed to modify its own Performance Monitor context. The only exception to this rule is when making group-level calls, which obviously affect the group context, but can also affect other threads' contexts. Deleting a group deletes all the contexts associated with the group, that is, the caller's context, the group context, and the contexts of all the threads in the group.

Access to a Performance Monitor context not belonging to the calling thread or its group is available only from the target process's debugger program. The third party API calls only succeed when the target process is being ptraced by the API caller, that is, the caller is already attached to the target process, and the target process is stopped.

The fact that the debugger program must already be attached to the debugged thread before any third party call to the API can be made ensures that the security level of the API is the same as the one used between debugger programs and the processes being debugged.

Common Rules

The following rules are common to the Performance Monitor APIs:

The pm_init API Initialization Routine

The pm_init routine returns (in a structure of type pm_info_t pointed to by its second parameter) the processor name, the number of counters available, the list of available events for each counter, and the threshold multipliers supported. Some processors support two threshold multipliers; others support none, meaning that thresholding is not supported at all.

For each event returned, in addition to the testing status, the pm_init routine also returns the identifier to be used in subsequent API calls, a short name, and a long name. The short name is a mnemonic name in the form PM_MNEMONIC. Events that are the same on different processors will have the same mnemonic name. For instance, PM_CYC and PM_INST_CMPL are, respectively, the number of processor cycles and the number of instructions completed, and should exist on all processors. For each event returned, a thresholdable flag is also returned. This flag indicates whether an event can be used with a threshold. If so, then specifying a threshold defers counting until a number of cycles equal to the threshold multiplied by the processor's selected threshold multiplier has been exceeded.
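
For illustration, the following sketch scans the event list for the first counter looking for PM_CYC by its short name and uses the returned identifier in a counting program. The pm_info_t and pm_events_t field names used here (maxevents, list_events, short_name, and event_id) are assumptions based on the description above and must be checked against the definitions in pmapi.h.

# include <pmapi.h>
# include <string.h>

main()
{
       pm_info_t pminfo;
       pm_prog_t prog;
       int i;
       int filter = PM_VERIFIED;

       pm_init(filter, &pminfo);

       /* maxevents, list_events, short_name, and event_id are assumed
          field names; check pmapi.h for the exact structure layout */
       for (i = 0; i < pminfo.maxevents[0]; i++)
               if (strcmp(pminfo.list_events[0][i].short_name, "PM_CYC") == 0)
                       prog.events[0] = pminfo.list_events[0][i].event_id;
       .....
}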

Beginning with AIX level 5.1.0.15, the Performance Monitoring API enables the specification of event groups instead of individual events. Event groups are predefined sets of events. Rather than each event being individually specified, a single group ID is specified. The interface to the pm_init routine has been enhanced to return the list of supported event groups in a structure of type pm_groups_info_t pointed to by a new optional third parameter. To preserve binary compatibility, the third parameter must be explicitly announced by OR-ing the PM_GET_GROUPS bitflag into the filter. Some events on some platforms can only be used from within a group. This is indicated in the threshold flag associated with each event returned. The following convention is used:

y A thresholdable event
g An event that can only be used in a group
G A thresholdable event that can only be used in a group
n A non-thresholdable event that is usable individually

On some platforms, use of event groups is required because all the events are marked g or G. Each of the event groups that are returned includes a short name, a long name, and a description similar to those associated with events, as well as a group identifier to be used in subsequent API calls and the events contained in the group (in the form of an array of event identifiers).

The testing status of a group is defined as the lowest common denominator among the testing status of the events that it includes. If at least one event has a testing status of caveat, the group testing status is at best caveat, and if at least one event has a status of unverified, then the group status is unverified. This is not returned as a group characteristic, but it is taken into account by the filter. Like events, only groups with status matching the filter are returned.

Eight Basic API Calls

Each of the eight sections below describes a system-wide API call that has variations for first-party kernel thread or group counting, and third-party kernel thread or group counting. Variations are indicated by suffixes to the function call names, such as pm_set_program, pm_set_program_mythread, and pm_set_program_group.

pm_set_program
Sets the counting configuration. Use this call to specify the events (as a list of event identifiers, one per counter, or as a single event-group identifier) to be counted, and a mode in which to count. The list of events to choose from is returned by the pm_init routine. If the list includes a thresholdable event, you can also use this call to specify a threshold, and a threshold multiplier.

The mode in which to count can include user-mode and kernel-mode counting, and whether to start counting immediately. For the system-wide API call, the mode also includes whether to turn counting on only for a process and its descendants or for the whole system. For counting group API calls, the mode includes the type of counting group to create, that is, a group containing the initial thread and its future descendants, or a process-level group, which includes all the threads in a process. (A system-wide usage sketch follows this list of calls.)

pm_get_program
Retrieves the current Performance Monitor settings. This includes mode information and the list of events (or the event group) being counted. If the list includes a thresholdable event, this call also returns a threshold and the multiplier used.
pm_delete_program
Deletes the Performance Monitor configuration. Use this call to undo pm_set_program.
pm_start
Starts Performance Monitor counting.
pm_stop
Stops Performance Monitor counting.
pm_get_data
Returns Performance Monitor counting data. The data is a set of 64-bit values, one per hardware counter. For the counting group API calls, the group information is also returned. (See Thread Counting-Group Information.)

The pm_get_data_cpu interface returns the Performance Monitor counting data for a single processor.

pm_get_tdata
Same as pm_get_data, but includes a timestamp that indicates the last time that the hardware Performance Monitoring counters were read. This is a timebase value that can be converted to time by using time_base_to_time.

The pm_get_tdata_cpu interface returns the Performance Monitor counting data for a single processor, accompanied by a timestamp.

pm_reset_data
Resets Performance Monitor counting data. All values are set to 0.
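
The following sketch ties these calls together for system-wide counting. The mode bits b.kernel and b.proctree are assumed names for kernel-mode counting and the process tree option (only b.user, b.count, and b.is_group appear in the examples later in this section); check pmapi.h before relying on them.

# include <pmapi.h>

main()
{
       pm_info_t pminfo;
       pm_prog_t prog;
       pm_data_t data;
       int i;
       int filter = PM_VERIFIED;

       pm_init(filter, &pminfo);

       prog.mode.w          = 0;   /* start with clean mode */
       prog.mode.b.user     = 1;   /* count user mode */
       prog.mode.b.kernel   = 1;   /* assumed bit: also count kernel mode */
       prog.mode.b.proctree = 1;   /* assumed bit: count only this process
                                      and its descendants */

       for (i = 0; i < pminfo.maxpmcs; i++)
               prog.events[i] = COUNT_NOTHING;
       prog.events[0] = 1;         /* count event 1 in the first counter */

       pm_set_program(&prog);      /* system-wide configuration (root only,
                                      unless the process tree option is used) */
       pm_start();

       ... workload ...

       pm_stop();
       pm_get_data(&data);         /* one 64-bit value per hardware counter */

       ... print results ...

       pm_delete_program();        /* undo pm_set_program */
}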

Thread Counting-Group Information

The following information is associated with each thread counting-group:

member count
The number of threads that are members of the group. This includes deceased threads that were members of the group when running.

If the consistency flag is on, the count will be the number of threads that have contributed to the group-level data.

process flag
Indicates that the group includes all the threads in the process.
consistency flag
Indicates that the group PM data is consistent with the sum of the individual PM data for the thread members.

This information is returned by the pm_get_data_mygroup and pm_get_data_group interfaces in a pm_groupinfo_t structure.

Examples

The following are examples of using Performance Monitor APIs in pseudo-code. Functional sample code is available in the /usr/samples/pmapi directory.

Simple Single-Threaded Program:

# include <pmapi.h>
main()
{
       pm_info_t pminfo;
       pm_prog_t prog;
       pm_data_t data;
       int filter = PM_VERIFIED; /* use only verified events */
       int i;

       pm_init(filter, &pminfo);

       prog.mode.w       = 0;  /* start with clean mode */
       prog.mode.b.user  = 1;  /* count only user mode */

       for (i = 0; i < pminfo.maxpmcs; i++)
                prog.events[i] = COUNT_NOTHING;

       prog.events[0]    = 1;  /* count event 1 in first counter */
       prog.events[1]    = 2;  /* count event 2 in second counter */

       pm_set_program_mythread(&prog);
       pm_start_mythread();

(1)    ... useful work ....

       pm_stop_mythread();
       pm_get_data_mythread(&data);

       ... print results ...
}

Initialization Example Using an Event Group:

# include <pmapi.h>
main()
{
       pm_info_t        pminfo;
       pm_prog_t        prog; 
       pm_groups_info_t pmginfo; 

       int filter = PM_VERIFIED|PM_GET_GROUPS;  /* get list of verified events and groups */
       int i;
 
       pm_init(filter, &pminfo, &pmginfo);
 
       prog.mode.w           = 0;  /* start with clean mode */
       prog.mode.b.user      = 1;  /* count only user mode */ 
       prog.mode.b.is_group  = 1;  /* specify event group */
 
       for (i = 0; i < pminfo.maxpmcs; i++)
                prog.events[i] = COUNT_NOTHING;
 
       prog.events[0]    = 1;  /* count events in group 1 */ 
       ..... 
} 

Debugger Program Example for Initialization Program:

The following illustrates how to look at the Performance Monitor data while the program is executing:

from a debugger at breakpoint (1)

       pm_init(filter);
(2)    pm_get_program_thread(pid, tid, &prog);
       ... display PM programming ...

(3)    pm_get_data_thread(pid, tid);
       ... display PM data ...

       pm_delete_program_thread(pid, tid);
       prog.events[0] = 2; /* change counter 1 to count event number 2 */
       pm_set_program_thread(pid, tid, &prog);

continue program

The preceding scenario would also work if the program being executed under the debugger did not have any embedded Performance Monitor API calls. The only differences would be that the calls at (2) and (3) would fail, and that when the program continues, it would count only event number 2 in counter 1, and nothing in the other counters.

Simple Multi-Threaded Example:

The following is a simple multi-threaded example with independent threads counting the same set of events.

# include <pmapi.h>
pm_data_t data2;

void *
doit(void *arg)
{

(1)    pm_start_mythread();

       ... useful work ....

       pm_stop_mythread();
       pm_get_data_mythread(&data2);
}

main()
{
       pthread_t threadid;
       pthread_attr_t attr;
       void *status;

       ... same initialization as in previous example ...

       pm_set_program_mythread(&prog);

       /* setup 1:1 mode, M:N not supported by APIs */
       pthread_attr_init(&attr);
       pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
       pthread_create(&threadid, &attr, doit, NULL);

(2)    pm_start_mythread();

       ... useful work ....

       pm_stop_mythread();
       pm_get_data_mythread(&data);

       ... print main thread results (data) ...

       pthread_join(threadid, &status);

       ... print auxiliary thread results (data2) ...
}

In the preceding example, counting starts at (1) and (2) for the main and auxiliary threads respectively because the initial counting state was off and it was inherited by the auxiliary thread from its creator.

Simple Thread Counting-Group Example:

The following example has two threads in a counting-group. The body of the auxiliary thread's initialization routine is the same as in the previous example.

main()
{
        ... same initialization as in previous example ...

        pm_set_program_mygroup(&prog); /* create counting group */
(1)     pm_start_mygroup();

        pthread_create(&threadid, &attr, doit, NULL);

(2)     pm_start_mythread(); 

        ... useful work ....

        pm_stop_mythread();
        pm_get_data_mythread(&data);


        ... print main thread results ...

        pthread_join(threadid, &status);

        ... print auxiliary thread results ...

        pm_get_data_mygroup(&data);


        ... print group results ...
}

In the preceding example, the call in (2) is necessary because the call in (1) only turns on counting for the group, not the individual threads in it. At the end, the group results are the sum of both threads results.

Thread Counting Example with Reset:

The following example with a reset call illustrates the impact on the group data. The body of the auxiliary thread is the same as before, except for the pm_start_mythread call, which is not necessary in this case.

main()
{
        ... same initialization as in previous example...

        prog.mode.b.count  = 1;  /* start counting immediately */
        pm_set_program_mygroup(&prog);

        pthread_create(&threadid, &attr, doit, NULL);

        ... useful work ....

        pm_stop_mythread();
        pm_reset_data_mythread();

        pthread_join(threadid, &status);

        ...print auxiliary thread results...

        pm_get_data_mygroup(&data);


        ...print group results...
}

In the preceding example, the main thread and the group counting state are both on before the auxiliary thread is created, so the auxiliary thread will inherit that state and start counting immediately.

At the end, the group data (data) is equal to the auxiliary thread data (data2), because pm_reset_data_mythread automatically subtracted the main thread data from the group data to keep it consistent. In fact, the group data remains equal to the sum of the auxiliary and main thread data, but in this case the main thread data is zero.
