The following information provides an overview of the Performance Monitor API library.
Read the following to learn more about programming the Performance Monitor API for threads.
This article describes the libpmapi library, which contains a set of Application Programming Interfaces designed to provide access to some of the counting facilities of the Performance Monitor feature included in selected IBM microprocessors of the PowerPC family. Those APIs include:
The APIs and the events available on each of the supported processors have been completely separated by design. The available events, which differ from processor to processor, are provided together with their descriptions and current testing status in separately installable tables. They are not described here because none of the API calls depend on the availability or status of any particular event.
The status of an event, as returned in bit flags by the API initialization routine pm_init, can be verified, unverified, or caveat (works with some caveat). See the Performance Monitor Accuracy Warning in the next section for important information about testing status and event accuracy.
An event filter, which is any combination of the status bits, must be passed to pm_init so that only events whose status matches the filter are returned. If no filter is passed to pm_init, no events are returned.
For each event, in addition to a testing status and a full description, pm_init also returns the identifier to be used in subsequent API calls, and a short and a long name. The short name is a mnemonic name of the form PM_MNEMONIC. Events that are the same on different processors have the same mnemonic name. For instance, PM_CYC and PM_INST_CMPL are, respectively, the number of processor cycles and the number of instructions completed, and should exist on most processors.
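The filtering rule can be pictured with a small C sketch. It models the filter as a bit mask ANDed against each event's single status bit, so a filter of 0 stands for "no filter" and selects nothing. The status bit values, the ev_entry structure and filter_events are all hypothetical stand-ins; the real constants (such as PM_VERIFIED) and the real event tables come from pmapi.h and pm_init.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status bits; the real values are defined in pmapi.h
   (PM_VERIFIED and friends) and may differ. */
enum ev_status {
    EV_VERIFIED   = 1u << 0,
    EV_UNVERIFIED = 1u << 1,
    EV_CAVEAT     = 1u << 2
};

/* Hypothetical event record, loosely mirroring what pm_init reports. */
struct ev_entry {
    int id;                /* identifier used in later API calls */
    const char *mnemonic;  /* short name, e.g. "PM_CYC" */
    unsigned status;       /* exactly one EV_* bit */
};

/* Store in out[] pointers to the events whose status bit is set in
   filter, and return how many were kept. A filter of 0 keeps none. */
size_t filter_events(const struct ev_entry *all, size_t n,
                     unsigned filter, const struct ev_entry **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (all[i].status & filter)
            out[kept++] = &all[i];
    return kept;
}
```

Passing EV_VERIFIED | EV_CAVEAT, for example, would return verified events plus those verified within documented limitations.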
Only events marked verified have gone through full verification. Events marked caveat have been verified within the limitations documented in the event description returned by pm_init.
Events marked unverified have undefined accuracy. Use caution with unverified events; the PM API is essentially providing a service to read hardware registers which may or may not have any meaningful content.
Users may experiment with unverified event counters and determine for themselves what, if any, use they may have for specific tuning situations.
Definitions
To provide Performance Monitor data access at various levels, support has been added to the Operating System for optional Performance Monitoring contexts. These contexts are an extension of the regular processor and thread contexts and include one 64-bit counter per hardware counter and a set of control words. The control words define which events are counted and when counting is on or off.
System level context and accumulation
For the system-level APIs, optional PM contexts can be associated with each processor. When installed, the PM kernel extension automatically handles 32-bit PM hardware counter overflows and maintains per-processor sets of 64-bit accumulation counters, one per 32-bit hardware PM counter.
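Folding a wrapping 32-bit hardware counter into a 64-bit accumulator can be sketched in C as follows. This is not the kernel extension's code, which may rely on overflow interrupts instead; it assumes the counter is sampled at least once per wrap, and the names pmc_accum and pmc_sample are hypothetical. The technique relies on unsigned subtraction being modulo 2^32.

```c
#include <stdint.h>

/* One software accumulator per 32-bit hardware counter (hypothetical). */
struct pmc_accum {
    uint64_t total;  /* 64-bit accumulated count */
    uint32_t last;   /* hardware value at the previous sample */
};

/* Add the events counted since the previous sample to the 64-bit total.
   Unsigned 32-bit subtraction wraps modulo 2^32, so a single overflow
   between samples is handled; more than one wrap would be lost. */
void pmc_sample(struct pmc_accum *a, uint32_t hw_now)
{
    uint32_t delta = hw_now - a->last;
    a->total += delta;
    a->last = hw_now;
}
```

For example, a previous reading of 0xFFFFFFF0 followed by a reading of 0x10 adds 0x20 to the total, even though the hardware value went down.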
Thread context
Optional PM contexts can also be associated with each kernel thread. The Operating System and the PM kernel extension automatically maintain a set of 64-bit counters for each of these contexts.
Thread group and process context
The concept of a thread group is optionally supported by the thread-level APIs. In addition to its own PM context, every thread in a group shares a group accumulation context with the other members. A group is defined as all the threads created by a common ancestor thread. By definition, all the threads in a thread group count the same set of events, and, with one exception described below, the group must be created before any of the descendant threads are created. This second restriction exists because, once descendant threads have been created, it is impossible to determine the list of threads with a common ancestor. The exception is the group made up of all the threads belonging to a process. Such a group can be created at any time, regardless of when the descendant threads were created, because the list of threads belonging to a process can be generated. Multiple groups can coexist within a process, but each thread can be a member of only one PM counting group. Since all the threads within a group must count the same events, creating a process group fails if any thread within the process already has a context.
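The process-group rule at the end of the paragraph above amounts to an all-or-nothing check, sketched below in C. The thr structure and create_process_group are hypothetical bookkeeping for illustration only; the real check is performed inside the PM APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread bookkeeping, for illustration only. */
struct thr {
    bool has_context;  /* thread already owns a PM context */
    int group_id;      /* 0 = not in any counting group */
};

/* Create a group containing every thread of the process. Because all
   members must count the same events, creation fails, and nothing is
   modified, if any thread already has a context. */
bool create_process_group(struct thr *threads, size_t n, int gid)
{
    for (size_t i = 0; i < n; i++)
        if (threads[i].has_context)
            return false;
    for (size_t i = 0; i < n; i++) {
        threads[i].has_context = true;
        threads[i].group_id = gid;
    }
    return true;
}
```

Checking every thread before modifying any of them keeps a failed creation from leaving the process partially grouped.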
PM state inheritance
The PM state is defined as the combination of the PM programming (the events being counted), the counting state (on or off), and the optional thread group membership. Each group also has its own counting state, which is inherited from the initial thread in the group when the group is created. For a thread that is a member of a group, the effective counting state is the logical AND of its own counting state and the group counting state. This provides a way to turn counting on and off for all threads in a group at once: manipulating the group counting state alone affects the effective counting state of every thread in the group. When a thread is created, it inherits its complete PM state from its parent. A thread's PM context data (the values of its 64-bit counters) is not inherited; newly created threads start with their counters set to zero.
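The two rules above, effective state as an AND and inheritance of state but not data, can be modeled in a few lines of C. The structures, NCTR, effective_counting and inherit_pm_state are hypothetical illustrations, not the layout of the actual PM context.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NCTR 8  /* hypothetical: maximum number of counters (2, 4 or 8) */

struct group_st { bool counting; };

/* Hypothetical thread PM context: state plus data. */
struct thread_st {
    int events[NCTR];         /* PM programming */
    bool counting;            /* thread's own counting state */
    struct group_st *group;   /* NULL if not in a counting group */
    uint64_t counters[NCTR];  /* PM context data */
};

/* Effective counting state: own state AND group state (if any). */
bool effective_counting(const struct thread_st *t)
{
    return t->counting && (t->group == NULL || t->group->counting);
}

/* A new thread inherits the complete PM state, but starts at zero. */
void inherit_pm_state(struct thread_st *child, const struct thread_st *parent)
{
    memcpy(child->events, parent->events, sizeof child->events);
    child->counting = parent->counting;
    child->group = parent->group;
    memset(child->counters, 0, sizeof child->counters);
}
```

Turning the group state off with a single store makes effective_counting false for every member, whatever each thread's own state is.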
PM context independence
The thread and thread group PM contexts are independent. This allows each thread or group of threads on a system to be programmed to count its own list of events. In other words, except when using the system-level API, there is no requirement that all threads count the same events, and a user can use a debugger to program threads or groups of threads to count different events.
When a thread is suspended (or re-dispatched), its 64-bit accumulation counters are updated. If the thread is a member of a group, the group accumulation counters are updated at the same time.
Similarly, when a thread stops counting or reads its PM data, its 64-bit accumulation counters are also updated by adding the current values of the PM hardware counters to them. Again, if the thread is a member of a group, the group accumulation counters are also updated, regardless of whether the read or stop call was made for the thread or for the thread group.
The group level accumulation data is kept consistent with the individual PM data for the thread members of the group, whenever possible. When a thread voluntarily leaves a group, i.e., deletes its PM context, its accumulated data is automatically subtracted from the group level accumulated data. Similarly, when a thread member in a group resets its own data, the data in question is subtracted from the group level accumulated data. Note that when a thread dies, no action is taken on the group accumulated data.
The only situation in which the group-level accumulated data is not consistent with the sum of the data of its members is when the group-level accumulated data has been reset and the group has more than one member. This situation is detected and marked by a bit returned when the group data is read.
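The bookkeeping described in the last few paragraphs can be sketched with a single-counter C model. It is an illustration under stated assumptions, not the kernel implementation: grp, mbr and the four functions are hypothetical, and reset_group applies the consistency rule exactly as stated above (flag cleared only when the group has more than one member).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-counter model of a counting group. */
struct grp {
    uint64_t accu;    /* group accumulation */
    int members;      /* threads that joined, including dead ones */
    bool consistent;  /* flag reported when group data is read */
};

struct mbr {
    uint64_t accu;    /* thread accumulation */
    struct grp *g;    /* owning group; must be non-NULL to accumulate */
};

/* Suspend, stop or read: add the hardware delta to both levels. */
void accumulate(struct mbr *t, uint64_t hw_delta)
{
    t->accu += hw_delta;
    t->g->accu += hw_delta;
}

/* Thread reset: subtract its data from the group to stay consistent. */
void reset_thread(struct mbr *t)
{
    t->g->accu -= t->accu;
    t->accu = 0;
}

/* Voluntary leave (context deleted): same subtraction, then detach.
   A thread that dies instead leaves the group data untouched. */
void leave_group(struct mbr *t)
{
    reset_thread(t);
    t->g = NULL;
}

/* Group reset: member data is untouched, so the flag is cleared
   when the group has more than one member. */
void reset_group(struct grp *g)
{
    g->accu = 0;
    if (g->members > 1)
        g->consistent = false;
}
```

Resetting one member and then reading the group gives the sum of the remaining members, which is the behavior the reset example at the end of this section relies on.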
System-level API calls are available only to the superuser, except when the process tree option is used. In that case, a locking mechanism prevents calls from being made by more than one process. This mechanism ensures ownership of the API and exclusive access by one process from the time the system-level contexts are created until they are deleted.
Turning on the process tree option results in counting for only the calling process and its descendants; the default is to count all activities on each processor.
Since the system-level APIs would report bogus data if thread contexts were in use, system-level API calls cannot be made at the same time as thread-level API calls. The allocation of the first thread context takes the system-level API lock, which is not released until the last thread context has been deallocated.
When using first-party calls, a thread is only allowed to modify its own PM context. The only exception to this rule is group-level calls, which affect the group context but can also affect other threads' contexts. For example, deleting a group deletes all the contexts associated with the group: the caller's context, the group context, and the contexts of all the threads in the group.
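The mutual exclusion between system-level and thread-level use behaves like a counted lock. The C model below is a hypothetical sketch of that behavior only: the thread side holds the lock while at least one thread context exists, and the system side can take it only when no thread contexts remain. None of these names belong to pmapi.

```c
#include <stdbool.h>

/* Hypothetical model of the system-level API lock. */
struct pm_lock {
    int  thread_ctx;    /* live thread contexts; lock held while > 0 */
    bool system_owned;  /* held by a system-level API user */
};

/* Allocating a thread context; the first one takes the lock. */
bool thread_ctx_alloc(struct pm_lock *l)
{
    if (l->system_owned)
        return false;
    l->thread_ctx++;
    return true;
}

/* Deallocating a thread context; the last one releases the lock. */
void thread_ctx_free(struct pm_lock *l)
{
    if (l->thread_ctx > 0)
        l->thread_ctx--;
}

/* System-level use requires that no thread contexts exist. */
bool system_api_acquire(struct pm_lock *l)
{
    if (l->system_owned || l->thread_ctx > 0)
        return false;
    l->system_owned = true;
    return true;
}

void system_api_release(struct pm_lock *l)
{
    l->system_owned = false;
}
```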
Access to a PM context not belonging to the calling thread or its group is only available from the target process's debugger. The third party API calls only succeed when the target process is being ptraced by the API caller, i.e., the caller is already attached to the target process, and the debuggee is stopped.
The fact that the debugger must already have been attached to the debugged thread before any third party call to the API can be made, ensures that the security level of the API will be the same as the one used between debuggers and debuggees.
Common rules
pm_init must be called before any other API call can be made, and only events returned by a given pm_init call with its associated filter setting can be used in subsequent pm_set_program calls. pm_init also returns the processor name, the number of counters available (2, 4 or 8), and the threshold multiplier. For each event returned, a thresholdable flag is also returned. This flag indicates whether an event can be used with a threshold. If so, then specifying a threshold defers counting until the threshold multiplied by the processor's threshold multiplier has been exceeded.
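The threshold rule is simple arithmetic: the counting threshold in effect is the requested threshold multiplied by the processor's threshold multiplier. The C sketch below shows only that calculation. The ev_desc structure, effective_threshold and past_threshold are hypothetical, and the multiplier used in the usage note is an arbitrary illustrative value, not a figure for any real processor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-event descriptor holding the flag pm_init reports. */
struct ev_desc {
    bool thresholdable;
};

/* Threshold in effect = requested threshold x processor multiplier.
   Returns 0 when the event cannot be used with a threshold. */
uint64_t effective_threshold(const struct ev_desc *e,
                             uint32_t threshold, uint32_t multiplier)
{
    if (!e->thresholdable)
        return 0;
    return (uint64_t)threshold * multiplier;  /* widen to avoid overflow */
}

/* Counting is deferred until the value exceeds the threshold in effect. */
bool past_threshold(uint64_t value, uint64_t eff_threshold)
{
    return value > eff_threshold;
}
```

With an assumed multiplier of 4, a requested threshold of 10 defers counting until 40 has been exceeded.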
PM contexts cannot be reprogrammed or reused at any time. This means that none of the APIs support more than one call to a pm_set_program interface without a call to a pm_delete_program interface. This also means that when creating a process group, none of the threads in the process is allowed to already have a context.
All the API calls return 0 when successful or a positive error code (which can be decoded using pm_error) otherwise.
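The one-shot programming rule and the return-code convention can be summarized as a small state machine. In this hypothetical C sketch, programming an already-programmed context fails with a positive code until the program is deleted; the CTX_* codes, pm_ctx and the ctx_* functions are illustrations only, and real error codes come from the PM API and are decoded with pm_error.

```c
#include <stdbool.h>

/* Hypothetical return codes: 0 on success, positive on error. */
enum { CTX_OK = 0, CTX_ERR_PROGRAMMED = 1, CTX_ERR_NOCTX = 2 };

struct pm_ctx {
    bool programmed;
    int events[8];
};

/* A context can be programmed once; reprogramming is refused. */
int ctx_set_program(struct pm_ctx *c, const int *events, int n)
{
    if (c->programmed)
        return CTX_ERR_PROGRAMMED;
    for (int i = 0; i < n && i < 8; i++)
        c->events[i] = events[i];
    c->programmed = true;
    return CTX_OK;
}

/* Deleting the program is the only way to allow a new one. */
int ctx_delete_program(struct pm_ctx *c)
{
    if (!c->programmed)
        return CTX_ERR_NOCTX;
    c->programmed = false;
    return CTX_OK;
}
```

This is the same delete-then-set sequence used by the debugger example below to change what a thread counts.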
Group information
The following information is associated with each thread group:
the number of threads that are members of the group. This includes deceased threads that were members of the group while running.
If the consistency flag is on, this is the number of threads that have contributed to the group-level data.
indicates that the group includes all the threads in the process.
indicates that the group PM data is consistent with the sum of the individual PM data for the thread members.
This information is returned by the pm_get_data_mygroup and pm_get_data_group interfaces in a pm_groupinfo_t structure.
Each of the seven sections below describes a system-wide API call that has variations for first-party kernel thread or group counting, and for third-party kernel thread or group counting. Variations are indicated by suffixes to the function names, such as pm_set_program, pm_set_program_mythread, pm_set_program_group, etc.
Example code is also shipped to the /usr/samples/pmapi directory.
Simple single-threaded program
```c
#include <stdlib.h>
#include <pmapi.h>

int main(void)
{
    pm_info_t pminfo;
    pm_prog_t prog;
    pm_data_t data;
    int filter = PM_VERIFIED;      /* use only verified events */
    int i, rc;

    if ((rc = pm_init(filter, &pminfo)) != 0) {
        pm_error("pm_init", rc);   /* decode the positive error code */
        exit(EXIT_FAILURE);
    }
    prog.mode.w = 0;               /* start with clean mode */
    prog.mode.b.user = 1;          /* count only user mode */
    for (i = 0; i < pminfo.maxpmcs; i++)
        prog.events[i] = COUNT_NOTHING;
    prog.events[0] = 1;            /* count event 1 in first counter */
    prog.events[1] = 2;            /* count event 2 in second counter */
    pm_set_program_mythread(&prog);
    pm_start_mythread();
    /* (1) ... useful work ... */
    pm_stop_mythread();
    pm_get_data_mythread(&data);
    /* ... print results ... */
    return 0;
}
```
Debugger example for previous program
To look at the PM data while the program is executing:
```c
/* from a debugger, stopped at breakpoint (1) */
pm_init(filter, &pminfo);
pm_get_program_thread(pid, tid, &prog);   /* (2) */
/* ... display PM programming ... */
pm_get_data_thread(pid, tid, &data);      /* (3) */
/* ... display PM data ... */
pm_delete_program_thread(pid, tid);
prog.events[0] = 2;   /* change counter 1 to count event number 2 */
pm_set_program_thread(pid, tid, &prog);
/* continue program */
```
The scenario above works equally well if the program being executed under the debugger does not contain any embedded PM API calls. The only differences are that the calls at (2) and (3) fail, and that when the program continues, it counts only event number 2 in counter 1, and nothing in the other counters.
Simple multithreaded example
A simple multithreaded example with independent threads counting the same set of events.
```c
#include <stddef.h>
#include <pthread.h>
#include <pmapi.h>

pm_data_t data2;

void *doit(void *arg)
{
    /* (1) */
    pm_start_mythread();
    /* ... useful work ... */
    pm_stop_mythread();
    pm_get_data_mythread(&data2);
    return NULL;
}

int main(void)
{
    pthread_t threadid;
    pthread_attr_t attr;
    void *status;

    /* ... same initialization as in previous example ... */
    pm_set_program_mythread(&prog);

    /* set up 1:1 mode (system contention scope); M:N is not supported by the APIs */
    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    pthread_create(&threadid, &attr, doit, NULL);

    /* (2) */
    pm_start_mythread();
    /* ... useful work ... */
    pm_stop_mythread();
    pm_get_data_mythread(&data);
    /* ... print main thread results (data) ... */
    pthread_join(threadid, &status);
    /* ... print auxiliary thread results (data2) ... */
    return 0;
}
```
Counting starts at (1) and (2) for the auxiliary and main threads respectively, because the initial counting state was off and the auxiliary thread inherited it from its creator.
Example with two threads in a counting group. The body of the auxiliary thread's initialization routine is the same as in the previous example.
```c
int main(void)
{
    /* ... same initialization as in previous example ... */
    pm_set_program_mygroup(&prog);   /* create counting group */
    pm_start_mygroup();              /* (1) */
    pthread_create(&threadid, &attr, doit, NULL);
    pm_start_mythread();             /* (2) */
    /* ... useful work ... */
    pm_stop_mythread();
    pm_get_data_mythread(&data);
    /* ... print main thread results ... */
    pthread_join(threadid, &status);
    /* ... print auxiliary thread results ... */
    pm_get_data_mygroup(&data);
    /* ... print group results ... */
    return 0;
}
```
The call at (2) is necessary because the call at (1) only turns on counting for the group, not for the individual threads in it. At the end, the group results are the sum of both threads' results.
This example with a reset call illustrates the impact on the group data. The body of the auxiliary thread is the same as before, except for the pm_start_mythread() call which is not necessary in this case.
```c
int main(void)
{
    /* ... same initialization as in previous example ... */
    prog.mode.b.count = 1;           /* start counting immediately */
    pm_set_program_mygroup(&prog);
    /* attr: system contention scope (1:1), as in the multithreaded example */
    pthread_create(&threadid, &attr, doit, NULL);
    /* ... useful work ... */
    pm_stop_mythread();
    pm_reset_data_mythread();
    pthread_join(threadid, &status);
    /* ... print auxiliary thread results ... */
    pm_get_data_mygroup(&data);
    /* ... print group results ... */
    return 0;
}
```
The main thread and the group counting state are both on before the auxiliary thread is created so the auxiliary thread will inherit that state and start counting immediately.
At the end, data2 is equal to data because pm_reset_data_mythread automatically subtracted the main thread data from the group data to keep it consistent. In fact, at all times the group data is equal to the sum of the auxiliary and main thread data; in this case, the main thread data is simply zero.