Kernel Extensions and Device Support Programming Concepts

Debug and Performance Tracing

The trace facility is useful for observing a running device driver and system. The trace facility captures a sequential flow of time-stamped system events, providing a fine level of detail on system activity. Events are shown in time sequence and in the context of other events. The trace facility is useful in expanding the trace event information to understand who, when, how, and even why the event happened.

Introduction

The operating system is shipped with permanent trace event points. These events provide general visibility to system execution. You can extend the visibility into applications by inserting additional events and providing formatting rules.

The collection of trace data was designed so that system performance and flow would be minimally altered by activating trace. Because of this, the facility is extremely useful as a performance analysis tool and as a problem determination tool.

The trace facility is more flexible than traditional system monitor services that access and present statistics maintained by the system. With traditional monitor services, data reduction (conversion of system events to statistics) is largely coupled to the system instrumentation. For example, the system can maintain the minimum, maximum, and average elapsed time observed for runs of a task and permit this information to be extracted.

The trace facility does not strongly couple data reduction to instrumentation but provides a stream of system events. It is not required to presuppose what statistics are needed. The statistics or data reduction are to a large degree separated from the instrumentation.

You can choose to develop the minimum, maximum, and average time for task A from the flow of events. But it is also possible to extract the average time for task A when called by process B, extract the average time for task A when conditions XYZ are met, develop a standard deviation for task A, or even decide that some other task, recognized by a stream of events, is more meaningful to summarize. This flexibility is invaluable for diagnosing performance or functional problems.

The trace facility generates large volumes of data. This data cannot be captured for extended periods of time without overflowing the storage device. This allows two practical ways that the trace facility can be used natively.

First, the trace facility can be triggered in multiple ways to capture small increments of system activity. It is practical to capture seconds to minutes of system activity in this way for post-processing. This is sufficient time to characterize major application transactions or interesting sections of a long task.

Second, the trace facility can be configured to direct the event stream to standard output. This allows a real-time process to connect to the event stream and provide data reduction in real-time, thereby creating a long term monitoring capability. A logical extension for specialized instrumentation is to direct the data stream to an auxiliary device that can either store massive amounts of data or provide dynamic data reduction.

You can start the system trace from:

The command line
SMIT
Software

The trace facility causes predefined events to be written to a trace log. The tracing action is then stopped.

Tracing from a command line is discussed in Controlling trace. Tracing from a software application is discussed and an example is presented in Examples of Coding Events and Formatting Events.

After a trace is started and stopped, you must format it before viewing it.

To format the trace events that you have defined, you must provide a stanza that describes how the trace formatter is to interpret the data that has been collected. This is described in Syntax for Stanzas in the trace Format File.

The trcrpt command provides a general purpose report facility. The report facility provides little data reduction, but converts the raw binary event stream to a readable ASCII listing of the event stream. Data can be visually extracted by a reader, or tools can be developed to further reduce the data.

For an event to be traced, you must write an event hook (sometimes called a trace hook) into the code that you want to trace. Tracing can be done on either the system channel (channel 0) or on a generic channel (channels 1-7). All preshipped trace points are output to the system channel.

Usually, when you want to show interaction with other system routines, use the system channel. The generic channels are provided so that you can control how much data is written to the trace log. Only your data is written to one of the generic channels.

For more information on trace hooks, see Macros for Recording trace Events.

Using the trace Facility

The following sections describe the use of the trace facility.

Configuring and Starting trace Data Collection

The trace command configures the trace facility and starts data collection. You can start trace from the command line or with a trcstart subroutine call. The trcstart subroutine is in the librts.a library. The syntax of the trcstart subroutine is:

int trcstart(char *args)

where args is simply the options list desired that you would enter using the trace command if starting a system trace (channel 0). If starting a generic trace, include a -g option in the args string. On successful completion, trcstart returns the channel ID. For generic tracing this channel ID can be used to record to the private generic channel.

For an example of the trcstart routine, see Examples of Coding Events and Formatting Events.

When compiling a program using this subroutine, you must request the link to the librts.a library. Use -l rts as a compile option.

Controlling trace

Basic controls for the trace facility exist as trace subcommands, standalone commands, and subroutines.

If you configure the trace routine to run asynchronously (the -a option), you can control the trace facility with the following commands:

trcon	Turns collection of trace data on.
trcoff	Turns collection of trace data off.
trcstop	Stops collection of trace data (like trcoff) and terminates the trace routine.

Producing a trace Report

The trace report facility formats and displays the collected event stream in readable form. This report facility displays text and data for each event according to rules provided in the trace format file. The default trace format file is /etc/trcfmt and contains a stanza for each event ID. The stanza for the event provides the report facility with formatting rules for that event. This technique allows you to add your own events to programs and insert corresponding event stanzas in the format file to have their new events formatted.

This report facility does not attempt to extract summary statistics (such as CPU utilization and disk utilization) from the event stream. This can be done in several other ways. To create simple summaries, consider using awk scripts to process the output obtained from the trcrpt command.

Defining trace Events

The operating system is shipped with predefined trace hooks (events). You need only activate trace to capture the flow of events from the operating system. You might want to define trace events in your code during development for tuning purposes. This provides insight into how the program is interacting with the system. The following sections provide the information that is required to do this.

Possible Forms of a trace Event Record

A trace event can take several forms. An event consists of the following:

Hookword
Data words (optional)
A TID, or thread identifier
Timestamp (optional)

An event record should be as short as possible. Many system events use only the hookword and timestamp. There is another event type you should seldom use because it is less efficient. It is a long format that allows you to record a variable length data. In this long form, the 16-bit data field of the hookword is converted to a length field that describes the length of the event record.

Macros for Recording trace Events

The following macros should always be used to generate trace data. Do not call the tracing functions directly. There is a macro to record each possible type of event record. The macros are defined in the sys/trcmacros.h header file. The event IDs are defined in the sys/trchkid.h header file. Include these two header files in any program that is recording trace events.

The macros to record system (channel 0) events with a time stamp are:

TRCHKL0T (hw)
TRCHKL1T (hw,D1)
TRCHKL2T (hw,D1,D2)
TRCHKL3T (hw,D1,D2,D3)
TRCHKL4T (hw,D1,D2,D3,D4)
TRCHKL5T (hw,D1,D2,D3,D4,D5)

Similarly, to record non-time stamped system events (channel 0), use the following macros:

TRCHKL0 (hw)
TRCHKL1 (hw,D1)
TRCHKL2 (hw,D1,D2)
TRCHKL3 (hw,D1,D2,D3)
TRCHKL4 (hw,D1,D2,D3,D4)
TRCHKL5 (hw,D1,D2,D3,D4,D5)

There are only two macros to record events to one of the generic channels (channels 1-7). These are:

TRCGEN (ch,hw,d1,len,buf)
TRCGENT (ch,hw,d1,len,buf)

These macros record a hookword (hw), a data word (d1) and a length of data (len) specified in bytes from the user's data segment at the location specified (buf) to the event stream specified by the channel (ch).

Use of Event IDs (hookids)

Event IDs are 12 bits (or 3-digit hexadecimal), for a possibility of 4096 IDs. Event IDs that are permanently left in and shipped with code need to be permanently assigned. Permanently assigned event IDs are defined in the sys/trchkid.h header file.

To allow you to define events in your environments or during development, a range of event IDs exist for temporary use. The range of event IDs for temporary use is hex 010 through hex 0FF. No permanent (shipped) events are assigned in this range. You can freely use this range of IDs in your own environment. If you do use IDs in this range, do not let the code leave your environment.

Permanent events must have event IDs assigned by the current owner of the trace component. To obtain a trace event id, send a note with a subject of help to aixras@austin.ibm.com.

You should conserve event IDs because they are limited. Event IDs can be extended by the data field. The only reason to have a unique ID is that an ID is the level at which collection and report filtering is available in the trace facility. An ID can be collected or not collected by the trace collection process and reported or not reported by the trace report facility. Whole applications can be instrumented using only one event ID. The only restriction is that the granularity on choosing visibility is to choose whether events for that application are visible.

A new event can be formatted by the trace report facility (trcrpt command) if you create a stanza for the event in the trace format file. The trace format file is an editable ASCII file. The syntax for a format stanzas is shown in Syntax for Stanzas in the trace Format File. All permanently assigned event IDs should have an appropriate stanza in the default trace format file shipped with the base operating system.

Suggested Locations and Data for Permanent Events

The intent of permanent events is to give an adequate level of visibility to determine execution, and data flow and have an adequate accounting for how CPU time is being consumed. During code development, it can be desirable to make very detailed use of trace for a component. For example, you can choose to trace the entry and exit of every subroutine in order to understand and tune path length. However, this would generally be an excessive level of instrumentation to ship for a component.

Consult a performance analyst for decisions regarding what events and data to capture as permanent events for a new component. The following paragraphs provide some guidelines for these decisions.

Events should capture execution flow and data flow between major components or major sections of a component. For example, there are existing events that capture the interface between the virtual memory manager and the logical volume manager. If work is being queued, data that identifies the queued item (a handle) should be recorded with the event. When a queue element is being processed, the "dequeue" event should provide this identifier as data also, so that the queue element being serviced is identified.

Data or requests that are identified by different handles at different levels of the system should have events and data that allow them to be uniquely identified at any level. For example, a read request to the physical file system is identified by a file descriptor and a current offset in the file. To a virtual memory manager, the same request is identified by a segment ID and a virtual page address. At the disk device driver level, this request is identified as a pointer to a structure that contains pertinent data for the request.

The file descriptor or segment information is not available at the device driver level. Events must provide the necessary data to link these identifiers so that, for example, when a disk interrupt occurs for incoming data, the identifier at that level (which can simply be the buffer address for where the data is to be copied) can be linked to the original user request for data at some offset into a file.

Events should provide visibility to major protocol events such as requests, responses, acknowledgements, errors, and retries. If a request at some level is fragmented into multiple requests, a trace event should indicate this and supply linkage data to allow the multiple requests to be tracked from that point. If multiple requests at some level are coalesced into a single request, a trace event should also indicate this and provide appropriate data to track the new request.

Use events to give visibility to resource consumption. Whenever resources are claimed, returned, created, or deleted an event should record the fact. For example, claiming or returning buffers to a buffer pool or growing or shrinking the number of buffers in the pool.

The following guidelines can help you determine where and when you should have trace hooks in your code:

Tracing entry and exit points of every function is not necessary. Provide only key actions and data.
Show linkage between major code blocks or processes.
If work is queued, associate a name (handle) with it and output it as data.
If a queue is being serviced, the trace event should indicate the unique element being serviced.
If a work request or response is being referenced by different handles as it passes through different software components, trace the transactions so the action or receipt can be identified.
Place trace hooks so that requests, responses, errors, and retries can be observed.
Identify when resources are claimed, returned, created, or destroyed.

Also note that:

A trace ID can be used for a group of events by "switching" on one of the data fields. This means that a particular data field can be used to identify from where the trace point was called. The trace format routine can be made to format the trace data for that unique trace point.
The trace hook is the level at which a group of events can be enabled or disabled. Note that trace hooks can be grouped in SMIT. For more information, see Trace Event Groups.

Syntax for Stanzas in the trace Format File

The intent of the trace format file is to provide rules for presentation and display of the expected data for each event ID. This allows new events to be formatted without changing the report facility. Rules for new events are simply added to the format file. The syntax of the rules provide flexibility in the presentation of the data.

A trace format stanza can be as long as required to describe the rules for any particular event. The stanza can be continued to the next line by terminating the present line with a backslash (\). The fields are:

event_id

Each stanza begins with the three-digit hexadecimal event ID that the stanza describes, followed by a space.

V.R

This field describes the version (V) and release (R) that the event was first assigned. Any integers work for V and R, and you might want to keep your own tracking mechanism.

L=

The text description of an event can begin at various indentation levels. This improves the readability of the report output. The indentation levels correspond to the level at which the system is running. The recognized levels are:

APPL: Application level
SVC: Transitioning system call
KERN: Kernel level
INT: Interrupt

event_label

The event_label is an ASCII text string that describes the overall use of the event ID. This is used by the -j option of the trcrpt command to provide a listing of events and their first level description. The event label also appears in the formatted output for the event unless the event_label field starts with an @ character.

\n

The event stanza describes how to parse, label, and present the data contained in an event record. You can insert a \n (newline) in the event stanza to continue data presentation of the data on a new line. This allows the presentation of the data for an event to be several lines long.

\t

The \t (tab) function inserts a tab at the point it is encountered in parsing the description. This is similar to the way the \n function inserts new lines. Spacing can also be inserted by spaces in the data_label or match_label fields.

starttimer(#,#)

The starttimer and endtimer fields work together. The (#,#) field is a unique identifier that associates a particular starttimer value with an endtimer that has the same identifier. By convention, if possible, the identifiers should be the ID of starting event and the ID of the ending event.

When the report facility encounters a start timer directive while parsing an event, it associates the starting events time with the unique identifier. When an end timer with the same identifier is encountered, the report facility outputs the delta time (this appears in brackets) that elapsed between the starting event and ending event. The begin and end system call events make use of this capability. On the return from system call event, a delta time is shown that indicates how long the system call took.

endtimer(#,#)

See the starttimer field in the preceding paragraph.

data_descriptor

The data_descriptor field is the fundamental field that describes how the report facility consumes, labels, and presents the data.

The various subfields of the data_descriptor field are:

data_label: The data label is an ASCII string that can optionally precede the output of data consumed by the following format field.
format: You can think of the report facility as having a pointer into the data portion of an event. This data pointer is initialized to point to the beginning of the event data (the 16-bit data field in the hookword). The format field describes how much data the report facility consumes from this point and how the data is considered. For example, a value of Bm.n tells the report facility to consume m bytes and n bits of data and to consider it as binary data.
The possible format fields are described in the following section. If this field is not followed by a comma, the report facility outputs the consumed data in the format specified. If this field is followed by a comma, it signifies that the data is not to be displayed but instead compared against the following match_vals field. The data descriptor associated with the matching match_val field is then applied to the remainder of the data.

match_val

The match value is data of the same format described by the preceding format fields. Several match values typically follow a format field that is being matched. The successive match fields are separated by commas. The last match value is not followed by a comma. Use the character string \* as a pattern-matching character to match anything. A pattern-matching character is frequently used as the last element of the match_val field to specify default rules if the preceding match_val field did not occur.

match_label

The match label is an ASCII string that is output for the corresponding match.

Each of the possible format fields is described in the comments of the /etc/trcfmt file. The following shows several possibilities:

Format field

descriptions

In most cases, the data length part of the specifier can also be the letter "W" which indicates that the word size of the trace hook is to be used. For example, XW will format 4 or 8 bytes into hexadecimal, depending upon whether the trace hook comes from a 32 or 64 bit environment.
Am.n	This value specifies that m bytes of data are consumed as ASCII text, and that it is displayed in an output field that is n characters wide. The data pointer is moved m bytes.
S1, S2, S4	Left justified string. The length of the field is defined as 1 byte (S1), 2 bytes (S2), or 4 bytes (S4) and so on. The data pointer is moved accordingly. SW indicates that the word size for the trace event is to be used.
Bm.n	Binary data of m bytes and n bits. The data pointer is moved accordingly.
Xm	Hexadecimal data of m bytes. The data pointer is moved accordingly.
D2, D4	Signed decimal format. Data length of 2 (D2) bytes or 4 (D4) bytes is consumed.
U2, U4	Unsigned decimal format. Data length of 2 or 4 bytes is consumed.
F4, F8	Floating point of 4 or 8 bytes.
Gm.n	Positions the data pointer. It specifies that the data pointer is positioned m bytes and n bits into the data.
Om.n	Skip or omit data. It omits m bytes and n bits.
Rm	Reverse the data pointer m bytes.
Wm	Position DATA_POINTER at word m. The word size is either 4 or 8 bytes, depending upon whether or not this is a 32 or 64 bit format trace. This bares no relation to the %W format specifier.

Some macros are provided that can be used as format fields to quickly access data. For example:

$D1, $D2, $D3, $D4, $D5

These macros access data words 1 through 5 of the event record without moving the data pointer. The data accessed by a macro is hexadecimal by default. A macro can be cast to a different data type (X, D, U, B) by using a % character followed by the new format code. For example, the following macro causes data word one to be accessed, but to be considered as 2 bytes and 3 bits of binary data:

$D1%B2.3

$HD

This macro accesses the first 16 bits of data contained in the hookword, in a similar manner as the $D1 through $D5 macros access the various data words. It is also considered as hexadecimal data, and also can be cast.

You can define other macros and use other formatting techniques in the trace format file. This is shown in the following trace format file example.

Example Trace Format File

# @(#)65  1.142  src/bos/usr/bin/trcrpt/trcfmt, cmdtrace, bos43N, 9909A_43N 2/12/99 13:15:34
# COMPONENT_NAME: CMDTRACE   system trace logging and reporting facility
#
# FUNCTIONS: template file for trcrpt
#
# ORIGINS: 27, 83
#
# (C) COPYRIGHT International Business Machines Corp. 1988, 1993
# All Rights Reserved
# Licensed Materials - Property of IBM
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
# LEVEL 1, 5 Years Bull Confidential Information
#

# I. General Information
#
#    The formats shown below apply to the data placed into the
#    trcrpt format buffer.  These formats in general mirror the binary
#    format of the data in the trace stream.  The exceptions are
#    hooks from a 32-bit application on a 64-bit kernel, and hooks from a
#    64-bit application on a 32-bit kernel.  These exceptions are noted
#    below as appropriate.
#
#    Trace formatting templates should not use the thread id or time
#    stamp from the buffer.  The thread id should be obtained with the
#    $TID macro.  The time stamp is a raw timer value used by trcrpt to
#    calculate the elapsed and delta times.  These values are either
#    4 or 8 bytes depending upon the system the trace was run on, not upon
#    the environment from which the hook was generated.
#    The system environment, 32 or 64 bit, and the hook's
#    environment, 32 or 64 bit, are obtained from the $TRACEENV and $HOOKENV
#    macros discussed below.
#
#    To interpret the time stamp, it is necessary to get the values from
#    hook 0x00a, subhook 0x25c, used to convert it to nanoseconds.
#    The 3 data words of interest are all 8 bytes in length and are in
#    the generic buffer, see the template for hook 00A.
#    The first data word gives the multiplier, m, and the second word
#    is the divisor, d.  These values should be set to 1 if the
#    third word doesn't contain a 2.  The nanosecond time is then
#    calculated with nt = t * m / d where t is the time from the trace.
#
#    Also, on a 64-bit system, there will be a header on the trace stream.
#    This header serves to identify the stream as coming from a
#    64-bit system.  There is no such header on the data stream on a
#    32-bit system.  This data stream, on both systems, is produced with
#    the "-o -" option of the trace command.
#    This header consists only of a 4-byte magic number, 0xEFDF1114.
#
# A. Binary format for the 32-bit trace data
#    TRCHKL0        MMMTDDDDiiiiiiii
#    TRCHKL0T       MMMTDDDDiiiiiiiitttttttt
#    TRCHKL1        MMMTDDDD11111111iiiiiiii
#    TRCHKL1T       MMMTDDDD11111111iiiiiiiitttttttt
#    Note that trchkg covers TRCHKL2-TRCHKL5.
#    trchkg       MMMTDDDD1111111122222222333333334444444455555555iiiiiiii
#    trchkgt      MMMTDDDD1111111122222222333333334444444455555555 i... t...
#    trcgent      MMMTLLLL11111111vvvvvvvvvvvvvvvvvvvvvvvvvvxxxxxx i... t...
#
#         legend:
#      MMM = hook id
#      T   = hooktype
#      D   = hookdata
#      i   = thread id, 4 bytes on a 32 byte system and 8 bytes on a 64-bit
#            system. The thread id starts on a 4 or 8 byte boundary.
#      t   = timestamp, 4 bytes on a 32-bit system or 8 on a
#            64-bit system.
#      1   = d1  (see trchkid.h for calling syntax for the tracehook routines)
#      2   = d2, etc.
#      v   = trcgen variable length buffer
#      L   = length of variable length data in bytes.
#
#    The DATA_POINTER starts at the third byte in the event, ie.,
#    at the 16 bit hookdata DDDD.
#    The trcgen() is an exception. The DATA_POINTER starts at
#    the fifth byte, ie., at the 'd1' parameter 11111111.
#
#    Note that a generic trace hook with a hookid of 0x00b is
#    produced for 64-bit data traced from a 64-bit app running on
#    a 32-bit kernel.  Since this is produced on a 32-bit system, the
#    thread id and time stamp will be 4 bytes in the data stream.
#
# B.  64-bit trace hook format
#
#    TRCHK64L0	ffffllllhhhhssss iiiiiiiiiiiiiiii
#    TRCHK64L0T	ffffllllhhhhssss iiiiiiiiiiiiiiii tttttttttttttttt
#    TRCHK64L1	ffffllllhhhhssss 1111111111111111 i...
#    ...
#    TRCGEN   	ffffllllhhhhssss dddddddddddddddd "string" i...
#    TRCGENt  	ffffllllhhhhssss dddddddddddddddd "string" i... t...
#
#    Legend
#      f - flags
#        tgbuuuuuuuuuuuuu: t - time stamped,  g - generic (trcgen),
#          b - 32-bit data, u - unused.
#      l - length, number of bytes traced.
#        For TRCHKL0 llll = 0,
#        for TRCHKL5T llll = 40, 0x28 (5 8-byte words)
#      h - hook id
#       s - subhook id
#       1 - data word 1, ...
#       d - generic trace data word.
#       i - thread id, 8 bytes on a 64-bit system, 4 on a 32-bit system.
#           The thread id starts on an 8-byte boundary.
#       t - time stamp, 8 bytes on a 64-bit system, 4 on a 32-bit system.
#
#    For non-generic entries, the data pointer starts at the
#    subhook id, offset 6.  This is compatible with the 32-bit
#    hook format shown above.
#    For generic (trcgen) hooks, the g flag above is on.  The
#    length shows the number of variable bytes traced and does not include
#    the data word.
#    The data pointer starts at the 64-bit data word.
#    Note that the data word is 64 bits here.
#
# C. Trace environments
#    The trcrpt, trace report, utility must be able to tell whether
#    the trace it's formatting came from a 32 or a 64 bit system.
#    This is accomplished by the log file header's magic number.
#    In addition, we need to know whether 32 or 64 bit data was traced.
#    It is possible to run a 32-bit application on a 64-bit kernel,
#    and a 64-bit application on a 32-bit kernel.
#    In the case of a 32-bit app on a 64-bit kernel, the "b" flag
#    shown under item B above is set on.  The trcrpt program will
#    then present the data as if it came from a 32-bit kernel.
#    In the second case, if the reserved hook id 00b is seen, the data
#    traced by the 32-bit kernel is made to look as if it came
#    from a 64-bit trace.  Thus the templates need not be kernel aware.
#
#    For example, if a 32-bit app uses
#    TRCHKL5T(0x50000005, 1, 2, 3, 4, 5)
#    and is running on a 64-bit kernel, the data actually traced
#    will look like:
#      ffffllllhhhhssss 1111111111111111 2222222222222222 3333333333333333
#      a000001450000005 0000000100000002 0000000300000004 0000000500000000 i t
#    Here, the flags have the T and B bits set (a000) which says
#    the hook is timestamped and from a 32-bit app.
#    The length is 0x14 bytes, 5 4-byte registers 00000001 through
#    00000005.
#    The hook id is 0x5000.
#    The subhook id is 0x0005.
#    i and t refer to the 8-byte thread id and time stamp.
#
#    This would be reformatted as follows before being processed
#    by the corresponding template:
#      500e0005 00000001 00000002 00000003 00000004 00000005
#    Note this is how the data would look if traced on a 32-bit kernel.
#    Note also that the data would be followed by an 8-byte thread id and
#    time stamp.
#
#    Similarly, consider the following hook traced by a 64-bit app
#    on a 32-bit kernel:
#      TRCHKL5T(0x50000005, 1, 2, 3, 4, 5)
#    The data traced would be:
#      00b8002c 80000028 50000005 0000000000000001 ... 0000000000000005 i t
#    Note that this is a generic trace entry, T = 8.
#    In the generic entry, we're using the 32-bit data word for the flags
#    and length.
#    The trcrpt utility would reformat this before processing by
#    the template as follows:
#      8000002850000005 0000000000000001 ... 0000000000000005 i8 t8
#
#    The thread id and time stamp in the data stream will be 4 bytes,
#    because the hook came from a 32-bit system.
#
#    If a 32-bit app traces generic data on a 64-bit kernel, the b
#    bit will be set on in the data stream, and the entry will be formatted
#    like it came from a 32-bit environment, i.e. with a 32-bit data word.
#    For the case of a 64-bit app on a 32-bit kernel, generic trace
#    data is handled in the same manner, with the flags placed
#    into the data word.
#    For example, if the app issues
#      TRCGEN(1, 0x50000005, 1, 6, "hello")
#    The 32-bit kernel trace will generate
#      00b00012 40000006 50000005 0000000000000001 "hello"
#   This will be reformatted by trcrpt into
#      4000000650000005 0000000000000001 "hello"
#   with the data pointer starting at the data word.
#
#   Note that the string "hello" could have been 4096 bytes.  Therefore
#   this generic entry must be able to violate the 4096 byte length
#   restriction.
#
# D. Indentation levels
#    The left margin is set per template using the 'L=XXXX' command.
#    The default is L=KERN, the second column.
#    L=APPL moves the left margin to the first  column.
#    L=SVC  moves the left margin to the second column.
#    L=KERN moves the left margin to the third  column.
#    L=INT  moves the left margin to the fourth column.
#    The command if used must go just after the version code.
#
#    Example usage:
#113 1.7 L=INT "stray interrupt" ... \
#
# E. Continuation code and delimiters.
#    A '\' at the end of the line must be used to continue the template
#      on the next line.
#    Individual strings (labels) can be separated by one or more blanks
#      or tabs. However, all whitespace is squeezed down to 1 blank on
#      the report. Use '\t' for skipping to the next tabstop, or use
#      A0.X format (see below) for variable space.
#
#
# II. FORMAT codes
#
# A. Codes that manipulate the DATA_POINTER
# Gm.n
#     "Goto"    Set DATA_POINTER to byte.bit location m.n
#
# Om.n
#     "Omit"    Advance DATA_POINTER by m.n byte.bits
#
# Rm
#     "Reverse" Decrement DATA_POINTER by m bytes. R0 byte aligns.
#
# Wm
#     Position DATA_POINTER at word m.  The word size is either 4 or 8
#     bytes, depending upon whether or not this is a 32 or 64 bit format
#     trace.  This bares no relation to the %W format specifier.
#
# B. Codes that cause data to be output.
# Am.n
#     Left justified ascii.
#     m=length in bytes of the binary data.
#     n=width of the displayed field.
#     The data pointer is rounded up to the next byte boundary.
#     Example
#      DATA_POINTER|
#                  V
#             xxxxxhello world\0xxxxxx
#
#  i.   A8.16 results in:                       |hello wo        |
#       DATA_POINTER--------|
#                           V
#              xxxxxhello world\0xxxxxx
#
#  ii.  A16.16 results in:                      |hello world     |
#       DATA_POINTER----------------|
#                                   V
#              xxxxxhello world\0xxxxxx
#
#  iii. A16 results in:                         |hello world|
#       DATA_POINTER----------------|
#                                   V
#              xxxxxhello world\0xxxxxx
#
#  iv.  A0.16 results in:                       |                |
#       DATA_POINTER|
#                   V
#              xxxxxhello world\0xxxxxx
#
# Sm (m = 1, 2, 4, or 8)
#     Left justified ascii string.
#     The length of the string is in the first m bytes of
#      the data. This length of the string does not include these bytes.
#     The data pointer is advanced by the length value.
#     SW specifies the length to be 4 or 8 bytes, depending upon whether
#     this is a 32 or 64 bit hook.
#     Example
#      DATA_POINTER|
#                  V
#             xxxxxBhello worldxxxxxx     (B = hex 0x0b)
#
#  i.   S1 results in:                          |hello world|
#       DATA_POINTER-----------|
#                              V
#              xxxxBhello worldxxxxxx
#
# $reg%S1
#     A register with the format code of 'Sx' works "backwards" from
#     a register with a different type. The format is Sx, but the length
#     of the string comes from $reg instead of the next n bytes.
#
# Bm.n
#     Binary format.
#     m = length in bytes
#     n = length in bits
#     The length in bits of the data is m * 8 + n. B2.3 and B0.19 are the same.
#     Unlike the other printing FORMAT codes, the DATA_POINTER
#     can be bit aligned and is not rounded up to the next byte boundary.
#
# Xm
#     Hex format.
#     m = length in bytes. m=0 thru 16
#     X0 is the same as X1, except that no trailing space is output after
#     the data.  Therefore X0 can be used with a LOOP to output an
#     unbroken string of data.
#     The DATA_POINTER is advanced by m (1 if m = 0).
#     XW will format either 4 or 8 bytes of data depending upon whether
#     this is a 32 or 64 bit hook.  The DATA_POINTER is advanced
#     by 4 or 8 bytes.
#
# Dm (m = 2, 4, or 8)
#     Signed decimal format.
#     The length of the data is m bytes.
#     The DATA_POINTER is advanced by m.
#     DW will format either 4 or 8 bytes of data depending upon whether
#     this is a 32 or 64 bit hook.  The DATA_POINTER is advanced
#     by 4 or 8 bytes.
#
# Um (m = 2, 4, or 8)
#     Unsigned decimal format.
#     The length of the data is m bytes.
#     The DATA_POINTER is advanced by m.
#     UW will format either 4 or 8 bytes of data depending upon whether
#     this is a 32 or 64 bit hook.  The DATA_POINTER is advanced
#     by 4 or 8 bytes.
#
# om (m = 2, 4, or 8)
#     Octal format.
#     The length of the data is m bytes.
#     The DATA_POINTER is advanced by m.
#     ow will format either 4 or 8 bytes of data depending upon whether
#     this is a 32 or 64 bit hook.  The DATA_POINTER is advanced
#     by 4 or 8 bytes.
#
# F4
#     Floating point format. (like %0.4E)
#     The length of the data is 4 bytes.
#     The format of the data is that of C type 'float'.
#     The DATA_POINTER is advanced by 4.
#
# F8
#     Floating point format. (like %0.4E)
#     The length of the data is 8 bytes.
#     The format of the data is that of C type 'double'.
#     The DATA_POINTER is advanced by 8.
#
# HB
#     Number of bytes in trcgen() variable length buffer.
#     The DATA_POINTER is not changed.
#
# HT
#   32-bit hooks:
#     The hooktype. (0 - E)
#     trcgen  = 0, trchk  = 1, trchl  = 2, trchkg  = 6
#     trcgent = 8, trchkt = 9, trchlt = A, trchkgt = E
#     HT & 0x07 masks off the timestamp bit
#     This is used for allowing multiple, different trchook() calls with
#       the same template.
#     The DATA_POINTER is not changed.
#   64-bit hooks
#     This is the flags field.
#     0x8000 - hook is time stamped.
#     0x4000 - This is a generic trace.
#
#     Note that if the hook was reformatted as discussed under item
#     I.C above, HT is set to reflect the flags in the new format.
#
# C. Codes that interpret the data in some way before output.
# Tm (m = 4, or 8)
#     Output the next m bytes as a data and time string,
#     in GMT timezone format.  (as in ctime(&seconds))
#     The DATA_POINTER is advanced by m bytes.
#     Only the low-order 32-bits of the time are actually used.
#     TW will format either 4 or 8 bytes of data depending upon whether
#     this is a 32 or 64 bit hook.  The DATA_POINTER is advanced
#     by 4 or 8 bytes.
#
# Em (m = 1, 2, 4, or 8)
#     Output the next m bytes as an 'errno' value, replacing
#     the numeric code with the corresponding #define name in
#     /usr/include/sys/errno.h
#     The DATA_POINTER is advanced by 1, 2, 4, or 8.
#     EW will format either 4 or 8 bytes of data depending upon whether
#     this is a 32 or 64 bit hook.  The DATA_POINTER is advanced
#     by 4 or 8 bytes.
#
# Pm (m = 4, or 8)
#     Use the next m bytes as a process id (pid), and
#     output the pathname of the executable with that process id.
#     Process ids and their pathnames are acquired by the trace command
#     at the start of a trace and by trcrpt via a special EXEC tracehook.
#     The DATA_POINTER is advanced by 4 or 8 bytes.
#     PW will format either 4 or 8 bytes of data depending upon whether
#     this is a 32 or 64 bit hook.
#
# \t
#     Output a tab. \t\t\t outputs 3 tabs. Tabs are expanded to spaces,
#     using a fixed tabstop separation of 8. If L=0 indentation is used,
#     the first tabstop is at 3.
#
# \n
#     Output a newline. \n\n\n outputs 3 newlines.
#     The newline is left-justified according to the INDENTATION LEVEL.
#
# $macro
#     Undefined macros have the value of 0.
#     The DATA_POINTER is not changed.
#     An optional format can be used with macros:
#        $v1%X8    will output the value $v1 in X8 format.
#        $zz%B0.8  will output the value $v1 in 8 bits of binary.
#     Understood formats are: X, D, U, B and W. Others default to X2.
#
#     The W format is used to mask the register.
#     Wm.n masks off all bits except bits m through n, then shifts the
#     result right m bits.  For example, if $ZZ = 0x12345678, then
#     $zz%W24.27 yields 2.  Note the bit numbering starts at the right,
#     with 0 being the least significant bit.
#
# "string"     'string' data type
#     Output the characters inside the double quotes exactly. A string
#     is treated as a descriptor. Use "" as a NULL string.
#
# `string format $macro` If a string is backquoted, it is expanded
#     as a quoted string, except that FORMAT codes and $registers are
#     expanded as registers.
#
# III. SWITCH statement
#     A format code followed by a comma is a SWITCH statement.
#     Each CASE entry of the SWITCH statement consists of
#       1. a 'matchvalue' with a type (usually numeric) corresponding to
#          the format code.
#       2. a simple 'string' or a new 'descriptor' bounded by braces.
#          A descriptor is a sequence of format codes, strings, switches,
#          and loops.
#       3. and a comma delimiter.
#       The switch is terminated by a CASE entry without a comma delimiter.
#     The CASE entry selected is the first entry whose matchvalue
#     is equal to the expansion of the format code.
#     The special matchvalue '\*' is a wildcard and matches anything.
#     The DATA_POINTER is advanced by the format code.
#
#
# IV. LOOP statement
#     The syntax of a 'loop' is
#     LOOP format_code { descriptor }
#     The descriptor is executed N times, where N is the numeric value
#       of the format code.
#     The DATA_POINTER is advanced by the format code plus whatever the
#       descriptor does.
#     Loops are used to output binary buffers of data, so descriptor is
#       usually simply X1 or X0. Note that X0 is like X1 but does not
#       supply a space separator ' ' between each byte.
#
#
# V. macro assignment and expressions
#   'macros' are temporary (for the duration of that event) variables
#   that work like shell variables.
#   They are assigned a value with the syntax:
#   {{ $xxx = EXPR }}
#   where EXPR is a combination of format codes, macros, and constants.
#   Allowed operators are + - / *
#   For example:
#{{ $dog = 7 + 6 }} {{ $cat = $dog * 2 }} $dog $cat
#
#   will output:
#000D 001A
#
#   Macros are useful in loops where the loop count is not always
#   just before the data:
#G1.5 {{ $count = B0.5 }} G11 LOOP $count {X0}
#
#   Up to 255 macros can be defined per template.
#
#
# VI. Special macros:
# $HOOKENV     This is either "32" or "64" depending upon
#              whether this is a 32 or 64 bit trace hook.
#              This can be used to interpret the HT value.
# $TRACEENV    This is either "32" or "64" depending upon
#              whether this is a 32 or 64 bit trace, i.e., whether the
#
         trace was generated by a 32 or 64 bit kernel.
#              Since hooks will be formatted according to the environment
#              they came from, $HOOKENV should normally be used.
# $RELLINENO   line number for this event. The first line starts at 1.
# $D1 - $D5    dataword 1 through dataword 5. No change to datapointer.
#   The data word is either 4 or 8 bytes.
# $L1 - $L5    Long dataword 1,5(64 bits). No change to datapointer.
# $HD          hookdata (lower 16 bits)
#              For a 32-bit generic hook, $HD is the length of the
#              generic data traced.
#              For 32 or 64 bit generic hooks, use $HL.
# $HL          Hook data length.  This is the length in bytes of the hook
#              data.  For generic entries it is the length of the
#              variable length buffer and doesn't include the data word.
# $WORDSIZE    Contains the word size, 4 or 8 bytes, of the current
#              entry, (i.e.) $HOOKENV / 8.
# $GENERIC     specifies whether the entry is a generic entry.  The
#              value is 1 for a generic entry, and 0 if not generic.
#              $GENERIC is especially useful if the hook can come from
#              either a 32 or 64 bit environment, since the types (HT)
#              have different formats.
# $TOTALCPUS   Output the number of CPUs in the system.
# $TRACEDCPUS  Output the number of CPUs that were traced.
# $REPORTEDCPUS Output the number of CPUs active in this report.
#              This can decrease as CPUs stop tracing when, for example,
#              the single-buffer trace, -f, was used and the buffers for
#              each CPU fill up.
# $LARGEDATATYPES This is set to 1 if the kernel is supporting large data
#              types for 64-bit applications.
# $SVC         Output the name of the current SVC
# $EXECPATH    Output the pathname of the executable for current process.
# $PID         Output the current process id.
# $TID         Output the current thread id.
# $CPUID       Output the current processor id.
# $PRI         Output the current process priority
# $ERROR       Output an error message to the report and exit from the
#              template after the current descriptor is processed.
#              The error message supplies the logfile, logfile offset of the
#              start of that event, and the traceid.
# $LOGIDX      Current logfile offset into this event.
# $LOGIDX0     Like $LOGIDX, but is the start of the event.
# $LOGFILE     Name of the logfile being processed.
# $TRACEID     Traceid of this event.
# $DEFAULT     Use the DEFAULT template 008
# $STOP        End the trace report right away
# $BREAK       End the current trace event
# $SKIP        Like break, but don't print anything out.
# $DATAPOINTER The DATA_POINTER. It can be set and manipulated
#              like other user-macros.
#              {{ $DATAPOINTER = 5 }} is equivalent to G5
#
# Note:  For generic trace hooks, $DATAPOINTER points to the
#   data word.  This means it is 0x4 for 32-bit hooks, and 0x8 for
#   64-bit hooks.
#   For non-generic hooks, $DATAPOINTER is set to 2 for 32-bit hooks
#   and to 6 for 64 bit trace hooks.  This means it always
#   points to the subhook id.
#
# $BASEPOINTER Usually 0. It is the starting offset into an event. The actual
#              offset is the DATA_POINTER + BASE_POINTER. It is used with
#              template subroutines, where the parts on an event have the
#              same structure, and can be printed by the same template, but
#              might have different starting points into an event.
# $IPADDR      IP address of this machine, 4 bytes.
# $BUFF        Buffer allocation scheme used, 1=kernel heap, 2=separate segment.
#
# VII. Template subroutines
#    If a macro name consists of 3 hex digits, it is a "template subroutine".
#    The template whose traceid equals the macro name is inserted in place
#    of the macro.
#
#    The data pointer is where it was when the template
#    substitution was encountered. Any change made to the data pointer
#    by the template subroutine remains in affect when the template ends.
#
#    Macros used within the template subroutine correspond to those in the
#    calling template. The first definition of a macro in the called template
#    is the same variable as the first in the called. The names are not
#    related.
#
#    NOTE: Nesting of template subroutines is supported to 10 levels.
#
#    Example:
#    Output the trace label ESDI STRATEGY.
#    The macro '$stat' is set to bytes 2 and 3 of the trace event.
#    Then call template 90F to interpret a buf header. The macro '$return'
#    corresponds to the macro '$rv', because they were declared in the same
#    order. A macro definition with no '=' assignment just declares the name
#    like a place holder. When the template returns, the saved special
#    status word is output and the returned minor device number.
#
#900 1.0 "ESDI STRATEGY" {{ $rv = 0 }} {{ $stat = X2 }} \
#       $90F \n\
#special_esdi_status=$stat for minor device $rv
#
#90F 1.0 "" G4 {{ $return }} \
#       block number X4 \n\
#       byte count   X4 \n\
#       B0.1, 1 B_FLAG0 \
#       B0.1, 1 B_FLAG1 \
#       B0.1, 1 B_FLAG2 \
#       G16 {{ $return = X2 }}
#
#
#    Note: The $DEFAULT reserved macro is the same as $008
#
# VIII. BITFLAGS statement
#     The syntax of a 'bitflags' is
#     BITFLAGS [format_code|register],
#         flag_value string {optional string if false},   or
#         '&' mask field_value string,
#                 ...
#
#     This statement simplifies expanding state flags, because it looks
#       a lot like a series of #defines.
#     The '&' mask is used for interpreting bit fields.
#     The mask is anded to the register and the result is compared to
#       the field_value. If a match, the string is printed.
#     The base is 16 for flag_values and masks.
#     The DATA_POINTER is advanced if a format code is used.
#     Note: the default base for BITFLAGS is 16. If the mask or field value
#     has a leading "o", the number is octal. 0x or 0X makes the number hexadecimal.

Examples of Coding Events and Formatting Events

There are five basic steps involved in generating a trace from your software program.

Step 1: Enable the trace

Enable and disable the trace from your software that has the trace hooks defined. The following code shows the use of trace events to time the running of a program loop.

#include      <sys/trcctl.h>
#include      <sys/trcmacros.h>
#include      <sys/trchkid.h>

char      *ctl_file = "/dev/systrctl";
int      ctlfd;
int      i;

main()
{
        printf("configuring trace collection \n");
        if (trcstart("-ad")){
                perror("trcstart");
                exit(1);
        }
        printf("turning trace on \n");
        if(trcon(0)){
                perror("TRCON");
                exit(1);
        }
        /* here is the code that is being traced */
        for(i=1;i<11;i++){
                TRCHKL1T(HKWD_USER1,i);
                /*   sleep(1) */
                /* you can uncomment sleep to make the loop
                /* take longer. If you do, you will want to
                /* filter the output or you will be */
                /* overwhelmed with 11 seconds of data */
        }
        /* stop tracing code */
        printf("turning trace off\n");
        if(trcstop(0)){
                perror("TRCOFF");
                exit(1);
        }

Step 2: Compile your program

When you compile the sample program, you need to link to the librts.a library:

cc -o sample sample.c -l rts

Step 3: Run the program

Run the program. In this case, it can be done with the following command:

./sample

Step 4: Add a stanza to the format file

This provides the report generator with the information to correctly format your file. The report facility does not know how to format the HKWD_USER1 event, unless you provide rules in the trace format file.

The following is an example of a stanza for the HKWD_USER1 event. The HKWD_USER1 event is event ID 010 hexadecimal. You can verify this by looking at the sys/trchkid.h header file.

# User event HKWD_USER1 Formatting Rules Stanza
# An example that will format the event usage of the sample program
010 1.0 L=APPL "USER EVENT - HKWD_USER1" O2.0 \n\
               "The # of loop iterations =" U4\n\
               "The elapsed time of the last loop = "\
                endtimer(0x010,0x010) starttimer(0x010,0x010)

Note

When entering the example stanza, do not modify the master format file /etc/trcfmt. Instead, make a copy and keep it in your own directory. This allows you to always have the original trace format file available. If you are going to ship your formatting stanzas, the trcupdate command is used to add your stanzas to the default trace format file. See the trcupdate command in AIX 5L Version 5.2 Commands Reference, Volume 5 for information about how to code the input stanzas.

Step 5: Run the format/filter program

Filter the output report to get only your events. To do this, run the trcrpt command:

trcrpt -d 010 -t mytrcfmt -O exec=on -o sample.rpt

The formatted trace results are:

ID PROC NAME  I   ELAPSED_SEC  DELTA_MSEC    APPL     SYSCALL    KERNEL   INTERRUPT
010   sample     0.000105984  0.105984     USER HOOK 1
                                           The data field for the user hook = 1
010   sample     0.000113920  0.007936     USER HOOK 1
                                           The data field for the user hook = 2 [7 usec]
010   sample     0.000119296  0.005376     USER HOOK 1
                                           The data field for the user hook = 3 [5 usec]
010   sample     0.000124672  0.005376     USER HOOK 1
                                           The data field for the user hook = 4 [5 usec]
010   sample     0.000129792  0.005120     USER HOOK 1
                                           The data field for the user hook = 5 [5 usec]
010   sample     0.000135168  0.005376     USER HOOK 1
                                           The data field for the user hook = 6 [5 usec]
010   sample     0.000140288  0.005120     USER HOOK 1
                                           The data field for the user hook = 7 [5 usec]
010   sample     0.000145408  0.005120     USER HOOK 1
                                           The data field for the user hook = 8 [5 usec]
010   sample     0.000151040  0.005632     USER HOOK 1
                                           The data field for the user hook = 9 [5 usec]
010   sample     0.000156160  0.005120     USER HOOK 1
                                           The data field for the user hook = 10 [5 usec]

Usage Hints

The following sections provide some examples and suggestions for use of the trace facility.

Viewing trace Data

Including several optional columns of data in the trace output can cause the output to exceed 80 columns. It is best to view the report on an output device that supports 132 columns. You can also use the -O 2line=on option to produce a more narrow report.

Bracketing Data Collection

Trace data accumulates rapidly. Bracket the data collection as closely around the area of interest as possible. One technique for doing this is to issue several commands on the same command line. For example, the command

trace -a; cp /etc/trcfmt /tmp/junk; trcstop

captures the total execution of the copy command.

Note

This example is more educational if the source file is not already cached in system memory. The trcfmt file can be in memory if you have been modifying it or producing trace reports. In that case, choose as the source file some other file that is 50 to 100 KB and has not been touched.

Reading a trace Report

The trace facility displays system activity. It is a useful learning tool to observe how the system actually performs. The previous output is an interesting example to browse. To produce a report of the copy, use the following:

trcrpt -O "exec=on,pid=on" > cp.rpt

In the cp.rpt file you can see the following activities:

The fork, exec, and page fault activities of the cp process.
The opening of the /etc/trcfmt file for reading and the creation of the /tmp/junk file.
The successive read and write subroutines to accomplish the copy.
The cp process becoming blocked while waiting for I/O completion, and the wait process being dispatched.
How logical volume requests are translated to physical volume requests.
The files are mapped rather than buffered in traditional kernel buffers. The read accesses cause page faults that must be resolved by the virtual memory manager.
The virtual memory manager senses sequential access and begins to prefetch the file pages.
The size of the prefetch becomes larger as sequential access continues.
The writes are delayed until the file is closed (unless you captured execution of the sync daemon that periodically forces out modified pages).
The disk device driver coalesces multiple file requests into one I/O request to the drive when possible.

Effective Filtering of the trace Report

The full detail of the trace data might not be required. You can choose specific events of interest to be shown. For example, it is sometimes useful to find the number of times a certain event occurred. Answer the question, "How many opens occurred in the copy example?" First, find the event ID for the open subroutine:

trcrpt -j | pg

You can see that event ID 15b is the open event. Now, process the data from the copy example (the data is probably still in the log file) as follows:

trcrpt -d 15b -O "exec=on"

The report is written to standard output and you can determine the number of opens that occurred. If you want to see only the opens that were performed by the cp process, run the report command again using:

trcrpt -d 15b -p cp -O "exec=on"

This command shows only the opens performed by the cp process.

Trace Event Groups

Combining multiple trace hooks into a trace event group allows all hooks to be turned on or off at once when starting a trace.

Trace event groups should only be manipulated using either the trcevgrp command, or SMIT. The trcevgrp command allows groups to be created, modified, removed, and listed.

Reserved event groups may not be changed or removed by the trcevgrp command. These are generally groups used to perform system support. A reserved event group must be created using the ODM facilities. Such a group will have three attributes as shown below:

SWservAt:
       attribute = "(name)_trcgrp"
       default = " "
       value = "(list-of-hooks)"

SWservAt:
       attribute = "(name)_trcgrpdesc"
       default = " "
       value = "description"

SWservAt:
       attribute = "(name)_trcgrptype"
       default = " "
       value = "reserved"

The hook IDs must be enclosed in double quotation marks (") and separated by commas.