Performance Toolbox Version 1.2 and 2 for AIX: Guide and Reference

Chapter 14. Data Reduction and Alarms with filtd

This chapter provides information about reducing data and alarms with the filtd program.

Overview of the filtd Program

The filtd program is designed to run as a daemon. It takes three command line arguments, all of which are optional:

filtd [-f config_file] [-b buffer_size] [-p trace_level]

Command Line Arguments for filtd:

-f
Overrides the default configuration file name. If this option is not given, the configuration file is assumed to be /etc/perf/filter.cf or else as described in "Overview of File Placement". The configuration file is where you tell filtd which data reduction and alarm definitions you want.
-p
Specifies the level of detail written to the log file. The trace level must be between 1 and 9; the higher the trace level, the more detail is written to the log file. If this option is not specified, the trace level is set to zero.
-b
Buffer size for communications with xmservd via RSI. The default buffer of 2048 bytes will allow for up to 60 statistics to be used in defining new statistics and alarms. If more are needed, the buffer size must be increased. It may also be necessary to increase the xmservd buffer size.

filtd Configuration File

When filtd is started, it immediately issues an RSiOpen() call (see "RSiOpen subroutine") to register with the local xmservd daemon. This causes xmservd to start if it is not already running. Following a successful connection to xmservd, filtd then reads the configuration file and parses the information you supplied in the file.

The configuration file contains expressions, which either define new statistics from existing ones or define alarms from statistics. Each time the name of a statistic is encountered while parsing an expression, filtd checks with the xmservd daemon whether the name is valid. If it is not, the entire expression is discarded and filtd proceeds to parse the next expression in the configuration file, if any. Errors detected are reported to the log file.

When all expressions have been parsed, filtd processes all expressions that define new statistics. First it registers the subscription for statistics it needs to build the new ones with xmservd. Then it registers with xmservd as a dynamic data supplier. At this point, filtd is both a consumer and a supplier of statistics. At the end of this initialization phase, filtd instructs xmservd to start feeding the statistics it subscribed to.

The next phase runs through any alarm definitions. No new statistics are defined at this point, but because this is the last of the initialization phases, alarms may refer to statistics that are defined by the previous phase.

Sampling Interval

Whenever new statistics are defined through the filtd configuration file, raw data statistics are initially requested from xmservd every five seconds. As long as no data-consumer program subscribes to the new statistics, the sampling interval remains at five seconds or some smaller value as required to meet the minimum requirements for alarm duration as described in "Alarm Duration and Frequency".

When other data-consumer programs subscribe to one or more of the new statistics, the sampling interval is adjusted to match the data-consumer program that requires the fastest sampling. Again, if the requirements of an alarm's duration dictate a smaller interval, that interval is selected.

For most purposes, sampling intervals can safely be set at two seconds or more. Be aware that if you have defined thirty new statistics but subscribe to only one, all thirty are calculated each time you are sent a data feed for the one you subscribe to.

Automatic Start of filtd

Since filtd is a dynamic data-supplier program, you may want to always have it running when the xmservd daemon runs. You can cause this to happen if you add a line to the xmservd configuration file, specifying the full path name of the filtd program and any command line arguments. For example:

supplier: /usr/bin/filtd -p5

Termination of filtd

The filtd daemon can be terminated by killing its process (but don't use kill -9). The daemon will terminate itself if it has not received data_feed packets from xmservd for 10 times the data feed interval. This ensures that filtd is terminated whenever xmservd is.

Data Reduction

Although we use the term data reduction, you can actually use the data reduction facilities of filtd to do exactly the opposite. You can define as many new statistics as you want to. However, we anticipate that the most common use of the data reduction facility will be to reduce a large number of statistics to a reasonable set of combined values.

Whether you define lots of new statistics or combine existing ones into fewer new ones, you do it by entering expressions into the configuration file. The general format of expressions for defining new statistics is shown in the following example:

target = expression description
target:
 Unqualified name of non-existing variable. Must start with alpha
 and contain only alpha-numeric characters and percent sign.
expression:
 {variable|wildcard|const} operator {variable|wildcard|const}
variable:
 Fully qualified xmperf variable name with slashes replaced by
 underscores; valid names have at least one underscore. The first
 name component must start with an alpha character, subsequent
 ones may also begin with a percent sign. All must contain only
 alpha-numeric characters, a percent sign, a tilde (~), a period,
 an underscore preceded by an escape character (the backslash
 '\'), or a wildcard. The referenced variable must already exist
 (can NOT be defined in this configuration file).
wildcard:
 Fully qualified xmperf variable name with slashes replaced by
 underscores; valid names have at least one underscore. The first
 name component must start with an alpha character, subsequent
 ones may also begin with a percent sign. All must contain only
 alpha-numeric characters and percent sign or must be a wildcard.
 The wildcard character must appear in place of a context name,
 must only appear once, and must be one of the characters '+',
 '*', '#', '>', '<'.
operator:
 One of {*, /, %, +, -}
const:
 [[digits].]digits
description:
 Text describing the defined target variable. Must be enclosed in
 double quotes. The length of the text should not exceed 64
 characters.

The expression can contain as many parentheses as are required to make the expression unambiguous. It is a good idea to use parentheses liberally if you are in doubt. If you are uncertain how your expression is interpreted, run the program with the command line option -p5. This writes the interpretation of the expression to the log file. If the interpretation is not what you intended, add parentheses.

All numeric constants you specify in an expression are evaluated as floating-point numbers. Similarly, the resulting new statistics (the "target" statistics) are always defined as floating-point numbers.

All new statistics are added to the context called DDS/IBM/Filters so that a new statistic called "avgload" would be known to data-consumer programs as DDS/IBM/Filters/avgload.
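
The naming rule can be illustrated with a small Python sketch. It is not part of filtd; the helper name to_filtd_name is hypothetical, and the handling of literal underscores is an assumption based on the statement that an underscore inside a name must be preceded by a backslash.

```python
def to_filtd_name(path: str) -> str:
    """Convert an xmperf path such as 'CPU/cpu0/user' to the form
    used in filtd expressions (slashes become underscores)."""
    components = path.split("/")
    # Assumption: a literal underscore inside a component is escaped
    # with a backslash so it is not read as a path separator.
    escaped = [c.replace("_", "\\_") for c in components]
    return "_".join(escaped)


print(to_filtd_name("CPU/cpu0/user"))            # CPU_cpu0_user
print(to_filtd_name("DDS/IBM/Filters/avgload"))  # DDS_IBM_Filters_avgload
```

The second call shows the name under which a statistic defined by filtd itself would be referenced from an alarm definition.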

Wildcards

The use of wildcards is a way of referring to multiple instances of a given statistic with one name but, more important, it makes your expression independent of the actual configuration of the system it is used on. For example, the expression:

allreads= Disk_+_rblk

could evaluate to different expressions on different machines, such as:

allreads =((Disk/cd0/rblk + Disk/hdisk1/rblk) + Disk/hdisk0/rblk)
allreads = Disk/hdisk0/rblk

The possible wildcard characters and their meaning are as follows:

+
All values matching the wildcard are added together.
*
All values matching the wildcard are multiplied with each other. Note that unless all the values are non-zero, the result will be zero.
#
Evaluates to a constant, which is the number of values that match the wildcard.
>
Evaluates to the maximum value of all those matching the wildcard.
<
Evaluates to the minimum value of all those matching the wildcard.
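
The five wildcard characters can be modeled in a few lines of Python. This is only a sketch of the semantics listed above, not filtd code; the function name reduce_wildcard and the sample disk readings are hypothetical.

```python
import math


def reduce_wildcard(kind: str, values: list[float]) -> float:
    """Combine all values matched by one wildcard character."""
    if kind == "+":
        return sum(values)           # all matches added together
    if kind == "*":
        return math.prod(values)     # all matches multiplied
    if kind == "#":
        return float(len(values))    # number of matches
    if kind == ">":
        return max(values)           # largest match
    if kind == "<":
        return min(values)           # smallest match
    raise ValueError(f"unknown wildcard character: {kind!r}")


# Hypothetical readings of Disk/<name>/rblk on one machine.
rblk = {"hdisk0": 12.0, "hdisk1": 30.0, "cd0": 0.0}
values = list(rblk.values())

for kind in "+*#><":
    print(kind, reduce_wildcard(kind, values))
```

Because one of the sample values is zero, the '*' wildcard evaluates to zero here, as the note in the list above warns.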

Quantities and Counters

As described in the discussion of how to define statistics in System Performance Measurement Interface API, a statistic provided by the SPMI is either of type SiCounter or of type SiQuantity. You can combine the two types in expressions to define new statistics, but the resulting statistics as added by filtd are always defined as of type SiQuantity.

This has consequences you need to understand in order to define and interpret new statistics. To see how it works, assume you have a raw statistics value defined as a counter. If data feeds for a raw statistic from xmservd called "widgets" are received with an interval of two seconds, we might get the results illustrated in the following table:

  Elapsed    Counter     Delta     Calculated
  seconds      value     value    rate/second
  -------    -------   -------    -----------
        0     33,206  
        2     33,246        40             20
        4     33,296        50             25
        6     33,460       164             82
        8     33,468         8              4
       10     33,568       100             50

If you define a new statistic with the expression:

gadgets = widgets

and use xmperf to monitor this new statistic, you will always see the rate as it was calculated when the latest data feed was received. The following table shows what you see with different viewing intervals:

Elapsed  Interval  Interval   Interval   Raw rate at
seconds  1 second  2 seconds  4 seconds  4 seconds
-------  -------- ---------   ---------  ---------
   1        ?
   2       20        20
   3       20
   4       25        25          25        23
   5       25
   6       82        82
   7       82
   8        4         4           4        43
   9        4
  10       50        50

The last column in the above table shows what the values would have been at four-second intervals if the raw counter value had been used to arrive at the average rate. Obviously, you need to take this into consideration when you define new statistics. The best way is to standardize the intervals you use.

To summarize, when new values are defined by you, any raw values of type SiQuantity are used as they are while the latest calculated rate per second is used for raw values of type SiCounter.
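
The figures in the two tables above can be reproduced with the following Python sketch. It uses the "widgets" counter values from the first table; the helper name rates is hypothetical.

```python
# (elapsed seconds, counter value) pairs from the "widgets" table.
samples = [(0, 33206), (2, 33246), (4, 33296),
           (6, 33460), (8, 33468), (10, 33568)]


def rates(points):
    """Return (time, rate per second) for each consecutive pair."""
    result = []
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        result.append((t1, (c1 - c0) / (t1 - t0)))
    return result


# Rates as filtd sees them with a two-second data feed.
two_second = rates(samples)

# Rates computed from the raw counter at four-second intervals
# (samples taken at 0, 4, and 8 seconds).
four_second = rates(samples[::2])

print(two_second)
print(four_second)
```

The four-second rates come out as 22.5 and 43.0, which the table shows rounded to 23 and 43. A viewer sampling the derived statistic every four seconds sees 25 and 4 instead, because only the latest two-second rate is reported.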

Data Reduction Delay

Because filtd must read the raw statistics before it can calculate the values of the new ones, the new statistics are always one "cycle" behind the raw statistics. An xmperf instrument that plots a statistic you defined along with the raw statistics used to calculate it always shows a time lag between the new value and the raw ones. This is obvious when the filtd program receives data feeds at the same speed as the xmperf instrument does. Whether you see it or not, however, the delay is always present.

If you want to see what it looks like, put only the following line in the filtd configuration file:

user = CPU_cpu0_user

and then define an instrument in xmperf to display the values:

CPU/cpu0/user
DDS/IBM/Filters/user

Data Reduction Examples

The xmservd daemon divides usage of the CPU resource on IBM RS/6000 systems into four groups: kernel, user, wait, and idle. If you wanted to present it as only two: busy and notbusy, you could define those two new statistics with the following expressions.

busy = CPU_cpu0_kern + CPU_cpu0_user "CPU running"
notbusy = CPU_cpu0_wait + CPU_cpu0_idle "CPU not running"

If you want to see the average number of bytes per transmitted packet for an IP interface, your expression would be:

packsize = IP_NetIf_tr0_ooctet / IP_NetIf_tr0_opacket \
           "Average packet size"

In the above example, the divisor may very well be zero quite often. Whenever a division by zero is attempted, the resulting value is set to zero. The example also shows that expressions can be continued over more than one line by terminating each line except the last one with a \ (backslash).

If you want to see how large a percentage of the network packets are using the loopback interface in your system, try a definition like the following:

localpct = (IP_NetIf_lo0_ipacket + IP_NetIf_lo0_opacket) * 100 \
           / (IP_NetIf_+_ipacket + IP_NetIf_+_opacket)   \
           "Percent of network packets on loopback interface"

The above is an illustration of the usefulness of wildcards. Another, more advanced use of wildcards is shown below. The new value is readdistr and will hold the average percent of reads from all disks expressed as a percentage of the reads from the disk that had the most reads.

readdistr = (Disk_+_rblk / Disk_#_rblk) * 100 / (Disk_>_rblk) \
     "Average disk reads in percent of most busy disk"

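As a worked illustration of the readdistr definition, the following Python sketch evaluates the same arithmetic on made-up per-disk read rates. The helper names safe_div and readdistr are hypothetical, and the zero-divisor behavior follows the rule stated earlier in this section (a division by zero yields zero).

```python
def safe_div(numerator: float, denominator: float) -> float:
    """Division as the text describes it: a zero divisor yields zero."""
    return 0.0 if denominator == 0 else numerator / denominator


def readdistr(rblk: list[float]) -> float:
    """Average reads per disk as a percentage of the busiest disk.

    Expects at least one disk in the list.
    """
    average = safe_div(sum(rblk), len(rblk))    # Disk_+_rblk / Disk_#_rblk
    return safe_div(average * 100, max(rblk))   # ... * 100 / Disk_>_rblk


# Three disks reading 10, 20, and 30 blocks per second (made-up values).
print(readdistr([10.0, 20.0, 30.0]))

# No disk activity at all: the divisor is zero, so the result is zero.
print(readdistr([0.0, 0.0, 0.0]))
```

With the first set of values the average is 20 and the busiest disk reads 30, so the result is about 66.7 percent.
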
Rounding

All calculations are done in floating-point. Rounding occurs when a data-consumer program defines the receiving field as SiLong. Most data-consumer programs use the standard function RSiGetValue() to retrieve the fields. This function rounds the data values when they are retrieved. If you display raw values that are supplied in floating-point and values computed from these values, then you may get rounded values, which seem to be wrong.

For example, two raw values may be 4.3 and 2.4, which would normally be displayed as 4 and 2, but the product computed by filtd would be 4.3 x 2.4 = 10.32 (rounded to 10 when displayed) rather than 4 x 2 = 8.
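
The same effect can be reproduced in Python. This sketch only mirrors the arithmetic of the example; it does not call RSiGetValue(), and Python's built-in round() stands in for the rounding done when a value is retrieved as an integer.

```python
a, b = 4.3, 2.4

# What filtd computes: the product of the unrounded values.
product = a * b

# What a viewer sees for the raw values and for the computed value.
shown_a, shown_b, shown_product = round(a), round(b), round(product)

print(shown_a, shown_b, shown_product)   # 4 2 10
print(shown_a * shown_b)                 # 8, what the viewer might expect
```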

Defining Alarms

An alarm consists of an action part that describes what action to trigger, and a condition part that defines the conditions for triggering the alarm. The general format for an alarm definition is shown in the following example:

action condition description
action:
 @alarm:alarm_definition
@alarm:
 Symbolic name of an alarm. Must start with '@' and otherwise
 contain only alphanumeric characters.
alarm_definition:
 One or more of: [command line], {TRAPxx}, {EXCEPTION}. See
 text following this figure for detailed description.
condition:
 bool_expression [DURATION seconds] [FREQUENCY minutes] [SEVERITY xx]
bool_expression:
 {evariable|wildcard|const} bool_operator
 {evariable|wildcard|const} ...
evariable:
 Fully qualified xmperf variable name with slashes replaced by
 underscores; valid names have at least one underscore. The first
 name component must start with an alpha character, subsequent
 ones may also begin with a percent sign. All must contain only
 alpha-numeric characters, a percent sign, a tilde (~), a period,
 an underscore preceded by an escape character (the backslash
 '\'), or a wildcard. The referenced variable may be defined by
 this same filter, in which case it must be specified as:
 DDS_IBM_Filters_target, where "target" is the name of the new
 statistic.
wildcard:
 Fully qualified xmperf variable name with slashes replaced by
 underscores; valid names have at least one underscore. The first
 name component must start with an alpha character, subsequent
 ones may also begin with a percent sign. All must contain only
 alpha-numeric characters and percent sign or must be a wildcard.
 The wildcard character must appear in place of a context name,
 must only appear once, and must be one of the characters '+',
 '*', '#', '>', '<'.
bool_operator:
 One of {*, /, %, +, -, &&, ||, ==, !=, >, >=, <, <=}
const:
 [[digits].]digits
description:
 Text describing the alarm. Must be enclosed in double quotes. The
 text cannot be more than 512 bytes in length.

Alarm Definition

An alarm can define up to three actions to take place when the alarm condition is met. These three actions are:

[command line]
A command line to be executed when the alarm condition is met. The command line must be enclosed in square brackets. The command line is always executed in the background and with the same credentials as that of the filtd daemon. If the filtd daemon has been started by xmservd, the command line is executed with root authority.
{TRAPxx}
This action can always be specified but it only produces the desired results if the xmservd daemon is configured to export its statistics to the snmpd daemon through the xmservd/SMUX interface described in the SNMP Multiplex Interface.
Note: The interface to SMUX is only available on RS/6000 Agents.

If xmservd does talk to the snmpd daemon, this type of action will, when the defined condition becomes true, produce an SNMP trap that is passed on through xmservd to snmpd and, eventually, to an SNMP manager such as IBM NetView. The keyword TRAP must be in capital letters and must be followed by one or more decimal digits defining the trap number. Both the keyword and the trap number must be enclosed in curly braces. The trap sent to snmpd is an enterprise-specific trap (generic type 6) with a specific trap number equal to the number specified after the TRAP keyword.

{EXCEPTION}
This action causes the filtd daemon to inform the xmservd daemon each time the defined condition is met. This makes xmservd send a message of type except_rec to all hosts having declared that they want to receive such messages. Section "Requesting Exception Messages" explains how a data-consumer program can request to be informed about exceptions. The message contains the identification, description and other data from the alarm definition. The exact layout of the message is declared in the file /usr/include/sys/Spmidef.h as Exception_Rec and is included in the union of message types in file /usr/include/sys/Rsi.h.

Alarm Duration and Frequency

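The two keywords DURATION and FREQUENCY determine how long a condition must remain true before the alarm is triggered and the minimum time between two triggerings of the same alarm. Their defaults and minimums are:

            Default    Minimum
           ----------  ---------
DURATION   60 seconds  1 second
FREQUENCY  30 minutes  1 minute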

For an alarm to be triggered, at least FREQUENCY minutes must have elapsed since the last time this same alarm was triggered. When this is the case, the condition is monitored constantly. Each time the condition switches from false to true, a time stamp is taken. As long as the condition stays true, the elapsed time since the last time stamp is compared to DURATION and, if it equals or exceeds DURATION, the alarm is triggered.
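
The triggering rule can be sketched as a small state machine in Python. This is an illustration of the rule as described above, not the filtd implementation; the class name AlarmState and its fields are hypothetical, and the choice to ignore the condition entirely until FREQUENCY minutes have elapsed is an assumption.

```python
class AlarmState:
    """Tracks one alarm's DURATION and FREQUENCY bookkeeping."""

    def __init__(self, duration_s: float = 60.0, frequency_min: float = 30.0):
        self.duration_s = duration_s
        self.frequency_s = frequency_min * 60.0
        self.true_since = None       # time stamp of the last false-to-true switch
        self.last_triggered = None   # time the alarm last fired

    def update(self, now: float, condition: bool) -> bool:
        """Feed one evaluation of the condition; return True if the alarm fires."""
        # Assumption: nothing is monitored until FREQUENCY has elapsed.
        if (self.last_triggered is not None
                and now - self.last_triggered < self.frequency_s):
            self.true_since = None
            return False
        if not condition:
            self.true_since = None
            return False
        if self.true_since is None:
            self.true_since = now    # condition just switched to true
        if now - self.true_since >= self.duration_s:
            self.last_triggered = now
            self.true_since = None
            return True
        return False


# DURATION 5 seconds, FREQUENCY 1 minute, condition true at every feed.
alarm = AlarmState(duration_s=5, frequency_min=1)
fired = [alarm.update(t, True) for t in (0, 2, 4, 6, 8)]
print(fired)   # [False, False, False, True, False]
```

In this run the alarm fires at 6 seconds, the first feed at which the condition has been true for at least 5 seconds, and stays silent at 8 seconds because less than one minute has passed since it fired.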

When it can be done without forcing the data feed interval to become less than one second, filtd makes sure at least three data feeds will be taken in DURATION seconds. This is done by modifying the data feed interval, if necessary. Doing this can have side effects on new statistics you have defined, since there's only one data feed interval in use for all raw statistics received by the filtd program, whether the raw statistics are used to define new statistics, to define alarms, or both.

Alarm Severity

A severity code can be associated with an alarm. It is intended for alarms whose actions include sending an except_rec to a data-consumer program. There is no way to associate a severity level with an SNMP trap.

If you do not specify a severity code, a default of 1 is used. Severity can currently be specified as a value from 0 to 10. The higher the value, the more severe the alarm.

Examples of Alarm Definitions

Alarms need not really be alarms. It would be much nicer if the conditions that would normally trigger an alarm could cause corrective action to be taken without human intervention. One example of such corrective action is that of increasing the UDP receive buffers in case of UDP overrun. You could do this with the following "alarm" definition:

@udpfull:[no -o sb_max=262144] UDP_fullsock > 5 DURATION 1

If you wanted an SNMP trap with specific number 31 to be sent in addition to the execution of the no command, you would define the alarm as:

@udpfull:[no -o sb_max=262144]{TRAP31} UDP_fullsock > 5 DURATION 1 \
         "Another UDP buffer overrun"

If you wanted to be informed whenever the paging space on your host has less than 10 percent free space or there's less than 100 pages free paging space, you could use an alarm definition like the following:

Our final example defines an alarm to send an except_rec to interested data-consumer programs whenever the average busy percent for the disks exceeds 50 for more than 5 seconds:

@diskbusy:{EXCEPTION} (Disk_+_busy) / (Disk_#_busy) > 50 DURATION 5 \
           SEVERITY 3 "Disks are more than 50% busy on average"

Using Raw Values and Delta Values

Section "Quantities and Counters" explains the consequences of using counter values when you define new statistics. For an SiCounter statistic, the filtd daemon always assumes that you are referring to the rate per second if you specify the path name of the statistic with no suffix. This means that of the two value fields in a data_feed packet, the one that contains the delta value is used and divided by the time interval covered by the packet to arrive at the rate.

If this is not what you want, one of two available suffixes can be added to the path name to take the corresponding value and not divide it by the time interval. Those two suffixes are:

@Raw
When you use this suffix, the value used to construct the new statistic is the raw counter value of the counter if the statistic is of type SiCounter. If this suffix is used for statistics of type SiQuantity you will see no difference from not using a suffix.
@Delta
When used for statistics of type SiQuantity, the value used to construct the new statistic is undefined. Don't use this suffix for quantities. When used for SiCounter type statistics, the value used to construct the new statistic is the delta value as it appears in the data_feed packet. The value is not divided by the time interval.

The suffixes do not change the anomalies explained in section "Quantities and Counters". They are available because the raw data values or the delta values may be useful in other contexts. It is strongly suggested that any use of the suffixes be thoroughly tested before the results are made available to end users.
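
The choice of value for each combination of statistic type and suffix can be summarized in a Python sketch. The function select_value and its parameters are hypothetical and only model the rules described above; raising an error for @Delta on a quantity is a choice made for this sketch, since the text says only that the value is undefined.

```python
def select_value(kind: str, suffix: str, raw: float, delta: float,
                 interval_s: float) -> float:
    """Pick the value filtd would use for one raw statistic.

    kind   : "counter" (SiCounter) or "quantity" (SiQuantity)
    suffix : "", "@Raw", or "@Delta"
    """
    if kind == "quantity":
        if suffix == "@Delta":
            # The text calls this undefined; the sketch refuses it.
            raise ValueError("@Delta is undefined for SiQuantity statistics")
        return float(raw)            # "" and "@Raw" behave the same
    if suffix == "@Raw":
        return float(raw)            # absolute counter value
    if suffix == "@Delta":
        return float(delta)          # change over the interval, not divided
    return delta / interval_s        # default: rate per second


# Counter sample from the "widgets" table: value 33,246, delta 40, 2 seconds.
print(select_value("counter", "", 33246, 40, 2))        # 20.0
print(select_value("counter", "@Delta", 33246, 40, 2))  # 40.0
print(select_value("counter", "@Raw", 33246, 40, 2))    # 33246.0
```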

To illustrate the use of the suffixes, a few examples follow. The first example shows how to define a new statistic that contains the change in the counter of ticks in user mode:

userdelt = CPU_cpu0_uticks@Delta

The value of "userdelt" is the change of the counter value over the time interval. If the sampling interval is 5 seconds, the value will be approximately five times the average percent user CPU over the interval. If we wanted to see the absolute counter value for the number of ticks in kernel mode, we could say:

userkern = CPU_cpu0_kticks@Raw

The final example defines one new statistic and an alarm that will be triggered when the counter of CPU idle ticks wraps. It looks like this:

idleraw = CPU_cpu0_iticks@Raw
@wrap:{EXCEPTION} CPU_cpu0_iticks@Raw < DDS_IBM_Filters_idleraw \
       SEVERITY 8 DURATION 1 "Idle counter wrapped"

The trick here is that the statistic idleraw is one cycle behind the statistic from which it is derived. Therefore, a wrap can be detected as shown.
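
The one-cycle lag can be imitated in Python by keeping the previous raw sample and comparing it with the current one: a counter that has wrapped shows up as a current value smaller than the previous one. The function name detect_wraps and the sample values (chosen to resemble a 32-bit counter, which is an assumption) are hypothetical.

```python
def detect_wraps(raw_samples: list[int]) -> list[int]:
    """Return the indexes of samples at which the counter wrapped."""
    wraps = []
    previous = None   # plays the role of the lagging idleraw statistic
    for index, current in enumerate(raw_samples):
        # A counter only counts upward, so a drop means it wrapped.
        if previous is not None and current < previous:
            wraps.append(index)
        previous = current
    return wraps


# Made-up raw tick counts approaching and passing the 32-bit limit.
print(detect_wraps([4294967000, 4294967200, 150, 400]))   # [2]
```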

