
Performance Management Guide

Thread Tuning

User threads provide independent flows of control within a process. When user threads need kernel services (such as system calls), they are serviced by associated kernel threads. User threads are provided by various software packages, most notably the pthreads shared library (libpthreads.a). In the libpthreads implementation, user threads sit on top of virtual processors (VPs), which in turn sit on top of kernel threads. A multithreaded user process can use one of two models: the 1:1 model, in which each user thread is bound to its own kernel thread, or the M:N model, in which user threads share a pool of kernel threads under the control of a user scheduler.

Thread Environment Variables

Within the libpthreads.a framework, a series of tuning knobs has been provided that might affect the performance of the application. If possible, the application developer should provide a front-end shell script that invokes the binary executable programs and in which the developer can specify new values to override the system defaults. These environment variables are as follows:


SPINLOOPTIME=n

Controls the number of times that the system tries to get a busy mutex or spin lock without taking a secondary action such as calling the kernel to yield the process. This control is intended for MP systems, where it is hoped that a lock held by another actively running pthread will be released soon. The parameter works only within libpthreads (user threads). The kernel parameter MAXSPIN affects spinning in the kernel lock routines (see The schedtune -s Command). If locks are usually available within a short amount of time, you may want to increase the spin time by setting this environment variable. The number of times to retry a busy lock before yielding to another pthread is n. The default is 40 and n must be a positive value.


YIELDLOOPTIME=n

Controls the number of times that the system yields the processor when trying to acquire a busy mutex or spin lock before actually going to sleep on the lock. The processor is yielded to another kernel thread, assuming there is another runnable one with sufficient priority. This variable has been shown to be effective in complex applications, where multiple locks are in use. The number of times to yield the processor before blocking on a busy lock is n. The default is 0 and n must be a positive value.


AIXTHREAD_SCOPE={P|S}

P signifies process-wide contention scope (M:N) and S signifies system-wide contention scope (1:1). Either P or S should be specified; the current default is process-wide scope.

This environment variable affects only those threads created with the default attribute. The default attribute is used when the attr parameter to the pthread_create() subroutine is NULL.

If a user thread is created with system-wide scope, it is bound to a kernel thread and it is scheduled by the kernel. The underlying kernel thread is not shared with any other user thread.

If a user thread is created with process-wide scope, it is subject to the user scheduler. It does not have a dedicated kernel thread; it sleeps in user mode; it is placed on the user run queue when it is waiting for a processor; and it is subjected to time slicing by the user scheduler.

Tests on AIX 4.3.2 have shown that certain applications can perform much better with the 1:1 model.


AIXTHREAD_GUARDPAGES=n

The decimal number of guard pages to add to the end of the pthread stack is n. It overrides the attribute values that are specified at pthread creation time. If the application specifies its own stack, no guard pages are created. The default is 0 and n must be a positive value.


MALLOCMULTIHEAP={considersize,heaps:n}

Multiple heaps are required so that a threaded application can have more than one thread issuing malloc(), free(), and realloc() subroutine calls at the same time. With a single heap, all threads trying to do a malloc(), free(), or realloc() call are serialized (that is, only one thread can do malloc/free/realloc at a time). The result is a serious performance impact on multiprocessor machines. With multiple heaps, each thread gets its own heap. If all heaps are in use, any new thread trying to malloc/free/realloc has to wait (that is, serialize) until one or more of the heaps becomes available again. Serialization can still occur, but its likelihood and impact are greatly reduced.

The thread-safe locking has been changed to handle this approach. Each heap has its own lock, and the locking routine "intelligently" selects a heap to try to prevent serialization. If considersize is set in the MALLOCMULTIHEAP environment variable, then the selection will also try to select any available heap that has enough free space to handle the request instead of just selecting the next unlocked heap.

More than one option can be specified (and in any order) as long as they are comma-separated, for example, MALLOCMULTIHEAP=considersize,heaps:3. The options are:

heaps:n

The number of heaps used can be changed with this option. If n is not valid (that is, n<=0 or n>32), 32 is used.

considersize

Uses a different heap-selection algorithm that tries to minimize the working set size of the process. The default is not to consider size and to use the faster algorithm.

By default, MALLOCMULTIHEAP is not set, and only the first heap is used. If the environment variable MALLOCMULTIHEAP is set (for example, MALLOCMULTIHEAP=1), the threaded application can use all 32 heaps. Setting MALLOCMULTIHEAP=heaps:n limits the number of heaps to n instead of 32.
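The rules above for how many heaps end up in use can be restated as a small shell function. This is only an illustrative sketch: effective_heaps is a hypothetical helper, not part of AIX, and it simply encodes the documented cases (variable not set, variable set, heaps:n given, invalid n). Treating a non-numeric n as invalid is an assumption.

```shell
#!/bin/sh
# effective_heaps [VALUE]
# Hypothetical helper: prints the number of heaps implied by the rules in
# the text for a given MALLOCMULTIHEAP value. Call it with no argument to
# model the "not set" case.
effective_heaps() {
    if [ "$#" -eq 0 ]; then
        echo 1                  # not set: only the first heap is used
        return 0
    fi
    heaps=32                    # set: all 32 heaps unless heaps:n limits it
    rest=$1
    while [ -n "$rest" ]; do
        opt=${rest%%,*}         # next comma-separated option
        case $rest in
            *,*) rest=${rest#*,} ;;
            *)   rest= ;;
        esac
        case $opt in
            heaps:*)
                n=${opt#heaps:}
                case $n in
                    ''|*[!0-9]*) n=32 ;;    # assumption: non-numeric is invalid
                esac
                if [ "$n" -le 0 ] || [ "$n" -gt 32 ]; then
                    n=32                    # invalid n: 32 is used
                fi
                heaps=$n
                ;;
        esac
    done
    echo "$heaps"
}
```

For example, effective_heaps considersize,heaps:3 prints 3, while effective_heaps heaps:40 prints 32 because 40 is outside the valid range.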

Variables for Process-Wide Contention Scope

The following environment variables impact the scheduling of pthreads created with process-wide contention scope.

AIXTHREAD_MNRATIO=p:k

where k is the number of kernel threads that should be employed to handle p runnable pthreads. This environment variable controls the scaling factor of the library. The ratio is used when creating and terminating pthreads. The variable is valid only with process-wide scope; with system-wide scope, it is ignored. The default setting is 8:1.

AIXTHREAD_SLPRATIO=k:p

where k is the number of kernel threads that should be held in reserve for p sleeping pthreads. The sleep ratio is the number of kernel threads to keep on the side in support of sleeping pthreads. In general, fewer kernel threads are required to support sleeping pthreads, because they are generally woken one at a time. This conserves kernel resources. Any positive integer value may be specified for p and k. If k>p, the ratio is treated as 1:1. The default is 1:12.

AIXTHREAD_MINKTHREADS=n

where n is the minimum number of kernel threads that should be used. The library scheduler will not reclaim kernel threads below this figure. A kernel thread may be reclaimed at virtually any point. Generally, a kernel thread is targeted for reclaim as a result of a pthread terminating. The default is 8.
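As a rough sketch of applying these settings to a single run rather than to the whole login shell, the hypothetical helper below places them in the environment of one command. It assumes the three settings described above are the ones the AIX pthreads library names AIXTHREAD_MNRATIO, AIXTHREAD_SLPRATIO, and AIXTHREAD_MINKTHREADS. The values restate the documented defaults and have an effect only on AIX.

```shell
#!/bin/sh
# run_process_scope COMMAND [ARGS...]
# Hypothetical wrapper: runs one command with process-wide (M:N) scope and
# the three process-scope variables set explicitly. The values are the
# documented defaults; change one at a time when experimenting.
run_process_scope() {
    env AIXTHREAD_SCOPE=P \
        AIXTHREAD_MNRATIO=8:1 \
        AIXTHREAD_SLPRATIO=1:12 \
        AIXTHREAD_MINKTHREADS=8 \
        "$@"
}
```

A call such as run_process_scope ./my_app (where my_app stands for any threaded program) leaves the calling shell's environment unchanged, which makes it easier to compare runs with different ratios.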

Thread Debug Options

The pthreads library maintains a list of active mutexes, condition variables, and read-write locks for use by the debugger.

When a lock is initialized, it is added to the list, assuming that it is not already on the list. The list is held as a linked list, so determining that a new lock is not already on the list becomes expensive when the list gets large. The problem is compounded by the fact that the list is protected by a lock (dbx__mutexes), which is held across the search of the list. While the search is in progress, other calls to the pthread_mutex_init() subroutine are blocked.

If the following environment variables are set to OFF (the default is ON), the corresponding debugging list is disabled completely. That means the dbx command (or any debugger using the pthread debug library) will show no objects in existence.

AIXTHREAD_MUTEX_DEBUG
AIXTHREAD_COND_DEBUG
AIXTHREAD_RWLOCK_DEBUG

To change any of these environment variables, use the following command:

# export variable_name=OFF
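To turn off all three debugging lists at once, the export can be repeated in a loop. This sketch assumes the variables are named AIXTHREAD_MUTEX_DEBUG, AIXTHREAD_COND_DEBUG, and AIXTHREAD_RWLOCK_DEBUG, as in the AIX pthreads library documentation; confirm the names against the documentation for your AIX level before relying on them.

```shell
#!/bin/sh
# Disable the mutex, condition-variable, and read-write lock debug lists
# for every program started from this shell afterwards.
# Assumption: these are the variable names used by the AIX pthreads library.
for v in AIXTHREAD_MUTEX_DEBUG AIXTHREAD_COND_DEBUG AIXTHREAD_RWLOCK_DEBUG; do
    export "$v=OFF"
done
```

With the lists disabled, a debugger that relies on the pthread debug library will not show these objects, so this is best reserved for production or benchmark runs.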

Thread Tuning Summary

Depending on the type of application, the administrator can choose to use a different thread model. Tests on AIX 4.3.2 have shown that certain applications can perform much better with the 1:1 model. This is an important point because the default as of AIX 4.3.1 is M:N. By simply setting the environment variable AIXTHREAD_SCOPE=S for that process, we can set the thread model to 1:1 and then compare the performance to its previous performance when the thread model was M:N.
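The comparison described above might be scripted along the following lines. This is a minimal sketch under stated assumptions: compare_scope is a hypothetical helper, date +%s is assumed to be available, and whole-second wall-clock time is a crude measure that suits only workloads running for many seconds.

```shell
#!/bin/sh
# compare_scope COMMAND [ARGS...]
# Hypothetical helper: runs the same command once with process-wide scope
# (M:N) and once with system-wide scope (1:1), printing the elapsed
# wall-clock seconds for each run. Command output is discarded.
compare_scope() {
    for scope in P S; do
        start=$(date +%s)       # assumption: date supports the %s format
        env AIXTHREAD_SCOPE="$scope" "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "AIXTHREAD_SCOPE=$scope elapsed=$((end - start))s"
    done
}
```

Running compare_scope ./my_app several times (my_app is a placeholder) and looking at the spread of results is safer than trusting a single pair of numbers.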

If you see an application creating and deleting threads, it could be that the kernel threads are being harvested because of the default 8:1 ratio of user threads to kernel threads. This harvesting, along with the overhead of library scheduling, can affect performance. On the other hand, when thousands of user threads exist, there may be less overhead in scheduling them in user space in the library than in managing thousands of kernel threads. You should always try changing the scope if you encounter a performance problem when using pthreads; in many cases, system scope provides better performance.

On an SMP system, a user thread that cannot acquire a mutex spins up to 40 times by default before taking further action. If the mutex typically becomes available within a short time, spinning longer may be worthwhile. Performance that degrades as CPUs are added usually indicates a locking problem. In that case, consider increasing the spin count by setting the environment variable SPINLOOPTIME=n, where n is the number of spins. Values in the thousands are not unusual, depending on the speed and number of CPUs.

Once the spin count has been exhausted, the thread can either go to sleep waiting for the mutex to become available, or issue the yield() system call, which gives up the CPU but leaves the thread in a runnable state. By default, the thread goes to sleep. If the YIELDLOOPTIME environment variable is set to a number, the thread yields up to that many times before going to sleep, and each time it regains the CPU after yielding it can try again to acquire the mutex.
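One way to explore these two settings is to rerun the same workload with different values and compare the results. The helpers below are hypothetical, and the spin counts in the sweep (the default of 40, then 100, 500, and 1000) are arbitrary starting points chosen for illustration, not recommendations.

```shell
#!/bin/sh
# run_lock_tuned SPINS YIELDS COMMAND [ARGS...]
# Hypothetical wrapper: runs one command with explicit spin and yield counts.
run_lock_tuned() {
    spins=$1
    yields=$2
    shift 2
    env SPINLOOPTIME="$spins" YIELDLOOPTIME="$yields" "$@"
}

# sweep_spin COMMAND [ARGS...]
# Hypothetical helper: reruns one command with increasing SPINLOOPTIME
# values, leaving YIELDLOOPTIME untouched so only one knob changes per run.
sweep_spin() {
    for spins in 40 100 500 1000; do
        echo "SPINLOOPTIME=$spins"
        env SPINLOOPTIME="$spins" "$@"
    done
}
```

For instance, sweep_spin ./my_app runs a placeholder program four times, and run_lock_tuned 500 4 ./my_app then adds up to four yields before the thread sleeps. Timing each run is left to whatever measurement the application already provides.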

Certain multithreaded user processes that use the malloc subsystem heavily may obtain better performance by exporting the environment variable MALLOCMULTIHEAP=1 before starting the application. The potential performance improvement is particularly likely for multithreaded C++ programs, because these may make use of the malloc subsystem whenever a constructor or destructor is called. Any available performance improvement will be most evident when the multithreaded user process is running on an SMP system, and particularly when system scope threads are used (M:N ratio of 1:1). However, in some cases, improvement may also be evident under other conditions, and on uniprocessors.
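A front-end script for such an application might apply these settings only when the caller has not already chosen values. The sketch below is hypothetical: launch_tuned is an illustrative name, and pairing MALLOCMULTIHEAP=1 with AIXTHREAD_SCOPE=S is just one combination suggested by the paragraph above, not a general recommendation.

```shell
#!/bin/sh
# launch_tuned COMMAND [ARGS...]
# Hypothetical front-end: supplies default tuning values but lets any value
# already present in the caller's environment take precedence. The function
# body runs in a subshell, so the caller's environment is never modified.
launch_tuned() (
    : "${MALLOCMULTIHEAP:=1}"       # use multiple heaps unless overridden
    : "${AIXTHREAD_SCOPE:=S}"       # use the 1:1 model unless overridden
    export MALLOCMULTIHEAP AIXTHREAD_SCOPE
    exec "$@"
)
```

With this in place, launch_tuned ./my_app uses the defaults, while exporting MALLOCMULTIHEAP=heaps:4 beforehand overrides just that one setting (my_app is a placeholder name).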
