User threads provide an independent flow of control within a process. If the user threads need to access kernel services (such as system calls), they are serviced by associated kernel threads. User threads are provided in various software packages, the most notable being the pthreads shared library (libpthreads.a). In the libpthreads implementation, user threads sit on top of virtual processors (VPs), which in turn sit on top of kernel threads. A multithreaded user process can use one of two models, as follows:
Within the libpthreads.a framework, a series of tuning knobs has been provided that might impact the performance of the application. If possible, the application developer should provide a front-end shell script to invoke the binary executable programs, in which the developer may specify new values to override the system defaults. These environment variables are as follows:
The AIXTHREAD_COND_DEBUG variable maintains a list of condition variables for use by the debugger. If the program contains a large number of active condition variables and frequently creates and destroys condition variables, maintaining this list can create higher overhead. Setting the variable to OFF disables the list. Leaving this variable turned on makes debugging threaded applications easier, but may impose some overhead.
This variable enables or disables the pthread resource collection. Turning it on allows for resource collection of all pthreads in a process, but will impose some overhead.
For AIX 4.3 and later:
*  +-----------------------+
*  |   pthread attr        |
*  +-----------------------+ <--- pthread->pt_attr
*  |   pthread struct      |
*  +-----------------------+ <--- pthread->pt_stk.st_limit
*  |   pthread stack       |
*  |        |              |
*  |        V              |
*  +-----------------------+ <--- pthread->pt_stk.st_base
*  |   RED ZONE            |
*  +-----------------------+ <--- pthread->pt_guardaddr
*  | pthread private data  |
*  +-----------------------+ <--- pthread->pt_data

The RED ZONE on this illustration is called the Guardpage.
Starting with AIX 5.2, the pthread attr, pthread, and ctx represent the PTH_FIXED part of the memory allocated for a pthread.
The approximate byte sizes in the diagram below are shown in brackets ([]) for 32-bit processes. For 64-bit processes, expect the pieces comprising PTH_FIXED to be slightly larger and the key data to be 8 KB, but the layout is otherwise the same.
*  +-----------------------+
*  |  page alignment 2     |
*  |  [8K-4K+PTH_FIXED-a1] |
*  +-----------------------+
*  | pthread ctx [368]     |
*  +-----------------------+ <--- pthread->pt_attr
*  | pthread attr [112]    |
*  +-----------------------+ <--- pthread->pt_attr
*  | pthread struct [960]  |
*  +-----------------------+ <--- pthread
*  | pthread stack         |   pthread->pt_stk.st_limit
*  |   | [96K+4K-PTH_FIXED]|
*  |   V                   |
*  +-----------------------+ <--- pthread->pt_stk.st_base
*  |  RED ZONE [4K]        |
*  +-----------------------+ <--- pthread->pt_guardaddr
*  | pthread key data [4K] |
*  +-----------------------+ <--- pthread->pt_data
*  | page alignment 1 (a1) |
*  |  [<4K]                |
*  +-----------------------+

The RED ZONE on this illustration is called the Guardpage.
n is the decimal number of guard pages to add to the end of the pthread stack. This setting overrides the attribute values that are specified at pthread creation time. If the application specifies its own stack, no guard pages are created. The default is 0, and n must be a positive value.
The guardpage size in bytes is determined by multiplying n by the PAGESIZE, a system-determined page size.
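Where an application manages its own thread attributes, a guard area can also be requested per thread through the standard POSIX attributes interface, which is what the environment setting described above overrides. The following is a minimal sketch, assuming the POSIX pthread_attr_setguardsize subroutine is available; the one-page size is illustrative and error handling is omitted.

    #include <pthread.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        return arg;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;
        long pagesize = sysconf(_SC_PAGESIZE);   /* system-determined page size */

        pthread_attr_init(&attr);
        /* Request one guard page at the end of this thread's stack.
           The environment setting described above overrides this attribute. */
        pthread_attr_setguardsize(&attr, (size_t)pagesize);

        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }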
AIXTHREAD_MNRATIO controls the scaling factor of the library. This ratio is used when creating and terminating pthreads. It may be useful for applications with a very large number of threads. However, always test a ratio of 1:1 because it may provide better performance.
This variable maintains a list of active mutexes for use by the debugger. If the program contains a large number of active mutexes and frequently creates and destroys mutexes, maintaining this list can create higher overhead. Setting the variable to ON makes debugging threaded applications easier, but may impose additional overhead. Leaving the variable off disables the list.
This variable maintains a list of read-write locks for use by the debugger. If the program contains a large number of active read-write locks and frequently creates and destroys read-write locks, maintaining this list can create higher overhead. Setting the variable to OFF disables the list.
P signifies process-wide contention scope (M:N) and S signifies system-wide contention scope (1:1). Either P or S should be specified and the current default is process-wide scope.
The use of this environment variable impacts only those threads created with the default attribute. The default attribute is employed when the attr parameter of the pthread_create() subroutine is NULL.
If a user thread is created with system-wide scope, it is bound to a kernel thread and it is scheduled by the kernel. The underlying kernel thread is not shared with any other user thread.
If a user thread is created with process-wide scope, it is subject to the user scheduler. It does not have a dedicated kernel thread; it sleeps in user mode; it is placed on the user run queue when it is waiting for a processor; and it is subjected to time slicing by the user scheduler.
Tests on AIX 4.3.2 have shown that certain applications can perform much better with the 1:1 model.
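For comparison, contention scope can also be requested per thread through the attributes object rather than through the environment. The following is a minimal, generic POSIX sketch (not AIX-specific code, error handling omitted) that requests the 1:1 model for a single thread.

    #include <pthread.h>

    static void *worker(void *arg)
    {
        return arg;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        /* 1:1 model: bind the new thread to its own kernel thread.
           PTHREAD_SCOPE_PROCESS would request the M:N user scheduler instead. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }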
This thread tuning variable controls the number of kernel threads that should be held in reserve for sleeping threads. In general, fewer kernel threads are required to support sleeping pthreads because they are generally woken one at a time. This conserves kernel resources.
n is the decimal number of bytes that should be allocated for each pthread. This value may be overridden by the pthread_attr_setstacksize subroutine.
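For example, the following minimal sketch overrides the default stack allocation for one thread with the pthread_attr_setstacksize subroutine mentioned above; the 256 KB size is purely illustrative and error handling is omitted.

    #include <pthread.h>

    static void *worker(void *arg)
    {
        return arg;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        /* Give this thread a 256 KB stack instead of the default
           (illustrative size; must be at least PTHREAD_STACK_MIN). */
        pthread_attr_setstacksize(&attr, 256 * 1024);

        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }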
Malloc buckets provides an optional buckets-based extension of the default allocator. It is intended to improve malloc performance for applications that issue large numbers of small allocation requests. When malloc buckets is enabled, allocation requests that fall within a predefined range of block sizes are processed by malloc buckets. All other requests are processed in the usual manner by the default allocator.
Malloc buckets is not enabled by default. It is enabled and configured prior to process startup by setting the MALLOCTYPE and MALLOCBUCKETS environment variables.
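For example, a front-end script might enable the buckets allocator before invoking the program; this is only a sketch, with MALLOCTYPE=buckets selecting the buckets extension, while the MALLOCBUCKETS variable (whose exact option syntax is given in the reference below) can then tune the bucket configuration:

# export MALLOCTYPE=buckets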
For more information on malloc buckets, see General Programming Concepts: Writing and Debugging Programs.
Multiple heaps are required so that a threaded application can have more than one thread issuing malloc(), free(), and realloc() subroutine calls. With a single heap, all threads trying to do a malloc(), free(), or realloc() call would be serialized (that is, only one thread can do a malloc, free, or realloc at a time). The result is a serious impact on multi-processor machines. With multiple heaps, each thread gets its own heap. If all heaps are being used, then any new thread trying to do a malloc, free, or realloc will have to wait (that is, serialize) until one or more of the heaps becomes available again. We still have serialization, but its likelihood and impact are greatly reduced.
The thread-safe locking has been changed to handle this approach. Each heap has its own lock, and the locking routine "intelligently" selects a heap to try to prevent serialization. If considersize is set in the MALLOCMULTIHEAP environment variable, then the selection will also try to select any available heap that has enough free space to handle the request instead of just selecting the next unlocked heap.
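The kind of workload this helps is one in which several threads allocate and free concurrently. The following is a minimal sketch of such a workload; the thread count, iteration count, and block size are arbitrary.

    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS 4

    /* Each thread repeatedly allocates and frees small blocks.  With a single
       heap these calls serialize on one heap lock; with MALLOCMULTIHEAP set,
       each thread can usually work against its own heap in parallel. */
    static void *churn(void *arg)
    {
        int i;
        for (i = 0; i < 100000; i++) {
            void *p = malloc(64);
            free(p);
        }
        return arg;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, churn, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }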
More than one option can be specified (and in any order) as long as they are comma-separated, for example, MALLOCMULTIHEAP=considersize,heaps:3. The options are:
The default for MALLOCMULTIHEAP is NOT SET (only the first heap is used). If the environment variable MALLOCMULTIHEAP is set (for example, MALLOCMULTIHEAP=1), then the threaded application will be able to use all 32 heaps. Setting MALLOCMULTIHEAP=heaps:n limits the number of heaps to n instead of the maximum of 32 heaps.
For more information, see the Malloc Multiheap section in AIX 5L Version 5.2 General Programming Concepts: Writing and Debugging Programs.
Controls the number of times that the system will try to get a busy mutex or spin lock without taking a secondary action such as calling the kernel to yield the process. This control is intended for MP systems, where it is hoped that the lock being held by another actively running pthread will be released. The parameter works only within libpthreads (user threads). The kernel parameter MAXSPIN affects spinning in the kernel lock routines (see The schedtune -s Command). If locks are usually available within a short amount of time, you may want to increase the spin time by setting this environment variable. The number of times to retry a busy lock before yielding to another pthread is n. The default is 40 and n must be a positive value.
Controls the number of times that the system yields the processor when trying to acquire a busy mutex or spin lock before actually going to sleep on the lock. The processor is yielded to another kernel thread, assuming there is another runnable one with sufficient priority. This variable has been shown to be effective in complex applications, where multiple locks are in use. The number of times to yield the processor before blocking on a busy lock is n. The default is 0 and n must be a positive value.
The following environment variables impact the scheduling of pthreads created with process-wide contention scope.
The pthreads library maintains a list of active mutexes, condition variables, and read-write locks for use by the debugger.
When a lock is initialized, it is added to the list, assuming that it is not already on the list. The list is held as a linked list, so determining that a new lock is not already on the list has a performance implication when the list gets large. The problem is compounded by the fact that the list is protected by a lock (dbx__mutexes), which is held across the search of the list. In this case other calls to the pthread_mutex_init() subroutine are held while the search is done.
If the following environment variables are set to OFF, which is the default, then the appropriate debugging list will be disabled completely. That means the dbx command (or any debugger using the pthread debug library) will show no objects in existence.
To set any of these environment variables to ON, use the following command:
# export variable_name=ON
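For example, to enable the condition-variable list described earlier:

# export AIXTHREAD_COND_DEBUG=ON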
Depending on the type of application, the administrator can choose to use a different thread model. Tests on AIX 4.3.2 have shown that certain applications can perform much better with the 1:1 model. This is an important point because the default as of AIX 4.3.1 is M:N. By simply setting the environment variable AIXTHREAD_SCOPE=S for that process, we can switch the thread model to 1:1 and then compare performance against that of the M:N model.
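For example, the front-end script described earlier could contain:

# export AIXTHREAD_SCOPE=S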
If you see an application creating and deleting threads, it could be that kernel threads are being harvested because of the 8:1 default ratio of user threads to kernel threads. This harvesting, along with the overhead of library scheduling, can affect performance. On the other hand, when thousands of user threads exist, there may be less overhead in scheduling them in user space in the library rather than managing thousands of kernel threads. You should always try changing the scope if you encounter a performance problem when using pthreads; in many cases, the system scope can provide better performance.
If an application is running on an SMP system and a user thread cannot acquire a mutex, it will attempt to spin up to 40 times. It could easily be the case that the mutex becomes available within a short amount of time, so it may be worthwhile to spin for a longer period. If performance goes down as you add more CPUs, this usually indicates a locking problem. You might want to increase the spin time by setting the environment variable SPINLOOPTIME=n, where n is the number of spins. It is not unusual to set the value as high as the thousands, depending on the speed of the CPUs and the number of CPUs. Once the spin count has been exhausted, the thread can go to sleep waiting for the mutex to become available, or it can issue the yield() system call and simply give up the CPU but stay in a runnable state rather than going to sleep. By default, it goes to sleep, but by setting the YIELDLOOPTIME environment variable to a number, it will yield up to that many times before going to sleep. Each time it gets the CPU after yielding, it can try to acquire the mutex.
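For example, before starting such an application you might export both variables; the values below are only illustrative starting points and should be validated by measurement:

# export SPINLOOPTIME=1000
# export YIELDLOOPTIME=4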
Certain multithreaded user processes that use the malloc subsystem heavily may obtain better performance by exporting the environment variable MALLOCMULTIHEAP=1 before starting the application. The potential performance improvement is particularly likely for multithreaded C++ programs, because these may make use of the malloc subsystem whenever a constructor or destructor is called. Any available performance improvement will be most evident when the multithreaded user process is running on an SMP system, and particularly when system scope threads are used (M:N ratio of 1:1). However, in some cases, improvement may also be evident under other conditions, and on uniprocessors.
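For example:

# export MALLOCMULTIHEAP=1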