Symmetrical Multiprocessor (SMP) Concepts and Architecture

AIX Versions 3.2 and 4 Performance Tuning Guide

Symmetrical Multiprocessor (SMP) Concepts and Architecture

As with any change that increases the complexity of the system, the use of multiple processors generates design considerations that must be addressed for satisfactory operation and performance. The additional complexity gives more scope for hardware/software tradeoffs and requires closer hardware/software design coordination than in uniprocessor systems. The different combinations of design responses and tradeoffs give rise to a wide variety of multiprocessor system architectures.

This section describes the main design considerations of multiprocessor systems and the responses of AIX and the RS/6000 to those considerations.

Perhaps the most fundamental decision in designing a multiprocessor system is whether the system will be symmetrical or asymmetrical.

The major design considerations are:

Symmetrical vs Asymmetrical Multiprocessors
Data Serialization
Lock Granularity
Locking Overhead
Cache Coherency
Processor Affinity
Memory and Bus Contention

Symmetrical vs Asymmetrical Multiprocessors

In an asymmetrical multiprocessor system, the processors are assigned different roles. One processor may handle I/O, while others execute user programs, and so forth. Some of the advantages and disadvantages of this approach are:

By restricting certain operations to a single processor, some forms of data serialization and cache coherency problems (see below) can be reduced or avoided. Some parts of the software may be able to operate as though they were running in a uniprocessor.
In some situations, I/O-operation or application-program processing may be faster because it does not have to contend with other parts of the operating system or the workload for access to a processor.
In other situations, I/O-operation or application-program processing can be slowed because not all of the processors are available to handle peak loads.
The existence of a single processor handling specific work creates a unique point of failure for the system as a whole.

In a symmetrical multiprocessor system, all of the processors are essentially identical and perform identical functions:

All of the processors work with the same virtual and real address spaces.
Any processor is capable of running any thread in the system.
Any processor can handle any external interrupt. (Each processor handles the internal interrupts generated by the instruction stream it is executing.)
Any processor can initiate an I/O operation.

This interchangeability means that all of the processors are potentially available to handle whatever needs to be done next. The cost of this flexibility is primarily borne by the hardware and software designers, although symmetry also makes the limits on the multiprocessability of the workload more noticeable, as we shall see.

The RS/6000 family contains, and AIX Version 4 supports, only symmetrical multiprocessors, one form of which is shown in the figure "Symmetrical Multiprocessor System." Different systems may have different cache configurations.

Although RS/6000 multiprocessor systems are technically symmetrical, a minimal amount of asymmetry is introduced by the software. A single processor is initially in control during the boot process. This first processor to be started is designated as the "master processor." To ensure that user-written software continues to run correctly during the transition from uniprocessor to multiprocessor environments, device drivers and kernel extensions that do not explicitly describe themselves as able to run safely on multiple processors are forced to run only on the master processor. This constraint is called "funnelling."

Data Serialization

Any storage element that can be read or written by more than one thread may change while the program is running. This is generally true of multiprogramming environments as well as multiprocessing environments, but the advent of multiprocessors adds to the scope and importance of this consideration in two ways:

Multiprocessors and thread support make it attractive and easier to write applications that share data among threads.
The kernel can no longer solve the serialization problem simply by disabling interrupts.

To avoid disaster, programs that share data must arrange to access that data serially, rather than in parallel. Before a program touches a shared data item, it must ensure that no other program (including another copy of itself running on another thread) will change the item.

The primary mechanism that is used to keep programs from interfering with one another is the lock. A lock is an abstraction that represents permission to access one or more data items. Lock and unlock requests are atomic; that is, they are implemented in such a way that neither interrupts nor multiprocessor access affect the outcome. All programs that access a shared data item must obtain the lock that corresponds to that data item before manipulating it. If the lock is already held by another program (or another thread running the same program), the requesting program must defer its access until the lock becomes available.

Besides the time spent waiting for the lock, serialization adds to the number of times a thread becomes nondispatchable. While the thread is nondispatchable, other threads are probably causing the nondispatchable thread's cache lines to be replaced, which will result in increased memory-latency costs when the thread finally gets the lock and is dispatched.

The AIX kernel contains many shared data items, so it must perform serialization internally. This means that serialization delays can occur even in an application program that does not share data with other programs, because the kernel services used by the program have to serialize on shared kernel data.

Lock Granularity

A programmer working in a multiprocessor environment must decide how many separate locks should be created for shared data. If there is a single lock to serialize the entire set of shared data items, lock contention is comparatively likely. If each distinct data item has its own lock, the probability of two threads contending for that lock is comparatively low. Each additional lock and unlock call costs processor time, however, and the existence of multiple locks makes a deadlock possible. At its simplest, deadlock is the situation shown in the figure "Deadlock," in which Thread 1 owns Lock A and is waiting for Lock B, while Thread 2 owns Lock B and is waiting for Lock A. Neither program will ever reach the unlock call that would break the deadlock. The usual preventive for deadlock is to establish a protocol by which all of the programs that use a given set of locks must always acquire them in exactly the same sequence.

Locking Overhead

Requesting locks, waiting for locks, and releasing locks add processing overhead in several ways:

A program that supports multiprocessing always does the same lock and unlock processing, even though it is running in a uniprocessor or is the only user in a multiprocessor system of the locks in question.
When one thread requests a lock held by another thread, the requesting thread may spin for a while or be put to sleep and, if possible, another thread dispatched. This consumes processor time.
The existence of widely used locks places an upper bound on the throughput of the system. For example, if a given program spends 20% of its execution time holding a mutual-exclusion lock, at most 5 instances of that program can run simultaneously, regardless of the number of processors in the system. In fact, even 5 instances would probably never be so nicely synchronized as to avoid waiting on one another (see "Multiprocessor Throughput Scalability").

Cache Coherency

In designing a multiprocessor, engineers give considerable attention to ensuring cache coherency. They succeed; but their success is not free. To understand why cache coherency has a performance cost, we need to understand the problem being attacked:

If each processor has a cache (see the "Symmetrical Multiprocessor System" figure), which reflects the state of various parts of memory, it is possible that two or more caches may have copies of the same line. It is also possible that a given line may contain more than one lockable data item. If two threads make appropriately serialized changes to those data items, the result could be that both caches end up with different, incorrect versions of the line of memory; that is, the system's state is no longer coherent -- the system contains two different versions of what is supposed to be the content of a specific area of memory.

The solutions to the cache coherency problem usually include invalidating all but one of the duplicate lines. Although the invalidation is done by the hardware, without any software intervention, any processor whose cache line has been invalidated will have a cache miss, with its attendant delay, the next time that line is addressed.

For a detailed background discussion of RS/6000 addressing architecture and cache operation, see Appendix C. "Cache and Addressing Considerations."

Processor Affinity

If a thread is interrupted and later redispatched to the same processor, there may still be lines in that processor's cache that belong to the thread. If the thread is dispatched to a different processor, it will probably experience a series of cache misses until its cache working set has been retrieved from RAM. On the other hand, if a dispatchable thread has to wait until the processor it was previously running on is available, the thread may experience an even longer delay.

Processor affinity is the dispatching of a thread to the processor that was previously executing it. The degree of emphasis on processor affinity should vary directly with the size of the thread's cache working set and inversely with the length of time since it was last dispatched.

In AIX Version 4, processor affinity can be achieved by binding a thread to a processor. A thread that is bound to a processor can run only on that processor, regardless of the status of the other processors in the system.

Memory and Bus Contention

In a uniprocessor, contention for some internal resources, such as banks of memory and I/O or memory buses, is usually a minor component processing time. In a multiprocessor these effects can become more significant, particularly if cache-coherency algorithms add to the number of accesses to RAM.