
Performance Management Guide

Performance Overview of Fixed-Disk Storage Management

The following illustration shows the hierarchy of structures used by the operating system to manage fixed-disk storage. Each individual disk drive, called a physical volume (PV), has a name, such as /dev/hdisk0. If the physical volume is in use, it belongs to a volume group (VG). All of the physical volumes in a volume group are divided into physical partitions (PPs) of the same size (by default, 4 MB in volume groups that include physical volumes smaller than 4 GB, and 8 MB or more in volume groups that contain larger disks).

For space-allocation purposes, each physical volume is divided into five regions. See Position on Physical Volume for more information. The number of physical partitions in each region varies, depending on the total capacity of the disk drive.

Figure 10. Organization of Fixed-Disk Data (Unmirrored). The illustration shows the hierarchy of a physical volume that is partitioned into one or more logical volumes. These logical volumes contain file systems with directory structures, which contain individual files. Files are written to blocks contained in tracks on the storage media, and these blocks are usually not contiguous. Disk fragmentation occurs when data is erased and new data files are written to the empty blocks that are scattered across multiple tracks on the media.

Within each volume group, one or more logical volumes (LVs) are defined. Each logical volume consists of one or more logical partitions. Each logical partition corresponds to at least one physical partition. If mirroring is specified for the logical volume, additional physical partitions are allocated to store the additional copies of each logical partition. Although the logical partitions are numbered consecutively, the underlying physical partitions are not necessarily consecutive or contiguous.

Logical volumes can serve a number of system purposes, such as paging, but each logical volume that holds ordinary system data or user data or programs contains a single journaled file system (JFS or Enhanced JFS). Each JFS consists of a pool of page-size (4096-byte) blocks. When data is to be written to a file, one or more additional blocks are allocated to that file. These blocks may or may not be contiguous with one another and with other blocks previously allocated to the file.

For purposes of illustration, the previous figure shows a bad (but not the worst possible) situation that might arise in a file system that had been in use for a long period without reorganization. The /op/filename file is physically recorded on a large number of blocks that are physically distant from one another. Reading the file sequentially would result in many time-consuming seek operations.

While an operating system's file is conceptually a sequential and contiguous string of bytes, the physical reality might be very different. Fragmentation may arise from multiple extensions to logical volumes as well as allocation/release/reallocation activity within a file system. A file system is fragmented when its available space consists of large numbers of small chunks of space, making it impossible to write out a new file in contiguous blocks.

Access to files in a highly fragmented file system may result in a large number of seeks and longer I/O response times (seek latency dominates I/O response time). For example, if the file is accessed sequentially, a file placement that consists of many, widely separated chunks requires more seeks than a placement that consists of one or a few large contiguous chunks. If the file is accessed randomly, a placement that is widely dispersed requires longer seeks than a placement in which the file's blocks are close together.

The effect of a file's placement on I/O performance diminishes when the file is buffered in memory. When a file is opened in the operating system, it is mapped to a persistent data segment in virtual memory. The segment represents a virtual buffer for the file; the file's blocks map directly to segment pages. The VMM manages the segment pages, reading file blocks into segment pages upon demand (as they are accessed). There are several circumstances that cause the VMM to write a page back to its corresponding block in the file on disk; but, in general, the VMM keeps a page in memory if it has been accessed recently. Thus, frequently accessed pages tend to stay in memory longer, and logical file accesses to the corresponding blocks can be satisfied without physical disk accesses.

At some point, the user or system administrator can choose to reorganize the placement of files within logical volumes and the placement of logical volumes within physical volumes to reduce fragmentation and to more evenly distribute the total I/O load. Chapter 8. Monitoring and Tuning Disk I/O Use contains further details about detecting and correcting disk placement and fragmentation problems.

Sequential-Access Read Ahead

The VMM tries to anticipate the future need for pages of a sequential file by observing the pattern in which a program is accessing the file. When the program accesses two successive pages of the file, the VMM assumes that the program will continue to access the file sequentially, and the VMM schedules additional sequential reads of the file. These reads overlap with the program processing and make the data available to the program sooner than if the VMM waited for the program to access the next page before initiating the I/O. The number of pages to be read ahead is determined by two VMM thresholds:

minpgahead
Number of pages read ahead when the VMM first detects the sequential access pattern. If the program continues to access the file sequentially, the next read ahead will be for 2 times minpgahead, the next for 4 times minpgahead, and so on until the number of pages reaches maxpgahead.
maxpgahead
Maximum number of pages the VMM will read ahead in a sequential file.

On Enhanced JFS, the number of pages to read ahead is determined by the corresponding thresholds, j2_minPageReadAhead and j2_maxPageReadAhead.
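
The following minimal sketch in C illustrates the access pattern that triggers sequential read-ahead. The file name is hypothetical, and the threshold values in the comments (a minpgahead of 2 and a maxpgahead of 8) are illustrative only; the values in effect on a given system may differ.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define PAGE 4096

    int main(void)
    {
        /* Hypothetical file; any large file read sequentially will do. */
        int fd = open("/tmp/bigfile", O_RDONLY);
        char buf[PAGE];
        ssize_t n;

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /*
         * Reading page 0 causes an ordinary demand page-in.  When the
         * program reads page 1, the VMM detects the sequential pattern
         * and schedules minpgahead pages (say, 2) of read-ahead.  Each
         * time the program catches up with the pages already read
         * ahead, the next read-ahead doubles: 2, 4, 8, ... until
         * maxpgahead is reached, so later read() calls are satisfied
         * from memory without waiting on the disk.
         */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;   /* process buf here */

        close(fd);
        return 0;
    }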

Write Behind

To increase write performance, limit the number of dirty file pages in memory, reduce system overhead, and minimize disk fragmentation, the file system divides each file into 16 KB partitions. The pages of a given partition are not written to disk until the program writes the first byte of the next 16 KB partition. At that point, the file system forces the four dirty pages of the first partition to be written to disk. Because those pages have already been written, no additional I/O is required when their frames are eventually reused; and if a program accesses any of the pages before their frames are reused, no disk I/O is required either.
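
A minimal sketch of the partition mechanism follows; the file name is hypothetical. Each write() in the loop supplies one 4096-byte page, so the write that begins a new 16 KB partition is the one that causes the previous partition's four dirty pages to be queued to disk.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE 4096

    int main(void)
    {
        /* Hypothetical output file on a JFS file system. */
        int fd = open("/tmp/out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char buf[PAGE];
        int i;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 'x', sizeof(buf));

        for (i = 0; i < 8; i++) {
            /*
             * Pages 0-3 fill the first 16 KB partition and stay dirty
             * in memory.  The write() that stores the first byte of
             * page 4 (the start of the second partition) causes the
             * file system to queue the four dirty pages of the first
             * partition to disk, without the program's involvement.
             */
            write(fd, buf, sizeof(buf));
        }

        close(fd);
        return 0;
    }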

If a large number of dirty file pages remain in memory and do not get reused, the sync daemon eventually writes them all to disk, which can cause bursts of abnormally high disk utilization. To distribute the I/O activity more evenly across the workload, write-behind can be turned on to tell the system how many pages to keep in memory before writing them to disk. Because the write-behind threshold is applied on a per-file basis, pages are written to disk before the sync daemon runs, and the I/O is spread more evenly throughout the workload.

There are two types of write-behind: sequential and random. The size of the write-behind partitions and the write-behind threshold can be changed with the vmtune command (see VMM Write-Behind).

Memory Mapped Files and Write Behind

Normal files are automatically mapped to segments to provide mapped files. This means that normal file access bypasses traditional kernel buffers and block I/O routines, allowing files to use more memory when the extra memory is available (file caching is not limited to the declared kernel buffer area).

Files can be mapped explicitly with the shmat() or mmap() subroutines, but this provides no additional memory space for their caching. Applications that use the shmat() or mmap() subroutines to map a file explicitly and access it by address rather than with the read() and write() subroutines can avoid some system-call path length, but they lose the benefit of the system write-behind feature.

When applications do not use the write() subroutine, modified pages tend to accumulate in memory and be written randomly when purged by the VMM page-replacement algorithm or the sync daemon. This results in many small writes to the disk that cause inefficiencies in CPU and disk utilization, as well as fragmentation that might slow future reads of the file.
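
The following minimal sketch shows such an explicit mapping with mmap(). The file name is hypothetical (the file must already exist and be at least a few bytes long), and the msync() call stands in for the write-behind that the application gives up by not using write().

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical file; must exist and be at least 7 bytes long. */
        int fd = open("/tmp/data.dat", O_RDWR);
        struct stat sb;
        char *p;

        if (fd < 0 || fstat(fd, &sb) < 0) {
            perror("open/fstat");
            return 1;
        }

        p = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /*
         * Stores through the pointer dirty the underlying pages, but
         * no write() is issued, so the file system's write-behind
         * logic never sees the file.  The dirty pages linger until
         * page replacement or the sync daemon flushes them, unless
         * the program forces the issue itself:
         */
        memcpy(p, "updated", 7);
        msync(p, sb.st_size, MS_SYNC);   /* explicit flush to disk */

        munmap(p, sb.st_size);
        close(fd);
        return 0;
    }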

Disk-I/O Pacing

Because most writes are asynchronous, FIFO I/O queues of several megabytes can build up, which can take several seconds to complete. The performance of an interactive process is severely impacted if every disk read spends several seconds working its way through the queue. In response to this problem, the VMM has an option called I/O pacing to control writes.

I/O pacing does not change the interface or processing logic of I/O. It simply limits the number of I/Os that can be outstanding against a file. When a process tries to exceed that limit, it is suspended until enough outstanding requests have been processed to reach a lower threshold. For example, with a high-water mark of 33 pending writes and a low-water mark of 24, a process that tries to issue a write while 33 writes to the file are already pending is put to sleep until the number of pending writes falls to 24. Using Disk-I/O Pacing describes I/O pacing in more detail.

Support for Pinned Memory

AIX 4.3.3 and later versions enable memory pages to be maintained in real memory at all times. This mechanism is called pinning memory. Pinning a memory region prevents the pager from stealing the pages that back the pinned region. Memory regions defined in either system space or user space can be pinned. After a memory region is pinned, accessing that region does not result in a page fault until the region is subsequently unpinned. While a portion of the kernel remains pinned at all times, many kernel regions are pageable and are pinned only while being accessed.

The advantage of having portions of memory pinned is that a pinned page can be retrieved without going through the page-replacement algorithm. An adverse side effect of having too many pinned memory pages is that paging activity can increase for the unpinned pages, which degrades performance.

To tune pinned memory, use the vmtune command to dedicate a number of pages at boot time for pinned memory. The following flags affect how AIX manages pinned memory:

maxpin
Specifies the maximum percentage of real memory that can be pinned.
v_pinshm
Setting the v_pinshm parameter to 1 (-S 1) causes pages in shared memory segments to be pinned by the VMM if the application that calls shmget() specifies SHM_PIN as part of the flags (see the sketch following this list). The default value is 0. This option is available only in AIX 4.3.3 and later.
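
The following minimal sketch shows the application side of this arrangement. It assumes an AIX level on which SHM_PIN is defined in <sys/shm.h> and on which the administrator has set v_pinshm to 1; on systems where the flag is not defined, the example does not compile.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /*
         * Request a pinned 1 MB shared memory segment.  SHM_PIN is
         * honored only if v_pinshm has been enabled with vmtune -S 1;
         * otherwise the segment is created unpinned.
         */
        int id = shmget(IPC_PRIVATE, 1024 * 1024,
                        IPC_CREAT | SHM_PIN | 0600);

        if (id == -1) {
            perror("shmget");
            return 1;
        }
        printf("pinned segment id: %d\n", id);

        /* Remove the segment again so the sketch cleans up after itself. */
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }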

Large Page Support

In addition to the regular 4 KB page size, the operating system supports large, 16 MB pages beginning with AIX 5.1. Applications can use large pages with the shmget and shmat system calls. To use large pages, an application must specify the SHM_LGPAGE flag with the shmget system call, in conjunction with the SHM_PIN flag, and large-page support must be enabled with the vmtune command.

To enable support for large pages, use the following flags with the vmtune command:

-g LargePageSize
Specifies the size, in bytes, of the hardware-supported large pages used to implement the shmget system call with the SHM_LGPAGE flag. For this change to take effect, large pages must be enabled with a non-zero value for the -L flag, the bosboot command must be run, and the system must be restarted.
-L LargePages
Specifies the number of large pages to reserve for implementing the shmget system call with the SHM_LGPAGE flag. For this change to take effect, you must specify the -g flag, run the bosboot command, and restart the system.

Use the following flags with the shmget system call:

SHM_LGPAGE
Creates the region so that it can be mapped through hardware-supported, large-page mechanisms, if enabled. This flag must be used in conjunction with the SHM_PIN flag: memory for the region must be reserved with the vmtune -L command (which requires a restart), and SHM_PIN must be enabled with vmtune -S. This flag has no effect on shared memory regions created with the EXTSHM=ON environment variable.
SHM_PIN
Pins the shared memory region, if pinning has been enabled with the vmtune command. This flag has no effect on shared memory regions created with the EXTSHM=ON environment variable. The sketch following this list shows both flags in use.
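
A minimal sketch combining both flags follows. It assumes an AIX 5.1 system on which large pages have already been reserved with vmtune -g and -L (followed by bosboot and a restart) and v_pinshm has been enabled with vmtune -S 1; SHM_LGPAGE and SHM_PIN are AIX-specific flags from <sys/shm.h>.

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /*
         * Request a pinned, large-page shared memory region one
         * 16 MB page in size.  The request succeeds only if large
         * pages have been reserved (vmtune -g/-L plus a restart)
         * and SHM_PIN has been enabled (vmtune -S 1).
         */
        int id = shmget(IPC_PRIVATE, 16 * 1024 * 1024,
                        IPC_CREAT | SHM_LGPAGE | SHM_PIN | 0600);
        void *addr;

        if (id == -1) {
            perror("shmget");
            return 1;
        }

        addr = shmat(id, NULL, 0);
        if (addr == (void *)-1) {
            perror("shmat");
            return 1;
        }
        printf("large-page region attached at %p\n", addr);

        shmdt(addr);
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }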
