Managing Shared Disks


Tuning Virtual Shared Disk performance

The IBM Virtual Shared Disk device driver passes all its requests to the underlying Logical Volume Manager subsystem. Before you tune the virtual shared disk, check that the I/O subsystem is not the bottleneck. See AIX Performance and Tuning Guide for information on I/O subsystem performance and tuning. If an overloaded I/O subsystem is degrading your system's performance, tuning the virtual shared disk will not help. In the case of I/O subsystem overload, consider spreading the I/O load over more disks or nodes.
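For example, you might watch disk utilization while the workload runs. This is a minimal sketch; the 5-second interval and 12 samples are illustrative:

# Sample disk and CPU utilization during the workload
iostat 5 12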

For best performance, do the following:

  1. Use the defaults when defining virtual shared disks (refer to Designating nodes as IBM Virtual Shared Disk nodes).
  2. Turn IBM Virtual Shared Disk caching off if you are not using your system for online transaction processing.
  3. Unload the device driver from the kernel. See Unloading the device driver from the kernel.
  4. Do a performance run to collect statistics on the virtual shared disks, your I/O subsystem, and the CPU on all nodes. Issue statvsd several times during the performance run and compare the values for the various statistics. Use iostat to check your disk utilization (a sampling sketch follows this list). If you notice increasing numbers of queued requests, do the following:
     a. Reset the statistics counter by running the ctlvsd command (you can use the Run Command... action of the IBM Virtual Shared Disk Perspective graphical user interface).
     b. Do another performance run.
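The following sketch shows one way to take the snapshots described in step 4. The 60-second interval and the fixed number of snapshots are illustrative:

# Take several statvsd and iostat snapshots during the performance run;
# steadily increasing "queued" counts between snapshots suggest a shortage
i=1
while [ $i -le 5 ]
do
    statvsd
    iostat 5 1
    sleep 60
    i=$(( i + 1 ))
done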

You should generally operate with IBM Virtual Shared Disk caching off. Memory is better allocated to the operating system itself, for paging, and to the cache belonging to the application using the virtual shared disk. To turn IBM Virtual Shared Disk caching off, do the following (a command-level sketch follows this list):

  1. Shut down your applications that use virtual shared disks and stop activity (see Stopping Virtual Shared Disk activity).
  2. If you do not use the IBM Recoverable Virtual Shared Disk subsystem, unconfigure the virtual shared disks (see Unconfiguring Virtual Shared Disks and Hashed Shared Disks).
  3. Select one or more nodes.
  4. Use the Run Command... action and run the updatevsdtab command to change the cache/nocache option to nocache.
  5. If you do not use the IBM Recoverable Virtual Shared Disk subsystem, configure the virtual shared disks (see Configuring Virtual Shared Disks or Hashed Shared Disks).
  6. If you do use the IBM Recoverable Virtual Shared Disk subsystem, refresh the virtual shared disk configuration (see Refreshing the IBM Recoverable Virtual Shared Disk subsystem).
  7. Restart your applications.
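For nodes that do not run the IBM Recoverable Virtual Shared Disk subsystem, the command-line equivalent looks roughly like the following sketch. The -a ("all") flags and the updatevsdtab invocation are assumptions; verify the exact syntax in the PSSP: Command and Technical Reference before use:

# Sketch only -- option names below are assumptions, not confirmed syntax
stopvsd -a            # stop virtual shared disk activity (assumed flag)
ucfgvsd -a            # unconfigure the virtual shared disks (assumed flag)
updatevsdtab -v vsd1  # hypothetical invocation: switch vsd1 to nocache
cfgvsd -a             # configure the virtual shared disks again (assumed flag)
startvsd -a           # restart activity (assumed flag)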

If you do use caching, remember that the IBM Virtual Shared Disk component only caches 4KB requests aligned on 4KB boundaries.

See the PSSP: Command and Technical Reference for command options and syntax.

Tunable parameters related to Virtual Shared Disks

The IBM Virtual Shared Disk device driver has been modified so that it is mostly self-tuning; several parameters that required manual tuning in earlier releases are now managed automatically and are no longer tunable.

The main tunable parameters are:

  - the maximum IP message size
  - the switch send and receive pool sizes
  - the buddy buffer sizes and the total buddy buffer space
  - the cache buffer size

These are discussed with relevant tuning considerations in the following sections. You should also consider the tunable characteristics of the applications that use virtual shared disks, especially the use of buffers.

Logical Volume Manager (LVM) tuning considerations

There is always an associated logical volume for every virtual shared disk defined and configured in a system. Every virtual shared disk I/O request eventually becomes an I/O request to the associated logical volume (unless you get a cache hit at the server). This mapping of virtual shared disk I/O requests to the associated logical volume I/O requests is handled transparently by the IBM Virtual Shared Disk subsystem. All the performance tuning considerations that apply to a logical volume also apply to a virtual shared disk. Refer to AIX System Management Guide: Operating Systems and Devices and AIX Performance and Tuning Guide for information on the performance tuning of logical volumes.
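To see which logical volume backs a given virtual shared disk before tuning it, you can list the mapping and then inspect that logical volume. The logical volume name below is hypothetical:

# List configured virtual shared disks with their underlying logical volumes
lsvsd -l
# Inspect the attributes of one underlying logical volume (name is illustrative)
lslv lvvsd01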

SP Switch considerations when using the IP protocol for data transmissions

If you configure the virtual shared disk nodes to use the SP Switch (css0), set the maximum IP message size (used by the virtual shared disk driver) to 61440 bytes (60KB). Note that the value you assign to maximum_buddy_buffer_size in the SDR also limits the maximum size of the request that the IBM Virtual Shared Disk subsystem sends across the nodes. For example, if you have:

  - a maximum IP message size of 61440 bytes (60KB)
  - a maximum buddy buffer size of 65536 bytes (64KB)
  - a 256KB write request from a client node

the following transmission sequence occurs:

  1. The IBM Virtual Shared Disk subsystem divides the 256KB of data into four 64KB requests in four buddy buffers.
  2. Each 64KB block of data becomes one 60KB packet and one 4KB packet for transmission to the server via IP.
  3. At the server, the eight packets are reassembled into four 64KB blocks of data, each in a 64KB buddy buffer.
  4. The server then has to perform four 64KB write operations and return four acknowledgements to the client.

A better scenario for the same write operation would use the maximum buddy buffer size:

  - a maximum IP message size of 61440 bytes (60KB)
  - a maximum buddy buffer size of 262144 bytes (256KB)
  - the same 256KB write request from the client node

producing the following transmission sequence:

  1. The 256KB request becomes four 60KB packets and one 16KB packet for transmission to the server via IP.
  2. At the server, the five packets are reassembled into one 256KB block of data in a single buddy buffer.
  3. The server then performs one 256KB write operation and returns an acknowledgement to the client.

The second scenario is preferable to the first because the I/O operations at the server are minimized. A perfect scenario would be one where the IBM Virtual Shared Disk component does not use buddy buffers at all -- when the client request is less than or equal to the maximum IP message size. For example, a single 60KB client request travels as one IP packet and is handled at the server without a buddy buffer.

When you use the switch, send pool clusters are used instead of buddy buffers as long as the request size is less than or equal to the maximum IP message size, as in the example just cited. Buddy buffers are used only when a shortage occurs in the switch buffer pool or when the size of the request is greater than the maximum IP message size. If you see buddy buffer shortages, increase your switch send pool size rather than the number of buddy buffers. See mbufs and the switch pool.
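As a rough way to compare scenarios like the two above, the following sketch computes the buddy buffer, packet, and server I/O counts for a given request size. It ignores partially filled trailing buffers, which is exact for the examples in this section; all values are in bytes:

ip_msg=61440                   # maximum IP message size (60KB)
bb_max=262144                  # maximum buddy buffer size (256KB)
request=262144                 # 256KB client write request
chunk=$request                 # amount carried per buddy buffer
if [ $chunk -gt $bb_max ]
then
    chunk=$bb_max
fi
buddies=$(( (request + bb_max - 1) / bb_max ))
packets=$(( buddies * ( (chunk + ip_msg - 1) / ip_msg ) ))
echo "buddy buffers: $buddies  IP packets: $packets  server I/Os: $buddies"

With bb_max set to 65536, this prints four buddy buffers and eight packets (the first scenario); with 262144, one buddy buffer and five packets (the second).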

mbufs and the switch pool

mbufs are used for data transfer between the client and the server nodes by the IBM Virtual Shared Disk subsystem's own UDP-like internet protocol. If you are using the switch (css0) as your communications adapter, the IBM Virtual Shared Disk component uses mbuf clusters to do I/O directly from the switch's send and receive pools for data requests that are less than or equal to 60 KB.

If you notice that the indirect I/O statistic (from the IBM Virtual Shared Disk Perspective Statistics notebook page or from the output of the statvsd command) is incremented consistently, run errpt to check the error log. If you see the line:

IFIOCTL_MGET(): send pool shortage

you should consider increasing the size of the send and receive pools.
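For example, a quick scan of the error log (the grep pattern is illustrative; error log formats vary by PSSP level):

errpt -a | grep -i "send pool shortage"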

To check the current sizes of the send and receive pools, type:

lsattr -E -l css0

The default size for each pool is 524288 bytes (512KB). IBM suggests setting each pool to 16MB.

To change the sizes of the pools to 16MB, type:

/usr/lpp/ssp/css/chgcss -l css0 -a spoolsize=16777216
/usr/lpp/ssp/css/chgcss -l css0 -a rpoolsize=16777216
Note:
You must reboot the node for the new sizes to take effect.

System performance considerations regarding mbufs and mbuf clusters also apply to virtual shared disk environments. See AIX Performance and Tuning Guide for more information.
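For example, to check for system-wide mbuf shortages (interpretation of the counters is covered in the AIX Performance and Tuning Guide):

# Report mbuf and cluster usage; look for failed or denied requests
netstat -m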

Buddy buffers

The virtual shared disk server node uses the buddy buffer to temporarily store data for I/O operations originating at a client node and to handle requests that are greater than the maximum IP message size. In contrast to the data in the cache buffer, the data in a buddy buffer is purged immediately after the I/O operation completes.

The values associated with the buddy buffer are:

  - the minimum buddy buffer size allocated to a single request
  - the maximum buddy buffer size allocated to a single request
  - the total size of the buddy buffer

These values can be set using the IBM Virtual Shared Disk Perspective graphical user interface or the vsdnode command.

Buddy buffer space is allocated in powers of two. If an I/O request size is not a power of two, the smallest power of two that is larger than the request is allocated. For example, for a request size of 24KB, 32KB are allocated on the server.
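The following sketch shows the rounding rule, assuming the suggested 4KB minimum buddy buffer size; sizes are in bytes:

request=$(( 24 * 1024 ))       # a 24KB request
alloc=4096                     # start at the 4KB minimum buddy buffer size
while [ $alloc -lt $request ]
do
    alloc=$(( alloc * 2 ))
done
echo "request: $request bytes  allocated: $alloc bytes"   # prints 32768 (32KB)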

If you are using the switch as your adapter for virtual shared disks, IBM suggests setting the minimum and maximum buddy buffer sizes allocated to a single request to 4096 bytes (4KB) and 262144 bytes (256KB), respectively.

To define the total size of the buddy buffer, consider the remote I/O throughput for the server and specify the number of maximum-sized buddy buffers in the buffer. For example, if you expect the server to serve 10MB per second on behalf of remote clients and a request spends an average of 60 milliseconds on the server, multiply 10MB per second by 0.06 seconds and, for safety, double or triple the result. Tripling gives a total buddy buffer size of 1.8MB, or eight 256KB maximum-sized buddy buffers. IBM suggests a range of 32-96 buddy buffers (where the maximum buddy buffer size has been set to 256KB).
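The worked example above, as a sketch (all sizes in bytes; the factor of three is the upper end of the "double or triple" safety margin):

throughput=$(( 10 * 1024 * 1024 ))   # 10MB per second served for remote clients
service_ms=60                        # average time a request spends on the server
safety=3                             # double or triple the result for safety
bb_max=262144                        # 256KB maximum buddy buffer size
total=$(( throughput * service_ms / 1000 * safety ))
count=$(( (total + bb_max - 1) / bb_max ))
echo "total buddy buffer space: $total bytes ($count maximum-sized buffers)"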

If the virtual shared disk statistics consistently show requests queued waiting for buddy buffers, check the switch send and receive pool sizes before trying to add more buddy buffers (see mbufs and the switch pool), or spread the data over disks attached to other nodes to prevent a bottleneck.

Note:
If your application uses the fastpath option of asynchronous I/O, the maximum buddy buffer size must be greater than or equal to 128KB. Otherwise, you will get EMSGSIZE "Message too long" errors.

SP Switch considerations when using the KLAPI protocol for data transmissions

The maximum IP message size is not used to break up data requests when the IBM Virtual Shared Disk subsystem transmits data using the KLAPI protocol. The maximum IP message size should still be set to 61440 (60KB) because the IP protocol will still be used at times.

Buddy buffers

The KLAPI protocol always uses buddy buffers. See Buddy buffers for more information.

Buffer allocation

Your application should align all newly allocated buffers on page boundaries. If your I/O buffer is not aligned on a page boundary, the IBM Virtual Shared Disk device driver will not parallelize I/O requests to the underlying virtual shared disks, and performance will be degraded. In addition, the KLAPI transport service requires the data to be page aligned to avoid a data copy.

The cache buffer

Each IBM Virtual Shared Disk device driver, that is, each node, has a single cache buffer, shared by all cacheable virtual shared disks configured on and served by the node. The cache buffer is used to store the most recently accessed data from the cached virtual shared disks (associated logical volumes) on the server node. The objective is to minimize physical disk I/O activity. If the requested data is found in the cache, it is read from the cache, rather than the corresponding logical volume.

Data in the cache is stored in 4KB blocks. The content of the cache is a replica of the corresponding data blocks on the physical disks. Write-through cache semantics apply; that is, the write operation is not complete until the data is on the disk.

When you create virtual shared disks with the IBM Virtual Shared Disk Perspective graphical user interface or the createvsd command, you can specify the cache option or the nocache option. IBM suggests that you specify nocache (or make the cache buffer small) in most instances, especially for read-only applications or applications whose requests are not 4KB-aligned, for the following reasons:

  - Only 4KB requests aligned on 4KB boundaries are cached; all other requests bypass the cache.
  - Because write-through semantics apply, write operations never complete faster with the cache.
  - Memory used for the cache buffer is generally better left to the operating system for paging or given to the cache of the application using the virtual shared disk.

If you are running an application that involves heavy writing followed immediately by reading, it might be advantageous to turn the cache buffer on for some virtual shared disks on a particular node. Choose the appropriate size for the cache based on the expected throughput and the expected time lag between writes and reads. For example, if the expected throughput is 100 4KB-aligned I/O operations per second and reads lag writes by 0.5 seconds, calculate the cache buffer size by multiplying 100 by 0.5 and, as a safety factor, doubling it, for a total of 100 cache blocks.
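The same calculation as a sketch (the rate, lag, and safety factor are the example's assumptions):

iops=100            # expected 4KB-aligned I/O operations per second
lag_ms=500          # reads lag writes by half a second
safety=2            # double as a safety factor
blocks=$(( iops * lag_ms / 1000 * safety ))
echo "suggested cache size: $blocks 4KB blocks"    # prints 100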

The lsvsd -s command gives detailed statistics on virtual shared disk cache hits and I/O activities. This will tell you which virtual shared disks are heavily used. See Monitoring Virtual Shared Disks for information on how to see statistics using the IBM Virtual Shared Disk Perspective and the Event Perspective graphical user interfaces.

Maximum I/O request size

The following factors limit the block size that the IBM Virtual Shared Disk subsystem uses to process each I/O request:

  - the maximum buddy buffer size
  - the maximum IP message size (when the IP protocol is used for data transmissions)

The atomicity of an I/O operation is gated by the size of the virtual shared disk request, rather than by the size of the application request; if the virtual shared disk request is smaller than the application request, the application request is split into pieces no larger than the virtual shared disk request. The atomicity of an I/O operation is also gated by the maximum buddy buffer size.

