Performance Management Guide

UDP and TCP/IP Performance Overview

To understand the performance characteristics of UDP (user datagram protocol) and TCP/IP, you must first understand some of the underlying architecture. The following figure illustrates the structure that will be discussed in this chapter.

Figure 24. UDP and TCP/IP Data Flow. The figure shows the path of data from an application in one system to another application in a remote system. The steps of the data flow are described in the text immediately following the illustration.

Note

For the address resolution protocol (ARP), see Send Flow.

The figure shows the path of data from an application in one system to another application in a remote system. The processing at each of the layers is discussed in this chapter, but key points are as follows:

The application's write request causes the data to be copied from the application's working segment to the socket send buffer.
The socket layer or subsystem calls UDP or TCP.
The operating system has variable size clusters, so an optimum size is used when:
- UDP copies the data into a socket buffer and computes the checksum of the data.
- TCP copies the data into a socket buffer.
If the size of the data is larger than the maximum transfer unit (MTU) of the LAN, then:
- TCP breaks the output into segments that comply with the MTU limit.
- UDP leaves the breaking up of the output to the IP layer.
If necessary, IP fragments the output into pieces that comply with the MTU, so that no outgoing packet exceeds the MTU limit.
The packets are put on the device output queue and transmitted by the LAN adapter to the receiving system. If the output queue for the device overflows, the packet is discarded.
Arriving packets are placed on the device driver's receive queue, and pass through the Interface layer to IP.
If IP in the receiving system determines that IP in the sending system had fragmented a block of data, it coalesces the fragments into their original form and passes the data to TCP or UDP.
- TCP reassembles the original segments and places the input in the socket receive buffer.
- UDP passes the input on to the socket receive buffer. If the input socket (udp_recvspace) limit is reached, the packet is discarded.
When the application makes a read request, the appropriate data is copied from the socket receive buffer in kernel memory into the application's buffer.

Communication Subsystem Memory (mbuf) Management

To avoid fragmentation of kernel memory and the overhead of numerous calls to the xmalloc() subroutine, the various layers of the communication subsystem share common buffer pools. The mbuf management facility controls different buffer sizes. The pools consist of pinned pieces of kernel virtual memory; this means that they always reside in physical memory and are never paged out. The result is that the real memory available for paging in application programs and data has been decreased by the amount that the mbuf pools have been increased.

In addition to avoiding duplication, sharing the mbuf and cluster pools allows the various layers to pass pointers to one another, reducing mbuf management calls and copying of data.

For additional details, see Tuning mbuf Pool Performance.

Socket Layer

Sockets provide the application program interface (API) to the communication subsystem. Several types of sockets provide various levels of service by using different communication protocols. Sockets of type SOCK_DGRAM use the UDP protocol. Sockets of type SOCK_STREAM use the TCP protocol.

The processes of opening, reading, and writing to sockets are similar to those for manipulating files.

The sizes of the buffers in system virtual memory (that is, the total number of bytes from the mbuf pools) that are used by the input and output sides of each socket are limited by system-wide default values (which can be overridden for a given socket by a call to the setsockopt() subroutine):

udp_sendspace and udp_recvspace: Buffer sizes for datagram sockets in bytes. The defaults are 9216 and 42080, respectively.
tcp_sendspace and tcp_recvspace: Buffer sizes for stream sockets in bytes. The defaults for both values are 16384. With AIX 4.3.3 and later, these two parameters can also be set using ISNO (see Interface-Specific Network Options (ISNO)).
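
The per-socket override mentioned above can be sketched as follows. This is a minimal illustration in Python, whose socket module wraps the setsockopt() subroutine; the 32768-byte request is an arbitrary example value, and the size the kernel actually grants depends on the platform and its limits.

```python
import socket

def set_buffer_sizes(sock, send_bytes, recv_bytes):
    """Request per-socket send/receive buffer sizes and return what was granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_bytes)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )

# Example: a datagram socket with larger-than-default buffers (illustrative sizes).
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    granted = set_buffer_sizes(s, 32768, 32768)
    print("granted send/recv buffer sizes:", granted)
```

Reading the values back with getsockopt() is worthwhile because the system may adjust or cap the requested size.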

Use the following to display these values:

# no -a

A root user can set these values as follows:

# no -o udp_sendspace=NewValue

The NewValue parameter must be less than or equal to the sb_max parameter, which controls the maximum amount of space that can be used by a socket's send or receive buffer. The default value of the sb_max parameter depends on the operating system version and amount of real memory. The sb_max value is displayed with the command no -a and set with the no command, as follows:

# no -o sb_max=NewLimit

Note

Socket send or receive buffer sizes are limited to no more than sb_max bytes, because sb_max is a ceiling on buffer space consumption. The two quantities are not measured in the same way, however. The socket buffer size limits the amount of data that can be held in the socket buffers. The sb_max value limits the number of bytes of mbufs that can be in the socket buffer at any given time. In an Ethernet environment, for example, each 2048-byte mbuf cluster might hold just 1500 bytes of data. In that case, sb_max would have to be 1.37 times larger than the specified socket buffer size to allow the buffer to reach its specified capacity. The guideline is to set sb_max to at least twice the size of the largest socket buffer.
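
The arithmetic in this note can be worked through with a short sketch. The function names are illustrative only; the 2048-byte cluster and 1500-byte payload are the Ethernet figures from the note.

```python
import math

def mbuf_bytes_needed(socket_buffer_bytes, cluster_size=2048, payload_per_cluster=1500):
    """Bytes of mbuf clusters consumed when every cluster carries one Ethernet payload."""
    clusters = math.ceil(socket_buffer_bytes / payload_per_cluster)
    return clusters * cluster_size

def recommended_sb_max(largest_socket_buffer):
    """Guideline from the note: at least twice the largest socket buffer."""
    return 2 * largest_socket_buffer

ratio = 2048 / 1500                  # about 1.37
needed = mbuf_bytes_needed(65536)    # mbuf bytes needed to hold 64 KB of data
print(round(ratio, 2), needed, recommended_sb_max(65536))
```

For a 64 KB socket buffer, 44 clusters (90112 bytes of mbufs) are needed, so the twice-the-buffer guideline of 131072 bytes leaves comfortable headroom.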

Send Flow

As an application writes to a socket, the socket layer calls the transport layer (either TCP or UDP), which copies the data from user space into the socket send buffer in kernel space. Depending on the amount of data being copied into the socket send buffer, the code puts the data into either mbufs or clusters.

Receive Flow

On the receive side, an application opens a socket and attempts to read data from it. If there is no data in the socket receive buffer, the socket layer causes the application thread to go to the sleep state (blocking) until data arrives. When data arrives, it is put on the receive socket buffer queue and the application thread is made dispatchable. The data is then copied into the application's buffer in user space, the mbuf chain is freed, and control is returned to the application.

Socket Creation

In AIX 4.3.1 and later, the sockthresh value determines how much of the system's network memory can be used before socket creation is disallowed. The value of sockthresh is given as a percentage of thewall. It has a default of 85 percent and can be set to any value from 1 to 100. However, sockthresh cannot be set to a value lower than the amount of memory currently in use.

The sockthresh option is intended to prevent situations where many connections are opened until all the network memory on the machine is used. This leaves no memory for other operations, and the machine hangs and must be rebooted to recover. Use sockthresh to set the point at which new sockets should not be allowed. Calls to the socket() and socketpair() subroutines will fail with an error of ENOBUFS, and incoming connection requests will be silently discarded. This allows the remaining network memory to be used by existing connections and prevents the machine from hanging.

The netstat -m statistic sockets not created because sockthresh was reached is incremented each time a socket creation fails because the amount of network memory already in use is over the sockthresh limit.
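
An application can treat the ENOBUFS failure described above as a signal to back off rather than as a fatal error. The following is a minimal sketch in Python; exhausted_factory is a hypothetical stand-in that simulates the sockthresh limit being reached, and it is not part of any real API.

```python
import errno
import socket

def try_create_socket(factory=socket.socket):
    """Return a new TCP socket, or None if network memory is exhausted (ENOBUFS)."""
    try:
        return factory(socket.AF_INET, socket.SOCK_STREAM)
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            return None  # caller can shed load or retry later
        raise

def exhausted_factory(*args):
    """Hypothetical stand-in: behaves like socket() once sockthresh is reached."""
    raise OSError(errno.ENOBUFS, "No buffer space available")

print(try_create_socket(exhausted_factory))  # None: creation was refused
```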

Use the following to display the sockthresh value:

# no -o sockthresh

A root user can set the value as follows:

# no -o sockthresh=NewValue

The default value can be set as follows:

# no -d sockthresh

Ephemeral Ports

When an application asks the system to assign a port rather than requesting a specific port number, the assigned port is called an ephemeral port. Prior to AIX 4.3.1, the ephemeral port range was from 1024 to 5000. Starting with AIX 4.3.1, the default starting ephemeral port number is 32768, and the default largest ephemeral port number is 65535.

Using the no command, these values can be tuned with the tcp_ephemeral_low and tcp_ephemeral_high parameters. The maximum range would be to set tcp_ephemeral_low to 1024 and tcp_ephemeral_high to 65535. UDP ports have the same tunable parameters available through udp_ephemeral_low and udp_ephemeral_high (defaults are identical).
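
A program obtains an ephemeral port by binding to port 0 and then asking which port it was given. The sketch below assumes the loopback address is available; in_ephemeral_range is an illustrative helper whose defaults are the AIX 4.3.1 values quoted above, and other systems use different ranges.

```python
import socket

def get_ephemeral_port():
    """Bind to port 0 so that the system picks the port, then report it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def in_ephemeral_range(port, low=32768, high=65535):
    """Check a port against an ephemeral range (defaults: AIX 4.3.1 and later)."""
    return low <= port <= high

port = get_ephemeral_port()
print("system-assigned port:", port)
```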

Relative Level of Function in UDP and TCP

The following two sections contain descriptions of the function of UDP and TCP. To facilitate comparison of UDP and TCP, both descriptions are divided into subsections on connection, error detection, error recovery, flow control, data size, and MTU handling.

UDP Layer

UDP provides a low-cost protocol for applications that have the facilities to deal with communication failures. UDP is most suitable for request-response applications. Because such an application has to handle a failure to respond anyway, it is little additional effort to handle communication error as one of the causes of failure to respond. For this reason, and because of its low overhead, subsystems such as NFS, ONC RPC, DCE RPC, and DFS use UDP.

Features of the UDP layer are as follows:

Connection: None. UDP is essentially a stateless protocol. Each request received from the caller is handled independently of those that precede or follow it. (If the connect() subroutine is called for a datagram socket, the information about the destination is considered a hint to cache the resolved address for future use. It does not actually bind the socket to that address or affect UDP on the receiving system.)
Error detection: Checksum creation and verification. The sending UDP builds the checksum and the receiving UDP checks it. If the check fails, the packet is dropped.
Error recovery: None. UDP does not acknowledge receipt of packets, nor does it detect their loss in transmission or through buffer-pool overflow. Consequently, UDP never retransmits a packet. Recovery must be performed by the application.
Flow control: None. When UDP is asked to send, it sends the packet to IP. When a packet arrives from IP, it is placed in the socket-receive buffer. If either the device driver/adapter buffer queue or the socket-receive buffer is full when the packet arrives, the packet is dropped without an error indication. The application or subsystem that sent the packet must detect the failure by timeout or sequence and retry the transmission. Various statistics show counts of discarded packets (see the netstat -s and netstat -D commands in The netstat Command).
Data size: Must fit in one buffer. This means that the buffer pools on both sides of UDP must have buffer sizes that are adequate for the applications' requirements. The maximum size of a UDP packet is 64 KB. Of course, an application that builds large blocks can break them into multiple datagrams itself (for example, DCE), but it is simpler to use TCP.
MTU handling: None. Dealing with data larger than the maximum transfer unit (MTU) size for the interface is left to IP. If IP has to fragment the data to make it fit the MTU, loss of one of the fragments becomes an error that the application or subsystem must deal with through timeout and retransmit logic.
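
Because UDP offers no error recovery, the timeout-and-retry logic falls to the application. The following is a minimal sketch of such a loop in Python over the loopback interface. The lossy_echo_server function is a hypothetical test server that ignores the first datagram to imitate a loss; a real protocol would also carry a sequence number so that late replies can be matched to requests.

```python
import socket
import threading

def request_with_retry(sock, server_addr, payload, retries=3, timeout=0.2):
    """Send a datagram and wait for a reply, resending whenever the wait times out."""
    sock.settimeout(timeout)
    for attempt in range(1, retries + 1):
        sock.sendto(payload, server_addr)
        try:
            reply, _ = sock.recvfrom(65535)
            return reply, attempt
        except socket.timeout:
            continue  # request or reply was lost; send again
    raise TimeoutError(f"no reply after {retries} attempts")

def lossy_echo_server(sock, drop_first=1):
    """Hypothetical server: drops the first datagram(s), then answers one request."""
    seen = 0
    while True:
        data, addr = sock.recvfrom(65535)
        seen += 1
        if seen > drop_first:
            sock.sendto(data.upper(), addr)
            return

def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    worker = threading.Thread(target=lossy_echo_server, args=(server,), daemon=True)
    worker.start()
    try:
        return request_with_retry(client, server.getsockname(), b"ping")
    finally:
        worker.join(timeout=2)
        client.close()
        server.close()

reply, attempts = demo()
print(reply, attempts)
```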

Send Flow

If udp_sendspace is large enough to hold the datagram, the application's data is copied into mbufs in kernel memory. If the datagram is larger than udp_sendspace, an error is returned to the application.

The operating system chooses optimum-size buffers from a set of power-of-2 buffer sizes. For example, a write of 8704 bytes is copied into two clusters, an 8192-byte cluster and a 512-byte cluster. UDP adds the UDP header (in the same mbuf, if possible), checksums the data, and calls the IP ip_output() routine.
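
The 8704-byte example can be reproduced with a simple greedy choice of power-of-2 buffers. This is only an illustration of the idea; the minimum and maximum sizes (32 and 16384 bytes) are assumptions, and the real allocator's policy is not described here.

```python
def choose_clusters(nbytes, min_size=32, max_size=16384):
    """Greedily cover nbytes with power-of-2 buffers, largest first.

    min_size and max_size are assumed bounds, not values taken from the system.
    """
    chosen = []
    remaining = nbytes
    while remaining > 0:
        size = max_size
        # Halve until the buffer no longer exceeds what is left (or hits the minimum).
        while size > min_size and size > remaining:
            size //= 2
        chosen.append(size)
        remaining -= size
    return chosen

print(choose_clusters(8704))  # [8192, 512]
```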

Receive Flow

UDP verifies the checksum and queues the data onto the proper socket. If the udp_recvspace limit is exceeded, the packet is discarded. A count of these discards is reported by the netstat -s command under udp: as socket buffer overflows. If the application is waiting for a receive or read on the socket, it is put on the run queue. This causes the receive to copy the datagram into the user's address space and release the mbufs, and the receive is complete. Usually, the receiver responds to the sender to acknowledge the receipt and also return a response message.

In AIX 4.1.1 and later, UDP checksums the data "on the fly" when it copies it into the kernel mbuf. When receiving, this same optimization can be done, but the application must enable it with the SO_CKSUMRECV option on a setsockopt() call. Applications that receive large UDP buffers should program to use this option for better performance.
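
The idea of checksumming "on the fly" is that the checksum is accumulated in the same pass that copies the data, instead of in a second pass over the buffer. The sketch below shows this with the standard 16-bit one's-complement Internet checksum. It is an illustration only: it omits the pseudo-header that the real UDP checksum also covers, and it does not use the SO_CKSUMRECV option.

```python
def copy_and_checksum(src: bytes, dst: bytearray) -> int:
    """Copy src into dst and return the 16-bit Internet checksum, in a single pass.

    dst must already be at least len(src) bytes long.
    """
    total = 0
    for i in range(0, len(src), 2):
        pair = src[i:i + 2]
        dst[i:i + len(pair)] = pair                # the copy
        high = pair[0]
        low = pair[1] if len(pair) == 2 else 0     # pad an odd trailing byte with zero
        total += (high << 8) | low                 # the checksum, same pass
    while total >> 16:                             # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

data = bytes.fromhex("0001f203f4f5f6f7")
buf = bytearray(len(data))
cksum = copy_and_checksum(data, buf)
print(hex(cksum))  # 0x220d
```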

TCP Layer

TCP provides a reliable transmission protocol. TCP is most suitable for applications that, at least for periods of time, are mostly output or mostly input. With TCP ensuring that packets reach their destination, the application is freed from error detection and recovery responsibilities. Applications that use TCP transport include ftp, rcp, and telnet. DCE can use TCP if it is configured to use a connection-oriented protocol.

Features of the TCP layer are as follows:

Connection

Explicit. The instance of TCP that receives the connection request from an application (call it the initiator, sender, or transmitter) establishes a session with its counterpart on the other system, which we will call the listener, or receiver. All exchanges of data and control packets are within the context of that session.

Error detection

Checksum creation and verification. The sending TCP builds the checksum and the receiving TCP checks it. If checksum verification fails, the receiver does not acknowledge receipt of the packet.
Some PCI adapters now have TCP checksum offload. For example, the Gigabit Ethernet adapter offloads the checksum for both transmits and receives, and the ATM 155 adapter offloads it for transmits. Offload is on by default. Transmit offload can be disabled with the ifconfig command and the checksum_offload parameter, while receive offload requires a chdev command to set cx_checksum=no.

Error recovery

Full. TCP detects checksum failures and loss of a packet or fragment through timeout. In error situations TCP retransmits the data until it is received correctly (or notifies the application of an unrecoverable error).

Flow control

Enforced. TCP uses a discipline called a sliding window to ensure delivery to the receiving application. The sliding window concept is illustrated in the following figure. (The records shown in the figure are for clarity only. TCP processes data as a stream of bytes and does not keep track of record boundaries, which are application-defined.)

Figure 25. TCP Sliding Window. This illustration depicts the TCP Sliding Window. A full description is in the text immediately following the figure.

In this figure, the sending application is sleeping because it has attempted to write data that would cause TCP to exceed the send socket buffer space (that is, tcp_sendspace). The sending TCP still has the last part of rec5, all of rec6 and rec7, and the beginning of rec8. The receiving TCP has not yet received the last part of rec7 or any of rec8. The receiving application got rec4 and the beginning of rec5 when it last read the socket, and it is now processing that data. When the receiving application next reads the socket, it will receive (assuming a large enough read) the rest of rec5, all of rec6, and as much of rec7 and rec8 as has arrived by that time.

After the next read, the following occur:

The receiving TCP will be able to acknowledge that data
The sending TCP will be able to discard the data
The pending write will complete

The sending application will wake up

To avoid excessive LAN traffic when the application is reading in tiny amounts, TCP delays acknowledgment until the receiving application has read a total amount of data that is at least half the receive window size or twice the maximum segment size.

If there is no data to send back, the receiver will delay up to 200 ms and then send the ACK. The delay time can be tuned by a new no parameter called fasttimeo. The default value is 200 ms, and the range of values can be between 50 ms and 200 ms. Reducing this value may enhance performance of request/response type of applications.

Note

When using TCP to exchange request/response messages, the application must use the setsockopt() subroutine to turn on the TCP_NODELAY option. This causes TCP to send the message immediately (within the constraints of the sliding window), even though it is less than MTU-size. Otherwise, TCP would wait for up to 200 milliseconds for more data to send before transmitting the message. Starting with AIX 4.3.3 the tcp_nodelay parameter can be set with the ifconfig or chdev command to set TCP_NODELAY on TCP sockets (see Interface-Specific Network Options (ISNO)).
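
Turning on TCP_NODELAY for a single socket might look like the following sketch, written in Python, whose socket module wraps the setsockopt() subroutine. The helper name is illustrative only.

```python
import socket

def make_request_response_socket():
    """Create a TCP socket with TCP_NODELAY set, so small messages go out at once."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

s = make_request_response_socket()
nodelay_enabled = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
print("TCP_NODELAY enabled:", nodelay_enabled)
```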

In the course of establishing a session, the initiator and the listener converse to determine the receive space for each end point. The size defines the size of the receive window. As data is written to the socket, it is moved into the sender's buffer. When the receiver indicates that it has space available, the sender transmits enough data to fill that space (assuming that it contains that much data). When the receiving application reads from the socket, the receiving socket returns as much data as it has in its receive socket buffer. TCP then informs the sender that the data has been successfully delivered by sending a packet to advance the receiver window. Only then does the sending TCP discard the data from its own buffer, effectively moving the window to the right by the amount of data delivered. If the window is full because the receiving application has fallen behind, the sending thread will be blocked (or receive a specific errno) when it tries to write to the socket.
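
The interaction between the two buffers can be made concrete with a toy model. The class below is a simplified sketch, not a description of the real TCP implementation: it ignores network delay, loss, and delayed acknowledgments, and it assumes data is acknowledged as soon as the receiving application reads it. The attribute names mirror tcp_sendspace and tcp_recvspace but are otherwise illustrative.

```python
class SlidingWindowModel:
    """Toy model of a sender buffer and a receive window, counted in bytes."""

    def __init__(self, send_space=16384, recv_space=16384):
        self.send_space = send_space   # plays the role of tcp_sendspace
        self.recv_space = recv_space   # plays the role of tcp_recvspace
        self.unsent = 0                # written by the application, not yet transmitted
        self.unacked = 0               # transmitted, still held until acknowledged
        self.recv_buffer = 0           # received, not yet read by the application

    def write(self, nbytes):
        """Accept what fits in the send buffer; a blocking writer sleeps for the rest."""
        free = self.send_space - (self.unsent + self.unacked)
        accepted = min(nbytes, free)
        self.unsent += accepted
        return accepted

    def transmit(self):
        """Send as much as the receiver's advertised window allows."""
        window = self.recv_space - self.recv_buffer
        sent = min(self.unsent, window)
        self.unsent -= sent
        self.unacked += sent
        self.recv_buffer += sent
        return sent

    def read(self, nbytes):
        """The receiving application reads; the sender may then discard that data."""
        delivered = min(nbytes, self.recv_buffer)
        self.recv_buffer -= delivered
        self.unacked -= delivered
        return delivered

m = SlidingWindowModel()
print(m.write(20000), m.transmit(), m.write(100), m.read(4096), m.write(100))
```

In this run the second write is refused outright because the send buffer is full of unacknowledged data, and it succeeds only after the receiver has read 4096 bytes.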

The values of tcp_recvspace and tcp_sendspace are independent. The tcp_sendspace value controls the buffering in the kernel of the sender. The tcp_recvspace value controls the receiver space and translates into TCP's receive window.

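If the rfc1323 parameter is 1, TCP window scaling is enabled and the maximum TCP window size rises from 64 KB to about 1 GB (a 16-bit window field shifted left by at most 14 bits, as defined in RFC 1323).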

Data size: Indefinite. TCP does not process records or blocks; it processes a stream of bytes. If a send buffer is larger than the receiver can handle, it is segmented into MTU-size packets. Because it handles shortages of buffer space under the covers, TCP does not guarantee that the number and size of data receives will be the same as the number and size of sends. It is the responsibility of the two sides of the application to identify record or block boundaries, if any, within the stream of data.
MTU handling: Handled by segmentation in TCP. When the connection is established, the initiator and the listener negotiate a maximum segment size (MSS) to be used. The MSS is typically smaller than the MTU (see Tuning TCP Maximum Segment Size). If the output packet size exceeds the MSS, TCP does the segmentation, thus making fragmentation in IP unnecessary. The receiving TCP typically puts the segments on the socket receive queue as they arrive. If the receiving TCP detects the loss of a segment, it withholds acknowledgment and holds back the succeeding segments until the missing segment has been received successfully.
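
The segmentation step can be illustrated with a little arithmetic. The sketch assumes IPv4 and TCP headers of 20 bytes each with no options, which gives an MSS of 1460 bytes on a 1500-byte MTU; the MSS actually negotiated on a connection may differ.

```python
def mss_for_mtu(mtu, ip_header=20, tcp_header=20):
    """MSS implied by an MTU, assuming option-free IPv4 and TCP headers."""
    return mtu - ip_header - tcp_header

def segment_sizes(nbytes, mss):
    """Sizes of the segments TCP would cut an nbytes write into."""
    full, rest = divmod(nbytes, mss)
    return [mss] * full + ([rest] if rest else [])

mss = mss_for_mtu(1500)
print(mss, segment_sizes(4000, mss))  # 1460 [1460, 1460, 1080]
```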

The additional operations performed by TCP to ensure a reliable connection result in about 5 to 10 percent higher processor cost than in UDP.

Send Flow

When the TCP layer receives a write request from the socket layer, it allocates a new mbuf for its header information and copies the data in the socket-send buffer either into the TCP-header mbuf, if there is room, or into a newly allocated mbuf chain. If the data being copied is in clusters, the data is not actually copied into new clusters. Instead, a pointer field in the new mbuf header (this header is part of the mbuf structure and is unrelated to the TCP header) is set to point to the clusters containing the data, thereby avoiding the overhead of one or more 4 KB copies. TCP then checksums the data (unless it is offloaded by certain PCI adapters), updates its various state variables, which are used for flow control and other services, and finally calls the IP layer with the header mbuf now linked to the new mbuf chain.

Receive Flow

When the TCP input routine receives input data from IP, it does the following:

Checksums the TCP header and data for corruption detection (unless this is offloaded by certain PCI adapters)
Determines which connection this data is for
Removes its header information
Links the mbuf chain onto the socket-receive buffer associated with this connection
Uses a socket service to wake up the application (if it is sleeping as described earlier)

IP Layer

The Internet Protocol provides a basic datagram service to the higher layers. If it is given a packet larger than the MTU of the interface, it fragments the packet and sends the fragments to the receiving system, which reassembles them into the original packet. If one of the fragments is lost in transmission, the incomplete packet is ultimately discarded by the receiver. MTU path discovery can be enabled as described in Tuning TCP Maximum Segment Size.
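
The number of fragments IP produces can be estimated with the following sketch. It assumes a 20-byte IPv4 header without options and applies the IPv4 rule that every fragment except the last carries a multiple of 8 bytes. The function name is illustrative.

```python
def fragment_payload_sizes(ip_payload_bytes, mtu, ip_header=20):
    """Payload carried by each IPv4 fragment of a datagram sent over the given MTU."""
    per_fragment = (mtu - ip_header) // 8 * 8   # non-final fragments: multiple of 8 bytes
    full, rest = divmod(ip_payload_bytes, per_fragment)
    return [per_fragment] * full + ([rest] if rest else [])

# An 8192-byte UDP write plus the 8-byte UDP header, sent over standard Ethernet.
sizes = fragment_payload_sizes(8192 + 8, 1500)
print(len(sizes), sizes)
```

Here the datagram becomes six fragments, and the loss of any one of them causes the whole datagram to be discarded by the receiver.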

The length of time IP waits for a missing fragment is controlled by the ipfragttl parameter, which is set and displayed with the no command.

Following are some default MTU values and value ranges for different network types:

Network Type	Default MTU (bytes)	MTU Range (bytes)
X.25	576	60 - 2058
SLIP	1006	60 - 4096
Standard Ethernet	1500	60 - 1500
IEEE 802.3 Ethernet	1492	60 - 1492
Gigabit Ethernet	9000 (Jumbo Frames)	N/A
Token-Ring 4 Mbps	1492	60 - 4096
Token-Ring 16 Mbps	1492	60 - 17800
FDDI	4352	1 - 4352
SLA (socc)	61428	1 - 61428
ATM	9180	1 - 65527
HIPPI	65536	60 - 65536
SP Switch	65520	1 - 65520

Note

In general, you can increase the transmit and receive queues. This requires some memory, but avoids some problems. See Adapter Transmit and Receive Queue Tuning.

Send Flow

When the IP output routine receives a packet from UDP or TCP, it identifies the interface to which the mbuf chain should be sent, updates and checksums the IP part of the header, and passes the packet to the interface (IF) layer.

IP determines the proper device driver and adapter to use, based on the network number. The driver interface table defines the maximum MTU for this network. If the datagram is less than the MTU size, IP adds the IP header in the existing mbuf, checksums the IP header, and calls the driver to send the frame. If the driver send queue is full, an EAGAIN error is returned to IP, which returns it to UDP, which returns it to the sending application. The sender should delay and try again.

If the datagram is larger than the MTU size (which only occurs in UDP), IP fragments the datagram into MTU-size fragments, appends an IP header (in an mbuf) to each, and calls the driver once for each fragment frame. If the driver's send queue is full, an EAGAIN error is returned to IP. IP discards all remaining unsent fragments associated with this datagram and returns EAGAIN to UDP. UDP returns EAGAIN to the sending application. Since IP and UDP do not queue messages, it is up to the application to delay and try the send again.
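
The delay-and-retry behavior expected of the application can be sketched as follows. The send argument is any callable that sends a payload, such as a bound socket's send method; BusyThenReadySender is a hypothetical stand-in that imitates a full transmit queue, and the attempt count and delays are arbitrary.

```python
import errno
import time

def send_with_retry(send, payload, attempts=5, delay=0.01):
    """Call send(payload); on EAGAIN, wait a little longer each time and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return send(payload), attempt
        except OSError as exc:
            retryable = exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
            if not retryable or attempt == attempts:
                raise
            time.sleep(delay * attempt)   # simple linear backoff

class BusyThenReadySender:
    """Hypothetical stand-in for a send call whose driver queue is full at first."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self, payload):
        if self.failures > 0:
            self.failures -= 1
            raise OSError(errno.EAGAIN, "transmit queue full")
        return len(payload)

print(send_with_retry(BusyThenReadySender(2), b"data"))  # (4, 3)
```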

Receive Flow

In AIX Version 4, in general, interfaces do not perform queuing and directly call the IP input queue routine to process the packet; the loopback interface will still perform queuing. In the case of queuing, the demux layer places incoming packets on this queue. If the queue is full, packets are dropped and never reach the application. If packets are dropped at the IP layer, a statistic called ipintrq overflows in the output of the netstat -s command is incremented. If this statistic increases in value, then use the no command to tune the ipqmaxlen tunable.

In AIX Version 4, the demux layer (formerly called the IF layer) calls IP on the interrupt thread. IP checks the IP header checksum to make sure the header was not corrupted and determines if the packet is for this system. If so, and the frame is not a fragment, IP passes the mbuf chain to the TCP or UDP input routine.

If the received frame is a fragment of a larger datagram (which only occurs in UDP), IP retains the frame. When the other fragments arrive, they are merged into a logical datagram and given to UDP when the datagram is complete. IP holds the fragments of an incomplete datagram until the ipfragttl time (as specified by the no command) expires. The default ipfragttl time is 30 seconds (an ipfragttl value of 60, because the value is counted in half-second units). If any fragments are lost due to problems such as network errors, lack of mbufs, or transmit queue overruns, IP never receives them. When ipfragttl expires, IP discards the fragments it did receive. These discards are reported by the netstat -s command under ip: as fragments dropped after timeout.

Demux Layer

The interface layer (IF) is used on output and is the same level as the demux layer (used for input) in AIX Version 4. It places transmit requests on to a transmit queue, where the requests are then serviced by the network interface device driver. The size of the transmit queue is tunable, as described in Adapter Transmit and Receive Queue Tuning.

Send Flow

When the demux layer receives a packet from IP, it attaches the link-layer header information to the beginning of the packet, checks the format of the mbufs to make sure they conform to the device driver's input specifications, and then calls the device driver write routine.

The address resolution protocol (ARP) is also handled in this layer. ARP translates a 32-bit Internet Protocol (IP) address into a 48-bit hardware address.

Receive Flow

In AIX Version 4, when the demux layer receives a packet from the device driver, it calls IP on the interrupt thread to perform IP input processing.

If the dog threads are enabled (see Enabling Thread Usage on LAN Adapters (dog threads)), the incoming packet will be queued to the thread and the thread will handle calling IP, TCP, and the socket code.

LAN Adapters and Device Drivers

The operating system environment supports many different kinds of LAN adapters, so you can choose from a wide variety of network interfaces. As the following table shows, the speed of these networks varies widely, and performance varies with it.

Name	Speed
Ethernet (en)	10 Mbit/sec - Gigabits/sec
IEEE 802.3 (et)	10 Mbit/sec - Gigabits/sec
Token-Ring (tr)	4 or 16 Mbit/sec
X.25 protocol (xt)	64 Kb/sec
Serial Line Internet Protocol, SLIP (sl)	64 Kb/sec
loopback (lo)	N/A
FDDI (fi)	100 Mbit/sec
SOCC (so)	220 Mbit/sec
ATM (at)	100s Mbit/sec (many Gb/sec)

Refer to the PCI Adapter Placement Reference and RS/6000 Systems Handbook for slot placement guidelines and limitations that may exist on the number of adapters that can be supported for connectivity and the number that can be supported for maximum performance.

Several PCI machines have secondary PCI buses bridged onto a primary PCI bus. Some medium- to high-speed adapters perform slower on these secondary bus slots and some adapters are not recommended to be used in these slots. Machines with some secondary PCI slots include E30, F40, and SP 332 MHz SMP-wide nodes.

The adapters differ, not only in the communications protocol and transmission medium they support, but also in their interface to the I/O bus and the processor. Similarly, the device drivers vary in the technique used to convey the data between memory and the adapter. The following description of send and receive flow applies to most adapters and device drivers, but details vary.

Send Flow

At the device-driver layer, the mbuf chain containing the packet is enqueued on the transmit queue. The maximum total number of output buffers that can be queued is controlled by the system parameter xmt_que_size. In some cases, the data is copied into driver-owned DMA buffers. The adapter is then signaled to start DMA operations.

At this point, control returns up the path to the TCP or UDP output routine, which continues sending as long as it has data to send. When all data has been sent, control returns to the application, which then runs asynchronously while the adapter transmits data. Depending on the device driver, when the adapter has completed transmission, it sends an interrupt to the system. When the interrupt is handled, the device-interrupt routines are called to adjust the transmit queues and free the mbufs that held the transmitted data.

Receive Flow

When frames are received by an adapter, they are transferred from the adapter into a driver-managed receive queue. The receive queue can consist of mbufs or the device driver can manage a separate pool of buffers for the device. In either case, the data is in an mbuf chain when it is passed from the device driver to the demux layer.

Some drivers receive frames through Direct Memory Access (DMA) into a pinned area of memory and then allocate mbufs and copy the data into them. Drivers and adapters that receive large-MTU frames may have the frames placed by DMA directly into cluster mbufs. The driver transfers the frame to the correct network protocol (IP in this example) by calling a demultiplexing function that identifies the packet type and puts the mbuf containing the buffer on the input queue for that network protocol. If no mbufs are available or if the higher-level input queue is full, the incoming frames are discarded.