Working with File I/O

AIX Version 4.3 General Programming Concepts: Writing and Debugging Programs

All input and output (I/O) operations use the current file offset information stored in the system file structure. The current I/O offset designates a byte offset that is constantly tracked for every open file. It is called the current I/O offset because it signals a read or write process where to begin operations in the file. The open subroutine resets it to 0. The pointer can be set or changed using the lseek subroutine.

To learn more about file I/O, see:

Manipulating the Current Offset

Reading a File

Writing a File

Working with Pipes

Manipulating the Current Offset

Read and write operations can access a file sequentially. This is because the current I/O offset of the file tracks the byte offset of each previous operation. The offset is stored in the system file table.

You can adjust the offset on files that can be randomly accessed, such as regular and special-type files. Specifically, you can use the lseek subroutine to allow a process to position the offset at a designated byte. The lseek subroutine positions the pointer at the byte designated by the Offset variable. The Offset value can be calculated from three places in the file (designated by the value of the Whence variable):

absolute offset	Beginning byte of the file
relative offset	Position of the former pointer
end_relative offset	End of the file

The return value for the lseek subroutine is the current value of the pointer's position in the file. For example:

cur_off= lseek(fd, 0, SEEK_CUR);

The lseek subroutine is implemented in the file table. All following read and write operations use the new position of the offset as their starting location.

Note: The offset cannot be changed on pipes or socket-type files.

The fclear subroutine creates an empty space in a file. It sets to zero the number of bytes designated in the NumberOfBytes variable beginning at the current offset. The fclear subroutine cannot be used if the O_DEFER flag was set at the time the file was opened.

Reading a File

The read subroutine copies a specified number of bytes from an open file to a specified buffer. The copy begins at the point indicated by the current offset. The number of bytes and buffer are specified by the NBytes and Buffer parameters.

The read subroutine:

Assures that the FileDescriptor parameter is valid and that the process has read permissions. The subroutine then gets the file table entry specified by the FileDescriptor parameter.
Sets a flag in the file to indicate a read operation is in progress. This locks other processes out of the file during the operation.
Converts the offset byte value and the value of the NBytes variable into a block address.
Transfers the contents of the identified block into a storage buffer.
Copies the contents of the storage buffer into the area designated by the Buffer variable.
Updates the current offset according to the number of bytes actually read. Resetting the offset assures that the data is read in sequence by the next read process.
Deducts the number of bytes read from the total specified in the NBytes variable.
Loops until the number of bytes to be read is satisfied.
Returns the total number of bytes read.

The cycle completes when: the file to be read is empty, the number of bytes requested is met, or a reading error is encountered during the process. Errors can occur while the file is being read from disk or in copying the data to the system file space.

It is advantageous for read requests to start at data block boundaries and to be multiples of the data block size, because this avoids an extra iteration in the read loop.

If a process reads blocks sequentially, the operating system assumes all subsequent reads will be sequential too.

During the read operation, the i-node is locked. No other processes are allowed to modify the contents of the file while a read is in progress. However, the file is unlocked immediately on completion of the read operation. If another process changes the file between two read operations, the resulting data is different, but the integrity of the data structure is maintained.

The following example illustrates how to use the read subroutine to count the number of null bytes in the foo file:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/param.h>
 
int main()
{
        int fd;
        int nbytes;
        int nnulls;
        int i;
        char buf[PAGESIZE];      /* A convenient buffer size */

        nnulls = 0;
        if ((fd = open("foo", O_RDONLY)) < 0)
                exit(1);         /* could not open foo */
        /* read returns 0 at end of file and a negative value on error */
        while ((nbytes = read(fd, buf, sizeof(buf))) > 0)
                for (i = 0; i < nbytes; i++)
                        if (buf[i] == '\0')
                                nnulls++;
        close(fd);
        printf("%d nulls found\n", nnulls);
        return 0;
}

Writing a File

The write subroutine adds the amount of data specified in the NBytes variable from the space designated by the Buffer variable to the file described by the FileDescriptor variable. It functions similarly to the read subroutine. The byte offset for the write operation is found in the system file table's current offset.

Sometimes when you write to a file, the file does not contain a block corresponding to the byte offset resulting from the write process. When this happens, the write subroutine allocates a new block. This new block is added to the i-node information that defines the file. If adding the new block produces an indirect block position (i_rindirect), the subroutine allocates more than one block when a file moves from direct to indirect geometry.

During the write operation, the i-node is locked. No other processes are allowed to modify the contents of the file while a write is in progress. However, the file is unlocked immediately on completion of the write operation. If another process changes the file between two write operations, the resulting data is different, but the integrity of the data structure is maintained.

The write subroutine loops in a way similar to the read subroutine, logically writing one block to disk for each iteration. At each iteration, the process either writes an entire block or only a portion of one. If only a portion of a data block is required to accomplish an operation, the write subroutine reads the block from disk to avoid overwriting existing information. If an entire block is required, it does not read the block because the entire block is overwritten. The write operation proceeds block by block until the number of bytes designated in the NBytes parameter is written.
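
The block-by-block loop described above runs inside the write subroutine. From the caller's side, a single write call can still return after transferring fewer bytes than requested (for example, on a pipe or after a signal), so programs often wrap it in a loop of their own. The following is a minimal sketch of such a wrapper; the name write_all is hypothetical and is not part of the system library.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling write until all n bytes of buf have
 * been written. Returns 0 on success, -1 on error (errno is set by write). */
int write_all(int fd, const void *buf, size_t n)
{
    const char *p = buf;

    assert(buf != NULL || n == 0);
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any data moved; retry */
            return -1;
        }
        p += w;                    /* skip the bytes already written */
        n -= (size_t)w;
    }
    return 0;
}
```

A call such as write_all(fd, record, sizeof(record)) then either writes the whole record or reports an error.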

Delayed Write

You can designate a delayed write process with the O_DEFER flag. Then, the data is transferred to disk as a temporary file. The delayed write feature caches the data in case another process reads or writes the data sooner. Delayed write saves extra disk operations. Many programs, such as mail and editors, create temporary files in the /tmp directory and quickly remove them.

When a file is opened with the deferred update (O_DEFER) flag, the data is not written to permanent storage until a process issues an fsync subroutine call or a process issues a synchronous write to the file (opened with O_SYNC flag). The fsync subroutine saves all changes in an open file to disk. See the open subroutine for a description of the O_DEFER and O_SYNC flags.
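
As a rough sketch of this sequence, the following helper opens a file, writes to it, and then calls fsync to force the data to permanent storage. The helper name save_with_fsync is hypothetical. Because O_DEFER is specific to AIX, the sketch requests it only when the header defines it; on other systems it degrades to an ordinary write followed by fsync.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: writes msg to path and forces it to disk with fsync.
 * Returns 0 on success, -1 on failure. */
int save_with_fsync(const char *path, const char *msg)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    size_t len;
    int fd;
    int rc = -1;

    assert(path != NULL && msg != NULL);
#ifdef O_DEFER
    flags |= O_DEFER;              /* defer updates until fsync is called */
#endif
    fd = open(path, flags, 0600);
    if (fd < 0)
        return -1;

    len = strlen(msg);
    if (write(fd, msg, len) == (ssize_t)len && fsync(fd) == 0)
        rc = 0;                    /* changes have been saved to disk */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```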

Truncating Files

The truncate and ftruncate subroutines change the length of regular files. The truncating process must have write permission to the file. The Length variable value indicates the size of the file after the truncation operation is complete. All measures are relative to the first byte of the file, not the current offset. If the new length (designated in the Length variable) is less than the previous length, the data between the two is removed. If the new length is greater than the existing length, zeros are added to extend the file size. When truncation is complete, full blocks are returned to the file system, and the file size is updated.

Writing Programs to Use Direct I/O

Beginning in AIX 4.3, an application will be able to use Direct I/O on JFS files. This article is intended to assist programmers in understanding the intricacies involved with writing programs to take advantage of this feature.

Direct I/O vs. Normal Cached I/O

Normally, the JFS caches file pages in kernel memory. When the application does a file read request, if the file page is not in memory, the JFS reads the data from the disk into the file cache, then copies the data from the file cache to the user's buffer. For application writes, the data is merely copied from the user's buffer into the cache. The actual writes to disk are done later.

This type of caching policy can be extremely effective when the cache hit rate is high. It also enables read-ahead and write-behind policies. Lastly, it makes file writes asynchronous, allowing the application to continue processing instead of waiting for I/O requests to complete.

Direct I/O is an alternative caching policy that causes the file data to be transferred directly between the disk and the user's buffer. Direct I/O for files is functionally equivalent to raw I/O for devices.

Benefits of Direct I/O

The primary benefit of direct I/O is to reduce CPU utilization for file reads and writes by eliminating the copy from the cache to the user buffer. This can also be a benefit for file data which has a very poor cache hit rate. If the cache hit rate is low, then most read requests have to go to the disk. Direct I/O can also benefit applications which must use synchronous writes since these writes have to go to disk. In both of these cases, CPU usage is reduced since the data copy is eliminated.

A second benefit of Direct I/O is that it allows applications to avoid diluting the effectiveness of caching of other files. Any time a file is read or written, that file competes for space in the cache. This may cause other file data to be pushed out of the cache. If the newly cached data has very poor reuse characteristics, the effectiveness of the cache can be reduced. Direct I/O gives applications the ability to identify files where the normal caching policies are ineffective, thus freeing up more cache space for files where the policies are effective.

Performance Costs of Direct I/O

Although Direct I/O can reduce CPU usage, it typically results in longer wall clock times, especially for relatively small requests. This penalty is caused by the fundamental differences between normal cached I/O and Direct I/O.

Direct I/O Reads

Every Direct I/O read causes a synchronous read from disk, unlike the normal cached I/O policy, where a read may be satisfied from the cache. This can result in very poor performance if the data was likely to be in memory under the normal caching policy.

Direct I/O also bypasses the normal JFS read-ahead algorithms. These algorithms can be extremely effective for sequential access to files by issuing larger and larger read requests and by overlapping reads of future blocks with application processing.

Applications can compensate for the loss of JFS read-ahead by issuing larger read requests. At a minimum, Direct I/O readers should issue read requests of at least 128k to match the JFS read-ahead characteristics.

Applications can also simulate JFS read-ahead by issuing asynchronous Direct I/O read-ahead either by use of multiple threads or by using aio_read.

Direct I/O Writes

Every Direct I/O write causes a synchronous write to disk, unlike the normal cached I/O policy, where the data is merely copied and then written to disk later. This fundamental difference can cause a significant performance penalty for applications which are converted to use Direct I/O.

Conflicting File Access Modes

In order to avoid consistency issues between programs which use Direct I/O and programs which use normal cached I/O, Direct I/O is an exclusive use mode. If there are multiple opens of a file and some of them are direct and others are not, the file will stay in its normal cached access mode. Only when the file is open exclusively by Direct I/O programs will the file be placed in Direct I/O mode.

Similarly, if the file is mapped into virtual memory via the shmat or mmap system calls, then the file will stay in normal cached mode.

The JFS will attempt to move the file into Direct I/O mode any time the last conflicting, non-direct access is eliminated (either by close, munmap, or shmdt). Changing the file from normal mode to Direct I/O mode can be rather expensive since it requires writing all modified pages to disk and removing all the file's pages from memory.

Enabling Applications to use Direct I/O

Applications enable Direct I/O access to a file by passing the O_DIRECT flag to the open subroutine. This flag is defined in fcntl.h. Applications must be compiled with _ALL_SOURCE enabled to see the definition of O_DIRECT.
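
A minimal sketch of such an open call follows. The helper name open_direct is hypothetical, and the fallback to a normal cached open when the Direct I/O open fails is a design choice made for this illustration, not documented system behavior. The #ifdef guard lets the same source compile on systems whose headers do not expose O_DIRECT.

```c
#ifndef _ALL_SOURCE
#define _ALL_SOURCE            /* needed on AIX to see O_DIRECT */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens path for reading, requesting Direct I/O where
 * O_DIRECT is defined. If that open fails, it retries without the flag.
 * Returns a file descriptor, or -1 if both attempts fail. */
int open_direct(const char *path)
{
    int fd = -1;

    assert(path != NULL);
#ifdef O_DIRECT
    fd = open(path, O_RDONLY | O_DIRECT);
#endif
    if (fd < 0)
        fd = open(path, O_RDONLY);     /* fall back to normal cached I/O */
    return fd;
}
```

The descriptor that comes back is used with read and close in the usual way; only the caching policy behind it differs.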

Offset/Length/Address Alignment Requirements of the Target Buffer

In order for Direct I/O to work efficiently, the request should be suitably conditioned. Applications can query the offset, length, and address alignment requirements by using the finfo and ffinfo subroutines. When the FI_DIOCAP command is used, finfo and ffinfo return information in the diocapbuf structure as described in sys/finfo.h. This structure contains the following fields:

dio_offset	This field contains the recommended offset alignment for Direct I/O writes to files in this file system.
dio_max	This field contains the recommended maximum write length for Direct I/O writes to files in this file system.
dio_min	This field contains the recommended minimum write length for Direct I/O writes to files in this file system.
dio_align	This field contains the recommended buffer alignment for Direct I/O writes to files in this file system.

Failure to meet these requirements may cause file reads and writes to use the normal cached model. Different file systems may have different requirements.

FS Format	dio_offset	dio_max	dio_min	dio_align
fixed, 4k blk	4k	2m	4k	4k
fragmented	4k	2m	4k	4k
compressed	n/a	n/a	n/a	n/a
big file	128k	2m	128k	4k
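
An application might check a request against these values before issuing it. The sketch below is only an illustration: struct dio_limits is a hypothetical stand-in for the values reported in diocapbuf (the real structure is declared in sys/finfo.h and is not reproduced here), and the check assumes that "alignment" means an exact multiple and that the length only has to fall between dio_min and dio_max.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the values reported in the diocapbuf structure. */
struct dio_limits {
    size_t dio_offset;   /* recommended offset alignment */
    size_t dio_max;      /* recommended maximum transfer length */
    size_t dio_min;      /* recommended minimum transfer length */
    size_t dio_align;    /* recommended buffer address alignment */
};

/* Returns 1 if the request meets all four recommendations, otherwise 0. */
int dio_request_ok(const struct dio_limits *lim, const void *buf,
                   size_t offset, size_t length)
{
    assert(lim != NULL && lim->dio_offset > 0 && lim->dio_align > 0);
    if (offset % lim->dio_offset != 0)
        return 0;                          /* file offset is misaligned */
    if (length < lim->dio_min || length > lim->dio_max)
        return 0;                          /* length is outside the range */
    if ((uintptr_t)buf % lim->dio_align != 0)
        return 0;                          /* buffer address is misaligned */
    return 1;
}
```

With the values from the "fixed, 4k blk" row (taking 4k as 4096 bytes and 2m as 2097152 bytes), a 4096-byte request at offset 4096 from a page-aligned buffer passes, while a 512-byte request or a request at offset 100 does not.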

Direct I/O Limitations

Direct I/O is not supported for files in a compressed file system. Attempts to open these files with O_DIRECT will be ignored and the files will be accessed with the normal cached I/O methods.

Direct I/O and Data I/O Integrity Completion

Although Direct I/O writes are done synchronously, they do not provide synchronized I/O data integrity completion, as defined by POSIX. Applications which need this feature should use O_DSYNC in addition to O_DIRECT. O_DSYNC guarantees that all of the data and enough of the meta-data (for example, indirect blocks) have been written to the stable store to be able to retrieve the data after a system crash. O_DIRECT only writes the data; it does not write the meta-data.

Working with Pipes

Pipes are unnamed objects created to allow two processes to communicate. One process reads and the other process writes to the pipe file. This unique type of file is also called a first-in-first-out (FIFO) file. The data blocks of the FIFO are manipulated in a circular queue, maintaining read and write pointers internally to preserve the FIFO order of data. The PIPE_BUF system variable, defined in the limits.h file, designates the maximum number of bytes guaranteed to be atomic when written to a pipe.

The shell uses unnamed pipes to implement command pipelining. Most unnamed pipes are created by the shell. The | (vertical bar) symbol represents a pipe between processes. For example:

ls | pr

In this example, the output of the ls command is piped to the pr command, which formats it and prints it to the screen.

Pipes are treated as regular files as far as possible. Normally, the current offset information is stored in the system file table. However, because pipes are shared by processes, the read/write pointers must be specific to the file, not to the process. File table entries are created by the open subroutine and are unique to the open process, not to the file. Processes with access to pipes share the access through common system file table entries.

Using Pipe Subroutines

The pipe subroutine creates an interprocess channel and returns two file descriptors in a two-element array. The descriptor in element 0 is opened for reading. The descriptor in element 1 is opened for writing. The read operation accesses the data on a FIFO basis. These two file descriptors are used with the read, write, and close subroutines.

In the following example, a child process is created and sends its process ID back through a pipe:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main()
{
        int p[2];
        char buf[80];
        pid_t pid;
 
        if (pipe(p))
        {
                perror("pipe failed");
                exit(1);
        }
        if ((pid = fork()) < 0)
        {
                perror("fork failed");
                exit(1);
        }
        if (pid == 0)
        {
                                       /* in child process */
                close(p[0]);           /* close unused read */
                                       /* side of the pipe */
                sprintf(buf, "%d", (int)getpid());
                                       /* construct data */
                                       /* to send */
                write(p[1], buf, strlen(buf) + 1);
                                       /* write it out, including */
                                       /* null byte */
                exit(0);
        }
                                       /* in parent process */
        close(p[1]);                   /* close unused write */
                                       /* side of the pipe */
        read(p[0], buf, sizeof(buf));  /* read the pipe */
        printf("Child process said: %s\n", buf);
                                       /* display the result */
        exit(0);
}

If a process reads an empty pipe, the process waits until data arrives. If a process writes to a pipe which is too full (PIPE_BUF), the process waits until space is available. If the write side of the pipe is closed, a subsequent read operation to the pipe returns an end-of-file.
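
The end-of-file behavior can be observed within a single process. In the sketch below, the helper name drain_pipe is hypothetical, and the message is assumed to be small enough to fit in the pipe so that the write does not wait.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes of msg into a new pipe, closes the
 * write side, and counts the bytes read until read reports end-of-file.
 * Returns the byte count, or -1 on failure. */
long drain_pipe(const char *msg, size_t len)
{
    int p[2];
    char buf[64];
    long total = 0;
    ssize_t n;

    assert(msg != NULL);
    if (pipe(p) != 0)
        return -1;
    if (write(p[1], msg, len) != (ssize_t)len) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    close(p[1]);                       /* no writers remain */
    while ((n = read(p[0], buf, sizeof(buf))) > 0)
        total += n;                    /* buffered data is still delivered */
    close(p[0]);
    return n == 0 ? total : -1;        /* n == 0 is the end-of-file signal */
}
```

Data already in the pipe is delivered first; read returns 0 only after the pipe is empty and every write descriptor has been closed.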

Two other subroutines that control pipes are the popen and pclose subroutines. The popen subroutine creates the pipe (using the pipe subroutine) then forks to create a copy of the caller. The child process decides whether it is supposed to read or write, closes the other side of the pipe, then calls the shell (using the execl subroutine) to run the desired process. The parent closes the end of the pipe it did not use. These closes are necessary to make end-of-file tests work properly. For example, if a child process intended to read the pipe does not close the write end of the pipe, it will never see the end of file condition on the pipe, because there is one write process potentially active.

The conventional way to associate the pipe descriptor with the standard input of a process is:

close(p[1]);
close(0);
dup(p[0]);
close(p[0]);

The close subroutine disconnects file descriptor 0, the standard input. The dup subroutine returns a duplicate of an already open file descriptor. File descriptors are assigned in increasing order and the first available one is returned. The effect of the dup subroutine is to copy the file descriptor for the pipe (read side) to file descriptor 0, thus standard input becomes the read side of the pipe. Finally, the previous read side is closed. The procedure is similar for a child process that writes to, rather than reads from, its parent.
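
The sequence above can be placed in context with a small sketch in which a child process reads its standard input from a pipe fed by its parent. The helper name child_reads_stdin and the use of the exit status to report the byte count are assumptions made for this illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks a child whose standard input is the read side
 * of a pipe, sends msg to it, and returns the number of bytes the child read
 * from descriptor 0 (reported through its exit status), or -1 on failure.
 * msg is assumed to be shorter than 64 bytes. */
int child_reads_stdin(const char *msg)
{
    int p[2];
    int status;
    int ok;
    size_t len;
    pid_t pid;

    assert(msg != NULL && strlen(msg) < 64);
    if (pipe(p) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {                    /* in child process */
        char buf[64];
        ssize_t n;
        ssize_t total = 0;

        close(p[1]);                   /* child does not write */
        close(0);                      /* free descriptor 0 */
        if (dup(p[0]) != 0)            /* lowest free descriptor is 0 */
            _exit(255);
        close(p[0]);                   /* original read side no longer needed */
        while ((n = read(0, buf, sizeof(buf))) > 0)
            total += n;
        _exit((int)total);
    }
    close(p[0]);                       /* parent does not read */
    len = strlen(msg);
    ok = (write(p[1], msg, len) == (ssize_t)len);
    close(p[1]);                       /* child now sees end-of-file */
    if (waitpid(pid, &status, 0) != pid || !ok || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Both sides close the pipe descriptors they do not use; without the parent's final close of the write side, the child's read loop would never see end-of-file.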

The pclose subroutine closes a pipe between the calling program and a shell command to be executed. Use the pclose subroutine to close any stream opened with the popen subroutine. The pclose subroutine waits for the associated process to end, then closes the stream and returns the exit status of the command. This subroutine is preferable to the close subroutine because pclose waits for child processes to finish before closing the pipe. Equally important, when a process creates several children, only a bounded number of unfinished child processes can exist, even if some of them have completed their tasks; performing the wait allows child processes to complete their tasks.

Synchronous I/O

By default, writes to files in JFS file systems are asynchronous. However, JFS file systems support three types of synchronous I/O. One type is specified by the O_DSYNC open flag. When a file is opened using the O_DSYNC open mode, the write system call will not return until the file data and all file system meta-data required to retrieve the file data are both written to their permanent storage locations.

Another type of synchronous I/O is specified by the O_SYNC open flag. In addition to items specified by O_DSYNC, O_SYNC specifies that the write system call will not return until all file attributes relative to the I/O are written to their permanent storage locations, even if the attributes are not required to retrieve the file data.

Before the O_DSYNC open mode existed, AIX applied O_DSYNC semantics to O_SYNC. For binary compatibility reasons, this behavior can never change. If true O_SYNC behavior is required, then both O_DSYNC and O_SYNC open flags must be specified. Exporting the XPG_SUS_ENV=ON environment variable also enables true O_SYNC behavior.

The last type of synchronous I/O is specified by the O_RSYNC open flag, and it simply applies the behaviors associated with O_SYNC or O_DSYNC to reads. For files in JFS file systems, only the combination of O_RSYNC | O_SYNC has meaning. It means that the read system call will not return until the file's access time is written to its permanent storage location.

Related Information

Files, Directories, and File Systems for Programmers provides an overview and orientation to the topic.

Working with JFS i-nodes describes the internal representation of files and lists the contents of disk i-nodes and in-core (main memory) i-nodes.

File Space Allocation introduces the i-node's indirect block method of expanding files.

Using File Descriptors explains the creation and use of file descriptors.

The ls command, the pr command.

The close subroutine, exec, execl, execv, execle, execve, execlp, execvp or exect subroutine, fclear subroutine, fsync subroutine, lseek subroutine, open, openx, or creat subroutine, read, readx, readv, or readvx subroutine, truncate or ftruncate subroutines, write, writex, writev, or writevx subroutine.