
General Programming Concepts: Writing and Debugging Programs


Writing Programs That Access Large Files

Beginning in AIX 4.2, the operating system allows files that are larger than 2 gigabytes (2GB). This article is intended to assist programmers in understanding the implications of "large" files on their applications and to assist them in modifying their applications. A new set of programming interfaces is defined, so that application programs can be modified to be aware of large files.

The file system programming interfaces generally revolve around the off_t data type. In AIX 4.1, the off_t data type was defined as a signed 32-bit integer. As a result, the maximum file size that these interfaces would allow was 2 gigabytes minus 1.

Implications for Existing Programs

The 32-bit application environment that all applications used in prior releases remains unchanged. Existing application programs will execute exactly as they did before. However, existing application programs will not be able to deal with large files.

For example, the st_size field in the stat structure, which is used to return file sizes, is a signed, 32-bit long. Therefore, that stat structure cannot be used to return file sizes that are larger than LONG_MAX. If an application attempts to stat a file that is larger than LONG_MAX, the stat subroutine will fail, and errno will be set to EOVERFLOW, indicating that the file size overflows the size field of the structure being used by the program.

This behavior is significant because existing programs that do not appear to be affected by large files will still fail when they encounter one, even if they have no interest in the file size.

The errno EOVERFLOW can also be returned by lseek and by fcntl if the values that need to be returned are larger than the data type or structure that the program is using. For lseek, if the resulting offset is larger than LONG_MAX, lseek will fail and errno will be set to EOVERFLOW. For fcntl, if the caller uses F_GETLK and the blocking lock's starting offset or length is larger than LONG_MAX, the fcntl call will fail, and errno will be set to EOVERFLOW.
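As an illustration of the stat case described above, the following sketch shows how a program might tell "file is too large for this program" apart from other failures. The helper name get_file_size() is hypothetical and is not part of any AIX interface.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper.  Returns 0 and stores the size on success,
 * 1 if the file size does not fit in this program's struct stat
 * (stat failed with EOVERFLOW), and -1 for any other error, in
 * which case errno is left as stat set it.
 */
int
get_file_size(const char *path, off_t *size)
{
        struct stat s;

        if (stat(path, &s) == 0) {
                *size = s.st_size;
                return 0;
        }
        if (errno == EOVERFLOW)
                return 1;       /* the file exists but is a large file */
        return -1;
}
```

A caller that only needs to know whether the file exists can treat a return value of 1 as "exists" instead of reporting a failure.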

Open Protection

Many of the existing application programs were written under the assumption that a file size could never be larger than could be represented in a signed, 32-bit long. These programs could have unexpected behavior, including data corruption, if allowed to operate on large files. Beginning in AIX 4.2, the operating system implements an open-protection scheme to protect applications from this class of failure.

When an application that has not been enabled for large-file access attempts to open a file that is larger than LONG_MAX, the open subroutine will fail and errno will be set to EOVERFLOW. Application programs that have not been enabled will be unable to access a large file, and the possibility of inadvertent data corruption is avoided. Applications that need to be able to open large files must be ported to the large-file environment described in "Porting Applications to the Large File Environment".

In addition to open protection, a number of other subroutines offer protection by providing an execution environment that is identical to the environment under which these programs were developed. If an application uses the write family of subroutines and the write request crosses the 2 gigabyte boundary, the write subroutines will transfer data only up to 2 gigabytes minus 1. If the application attempts to write at or beyond the 2GB-1 boundary, the write subroutines will fail and set errno to EFBIG. The behavior of mmap, ftruncate, and fclear is similar.

The read family of subroutines also participates in the open protection scheme. If an application attempts to read a file across the 2 gigabyte threshold, only the data up to 2 gigabytes minus 1 will be read. Reads at or beyond the 2GB-1 boundary will fail, and errno will be set to EOVERFLOW.

Open protection is implemented by a flag associated with an open file description. The current state of the flag can be queried with the fcntl subroutine using the F_GETFL command. The flag can be modified with the fcntl subroutine using the F_SETFL command.
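The text above does not name the flag. The sketch below assumes it is O_LARGEFILE, the name that <fcntl.h> uses on several UNIX systems; treat both that name and the helper is_large_file_enabled() as assumptions, not as documented behavior. On a system whose headers do not define O_LARGEFILE, the helper reports that the state is unknown.

```c
#include <fcntl.h>

/*
 * Hypothetical helper.  Returns 1 if the descriptor is enabled for
 * large-file access, 0 if it is not, and -1 if the state cannot be
 * determined (bad descriptor, or no O_LARGEFILE on this system).
 */
int
is_large_file_enabled(int fd)
{
#ifdef O_LARGEFILE
        int flags = fcntl(fd, F_GETFL);     /* query the file status flags */

        if (flags == -1)
                return -1;
        return (flags & O_LARGEFILE) ? 1 : 0;
#else
        (void)fd;                           /* assumed flag name is absent */
        return -1;
#endif
}
```

A program that is about to pass a descriptor to another program through exec could call such a helper first and decide whether the receiving program can be trusted with the file.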

Since open file descriptions are inherited across the exec family of subroutines, application programs that pass file descriptors that are enabled for large-file access to other programs should consider whether the receiving program can safely access the large file.

Porting Applications to the Large File Environment

Beginning in AIX 4.2, the operating system provides two different ways for applications to be enabled for large-file access. Application programmers must decide which approach best suits their needs. The first approach is to define _LARGE_FILES, which carefully redefines all of the relevant data types, structures, and subroutine names to their large-file enabled counterparts. The second approach is to recode the application to call the large-file enabled subroutines explicitly.

Defining _LARGE_FILES has the advantage of maximizing application portability to other platforms since the application is still written to the normal POSIX and XPG interfaces. It has the disadvantage of creating some ambiguity in the code since the size of the various data items is not obvious from looking at the code.

Recoding the application has the obvious disadvantages of requiring more effort and reducing application portability. It can be used when the redefinition effect of _LARGE_FILES would have a considerable negative impact on the program or when it is desirable to convert only a very small portion of the program.

It is very important to understand that in either case, the application program MUST be carefully audited to ensure correct behavior in the new environment. Some of the common programming pitfalls are discussed in "Common Pitfalls using the Large File Environment".

Using _LARGE_FILES

In the default compilation environment, the off_t data type is defined as a signed, 32-bit long. Beginning in AIX 4.2, if the application defines _LARGE_FILES before the inclusion of any header files, then the large-file programming environment is enabled, and off_t is defined to be a signed, 64-bit long long. In addition, all of the subroutines that deal with file sizes or file offsets are redefined to be their large-file enabled counterparts. Similarly, all of the data structures with embedded file sizes or offsets are redefined.

Assuming that the application is coded without any dependencies on off_t being a 32-bit quantity, the resulting binary should work properly in the new environment. In practice, application programs rarely require a porting effort this small.
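A minimal sketch of the ordering requirement: the definition of _LARGE_FILES comes first, ahead of every header. The helper off_t_bits() is a hypothetical name, used here only to make the effect observable. On systems other than AIX the macro may have no effect, so the helper can report either width.

```c
#define _LARGE_FILES            /* must precede every #include */

#include <limits.h>
#include <sys/types.h>

/* Hypothetical helper: width of off_t, in bits, as compiled. */
int
off_t_bits(void)
{
        return (int)(sizeof(off_t) * CHAR_BIT);
}
```

The same effect can usually be obtained by defining the macro on the compiler command line, which avoids having to edit every source file.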

The following table shows the redefinitions that occur in the _LARGE_FILES environment beginning in AIX 4.2.


Item             Redefined To Be    Header File
off_t            long long          <sys/types.h>
fpos_t           long long          <sys/types.h>
struct stat      struct stat64      <sys/stat.h>
stat()           stat64()           <sys/stat.h>
fstat()          fstat64()          <sys/stat.h>
lstat()          lstat64()          <sys/stat.h>
mmap()           mmap64()           <sys/mman.h>
lockf()          lockf64()          <sys/lockf.h>
struct flock     struct flock64     <sys/flock.h>
open()           open64()           <fcntl.h>
creat()          creat64()          <fcntl.h>
F_GETLK          F_GETLK64          <fcntl.h>
F_SETLK          F_SETLK64          <fcntl.h>
F_SETLKW         F_SETLKW64         <fcntl.h>
ftw()            ftw64()            <ftw.h>
nftw()           nftw64()           <ftw.h>
fseeko()         fseeko64()         <stdio.h>
ftello()         ftello64()         <stdio.h>
fgetpos()        fgetpos64()        <stdio.h>
fsetpos()        fsetpos64()        <stdio.h>
fopen()          fopen64()          <stdio.h>
freopen()        freopen64()        <stdio.h>
lseek()          lseek64()          <unistd.h>
ftruncate()      ftruncate64()      <unistd.h>
truncate()       truncate64()       <unistd.h>
fclear()         fclear64()         <unistd.h>
struct aiocb     struct aiocb64     <sys/aio.h>
aio_read()       aio_read64()       <sys/aio.h>
aio_write()      aio_write64()      <sys/aio.h>
aio_cancel()     aio_cancel64()     <sys/aio.h>
aio_suspend()    aio_suspend64()    <sys/aio.h>
aio_listio()     aio_listio64()     <sys/aio.h>
aio_return()     aio_return64()     <sys/aio.h>
aio_error()      aio_error64()      <sys/aio.h>

Using the 64-Bit File System Subroutines

Using the _LARGE_FILES environment may be impractical for some applications due to the far-reaching implications of changing the size of off_t to 64 bits. If the number of changes is small, it may be more practical to convert a relatively small part of the application to be large-file enabled. The 64-bit file system data types, structures, and subroutines are listed below:

<sys/types.h>
typedef long long off64_t;
typedef long long fpos64_t;
 
<fcntl.h>
 
extern int      open64(const char *, int, ...);
extern int      creat64(const char *, mode_t);
 
#define F_GETLK64
#define F_SETLK64
#define F_SETLKW64
 
<ftw.h>
extern int ftw64(const char *, int (*)(const char *,const struct stat64 *, int), int);
extern int nftw64(const char *, int (*)(const char *, const struct stat64 *, int,struct FTW *),int, int);
 
<stdio.h>
 
extern int      fgetpos64(FILE *, fpos64_t *);
extern FILE     *fopen64(const char *, const char *);
extern FILE     *freopen64(const char *, const char *, FILE *);
extern int      fseeko64(FILE *, off64_t, int);
extern int      fsetpos64(FILE *, fpos64_t *);
extern off64_t  ftello64(FILE *);
 
<unistd.h>
 
extern off64_t  lseek64(int, off64_t, int);
extern int      ftruncate64(int, off64_t);
extern int      truncate64(const char *, off64_t);
extern off64_t  fclear64(int, off64_t);
 
<sys/flock.h>
 
struct flock64;
 
<sys/lockf.h>
 
extern int lockf64 (int, int, off64_t);
 
<sys/mman.h>
 
extern void     *mmap64(void *, size_t, int, int, int, off64_t);
 
<sys/stat.h>
 
struct stat64;
 
extern int      stat64(const char *, struct stat64 *);
extern int      fstat64(int, struct stat64 *);
extern int      lstat64(const char *, struct stat64 *);
 
<sys/aio.h>
 
struct aiocb64;
int     aio_read64(int, struct aiocb64 *);
int     aio_write64(int, struct aiocb64 *);
int     aio_listio64(int, struct aiocb64 *[],
        int, struct sigevent *);
int     aio_cancel64(int, struct aiocb64 *);
int     aio_suspend64(int, struct aiocb64 *[]);

Common Pitfalls using the Large File Environment

Porting of application programs to the large-file environment can expose a number of different problems in the application. These problems are frequently the result of poor coding practices, which are harmless in a 32-bit off_t environment, but which can manifest themselves when compiled in a 64-bit off_t environment. The information below illustrates some of the more common problems and solutions.

Note: In the examples below, off_t is assumed to be a 64-bit file offset.

Improper Use of Data Types

The most obvious source of problems with application programs is a failure to use the proper data types. If an application attempts to store file sizes or file offsets in an integer variable, the resulting value will be truncated and lose significance. The proper technique for avoiding this problem is to use the off_t data type to store file sizes and offsets.

Incorrect

int file_size;
struct stat s;
 
file_size = s.st_size;

Better

off_t file_size;
struct stat s;
file_size = s.st_size;

Parameter Mismatches

Care must be taken when passing 64-bit integers to functions as arguments or when returning 64-bit integers from functions. Both the caller and the called function must agree on the types of the arguments and the return value in order to get correct results.

Passing a 32-bit integer to a function that expects a 64-bit integer causes the called function to misinterpret the caller's arguments, leading to unexpected behavior. This type of problem is especially severe if the program passes scalar values to a function that expects to receive a 64-bit integer.

Many of the problems can be avoided by careful use of function prototypes as illustrated below. In the code fragments below, fexample() is a function that takes a 64-bit file offset as a parameter. In the first example, the compiler generates the normal 32-bit integer function linkage, which would be incorrect since the receiving function expects 64-bit integer linkage. In the second example, the LL specifier is added, forcing the compiler to use the proper linkage. In the last example, the function prototype causes the compiler to promote the scalar value to a 64-bit integer. This is the preferred approach since the source code remains portable between 32- and 64-bit environments.

Incorrect

fexample(0);

Better

fexample(0LL);  

Best

void fexample(off_t);
 
fexample(0); 

Arithmetic Overflows

Even when an application uses the correct data types, it is still vulnerable to failures due to arithmetic overflows. This problem usually occurs when the application performs an arithmetic operation in 32 bits and the result overflows before it is promoted to the 64-bit data type. In the following example, blkno is a 32-bit block number. The multiplication of the block number by the block size occurs before the promotion, and overflow will occur if the block number is sufficiently large. This problem is especially destructive because the code uses the proper data types and works properly for small values, but fails for large values. The problem can be fixed by typecasting the values before the arithmetic operation.

Incorrect

int blkno;
off_t offset;
 
offset = blkno * BLKSIZE; 

Better

int blkno;
off_t offset;
offset = (off_t) blkno * BLKSIZE;

This problem can also appear when passing values based on fixed constants to functions that expect 64-bit parameters. In the example below, LONG_MAX+1 is evaluated in 32-bit long arithmetic and results in a negative number, which is sign-extended when it is passed to the function.

Incorrect

void fexample(off_t);
 
 fexample(LONG_MAX+1);                                          

Better

void fexample(off_t);
 
fexample((off_t)LONG_MAX+1);    
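The block-offset case can be made concrete with a small sketch. It uses int64_t as a stand-in for a 64-bit off_t so that it behaves the same wherever it is compiled, and it assumes a BLKSIZE of 4096 purely for illustration. The function names are hypothetical. The second function reproduces what a 32-bit multiply leaves behind, using unsigned arithmetic so that the wraparound is well defined in C.

```c
#include <stdint.h>

#define BLKSIZE 4096            /* assumed block size, illustration only */

/* Widen first, so that the product is computed in 64 bits. */
int64_t
block_offset(int32_t blkno)
{
        return (int64_t)blkno * BLKSIZE;
}

/* Low 32 bits of the product: all that a 32-bit multiply can keep. */
uint32_t
block_offset_low32(int32_t blkno)
{
        return (uint32_t)blkno * (uint32_t)BLKSIZE;
}
```

For block number 1048576 the true offset is 4294967296 (2 to the power 32), while the 32-bit result is 0, so a seek based on the narrow product would silently go to the start of the file.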

Fseek/Ftell

The data type used by fseek and ftell subroutines is long and cannot be redefined to the appropriate 64-bit data type in the _LARGE_FILES environment. Application programs that access large files and that use fseek and ftell need to be converted. This can be done in a number of ways. The fseeko and ftello subroutines are functionally equivalent to fseek and ftell except that the offset is given as an off_t instead of a long. Make sure to convert all variables that can be used to store offsets to the appropriate type.

Incorrect

long cur_offset, new_offset;
 
cur_offset = ftell(fp);
fseek(fp, new_offset, SEEK_SET);
           

Better

off_t cur_offset, new_offset;
 
cur_offset = ftello(fp);
fseeko(fp, new_offset, SEEK_SET);  
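A slightly fuller sketch of the same conversion, with the return values checked. The helper stream_size() is a hypothetical name; it finds the size of an open stream with fseeko and ftello and then restores the original position.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper.  Returns the size of the stream in bytes, or
 * -1 on error.  The current position is restored before returning.
 */
off_t
stream_size(FILE *fp)
{
        off_t saved, end;

        if ((saved = ftello(fp)) == -1)
                return -1;
        if (fseeko(fp, 0, SEEK_END) == -1)
                return -1;
        end = ftello(fp);
        if (fseeko(fp, saved, SEEK_SET) == -1)
                return -1;
        return end;
}
```

Because every offset is held in an off_t, the helper needs no source changes when off_t grows from 32 to 64 bits.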

Failure to Include Proper Header Files

In order for application programs to see the function and data type redefinitions, they must include the proper header files. This has the additional benefit of exposing the function prototypes for various subroutines, which enables stronger type-checking in the compiler.

Many application programs that call the open and creat subroutines do not include <fcntl.h>, which contains the defines for the various open modes. These programs typically hard code the open modes. This will cause runtime failures when the program is compiled in the _LARGE_FILES environment because the program does not call the proper open subroutine, and the resulting file descriptor is not enabled for large-file access. Programs must make sure to include the proper header files, especially in the _LARGE_FILES environment, to get visibility to the redefinitions of the environment.

Incorrect

fd = open("afile",2);
  

Better

#include <fcntl.h>
 
fd = open("afile",O_RDWR);

String Conversions

Converting file sizes and offsets to and from strings can cause problems when porting applications to the large-file environment. The printf format string for a 64-bit integer is different from that for a 32-bit integer. Programs that do these conversions must be careful to use the proper format specifier. This is especially difficult when the application needs to be portable between 32- and 64-bit environments since there is no portable format specifier between the two environments. One way to deal with this problem is to write offset converters that use the proper format for the size of off_t.

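#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

extern void fexample(off_t);    /* takes a file offset; defined elsewhere */

/* Convert a decimal string to an off_t of either size. */
off_t
atooff(const char *s)
{
         int       i = 0;       /* object type that matches %d */
         long long ll = 0;      /* object type that matches %lld */

         if (sizeof(off_t) == 4) {
                 sscanf(s, "%d", &i);
                 return (off_t)i;
         } else if (sizeof(off_t) == 8) {
                 sscanf(s, "%lld", &ll);
                 return (off_t)ll;
         }
         abort();               /* unexpected size of off_t */
}

int
main(int argc, char **argv)
{
         off_t offset;

         if (argc < 2)
                 return 1;
         offset = atooff(argv[1]);
         fexample(offset);
         return 0;
}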
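The opposite direction, from an offset to a string, can be handled with a single format if the value is first cast to the widest type involved. The sketch below assumes that long long is at least as wide as off_t, which holds for both environments described in this article. The helper name offtoa() is hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper.  Writes the offset into buf as decimal text
 * and returns what snprintf returns: the number of characters the
 * full text needs, not counting the terminating null character.
 */
int
offtoa(char *buf, size_t len, off_t o)
{
        /* The cast makes the argument match %lld for any off_t size. */
        return snprintf(buf, len, "%lld", (long long)o);
}
```

A return value that is greater than or equal to len means that the buffer was too small and the text was truncated.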

Imbedded File Offsets

Application programs that imbed file offsets or sizes in data structures may be affected by the change to the size of the off_t in the large-file environment. This problem can be especially severe if the data structure is shared between various applications or if the data structure is written into a file. In cases like this, the programmer must decide if it should continue to contain a 32-bit offset or if it should be converted to contain a 64-bit offset. If the application program needs to have a 32-bit file offset even if off_t is 64 bits, the program may use the new data type soff_t, a short off_t. This data type remains 32 bits even in the large-file environment. If the data structure is converted to a 64-bit offset, then all of the programs that deal with that structure must be converted to understand the new data structure format.
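The conversion decision can be illustrated with a record layout that uses fixed-width types, so that its size does not change with the compilation environment. Both structures and the upgrade_entry() function below are hypothetical examples, not AIX definitions. The soff_t type mentioned above is not used, so that the sketch compiles on other systems as well.

```c
#include <stdint.h>

/* Hypothetical old record format: 32-bit offset. */
struct entry_v1 {
        int32_t offset;
        int32_t length;
};

/* Hypothetical new record format: 64-bit offset, explicit padding. */
struct entry_v2 {
        int64_t offset;
        int32_t length;
        int32_t reserved;       /* keeps the record size at 16 bytes */
};

/* Convert an old record to the new format. */
struct entry_v2
upgrade_entry(struct entry_v1 old)
{
        struct entry_v2 e;

        e.offset = old.offset;  /* widening conversion, value preserved */
        e.length = old.length;
        e.reserved = 0;
        return e;
}
```

Every program that reads or writes the record must switch to the new layout at the same time, or must be able to recognize which of the two formats it has been given.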

File Size Limits

Application programs that are converted to be aware of large files may fail in their attempts to create large files due to the file-size resource limit. The file-size resource limit is a signed, 32-bit value that limits the maximum file offset to which a process can write in a regular file. Programs that need to write large files must have their file size limit set to RLIM_INFINITY.

struct rlimit r;
 
r.rlim_cur = r.rlim_max = RLIM_INFINITY;
setrlimit(RLIMIT_FSIZE,&r);
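Raising the hard limit normally requires privilege, so a setrlimit call that sets rlim_max to RLIM_INFINITY can fail for an ordinary process whose hard limit is lower. A more cautious sketch, using the hypothetical helper raise_fsize_limit(), raises only the soft limit as far as the existing hard limit allows and reports failure to the caller.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper.  Raises the soft file-size limit to the
 * current hard limit.  Returns 0 on success and -1 on failure.
 */
int
raise_fsize_limit(void)
{
        struct rlimit r;

        if (getrlimit(RLIMIT_FSIZE, &r) == -1)
                return -1;
        r.rlim_cur = r.rlim_max;        /* as high as the process may go */
        return setrlimit(RLIMIT_FSIZE, &r);
}
```

If the hard limit is itself below the size that the program needs, the limit must be raised outside the program before it is started.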

This limit may also be set from the Korn shell by issuing the command:

ulimit -f unlimited

To set this value permanently for a specific user, use the chuser command:

Example: chuser fsize_hard=-1 root

JFS File Size Limits

The maximum size of a file is ultimately a characteristic of the file system itself, not just the file size limit or the environment. For the JFS, the maximum file size is determined by the parameters used at the time the file system was made. For JFS file systems that are enabled for large files, the maximum file size is slightly less than 64 gigabytes (0xff8400000). For all other JFS file systems, the maximum file size is 2GB-1 (0x7fffffff). Attempts to write a file past the maximum file size in any file system format will fail, and errno will be set to EFBIG.

JFS2 File Size Limits

For the JFS2, the maximum file size is limited by the file system itself.

Related Information

Command Support for Files Larger than 2 Gigabytes

Chapter 5, File Systems and Directories

Working with File I/O

JFS2 File Space Allocation

