There are two protection domains in the operating system: the user protection domain and the kernel protection domain.
Application programs run in the user protection domain, which provides:
When a program is running in the user protection domain, the processor executes instructions in the problem state, and the program does not have direct access to kernel data.
The code in the kernel and kernel extensions runs in the kernel protection domain. This code includes interrupt handlers, kernel processes, device drivers, system calls, and file system code. The processor is in the kernel protection domain when it executes instructions in the privileged state, which provides:
Code running in the kernel protection domain can affect the execution environments of all processes because it:
Programming errors in the code running in the kernel protection domain can cause the operating system to fail. In particular, a process's user data cannot be accessed directly, but must be accessed using the copyin and copyout kernel services, or their variants. These routines protect the kernel from improperly supplied user data addresses.
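The role of the data-movement routines can be illustrated with a small user-space model. This is only a sketch: `sim_user_mem`, `sim_copyin`, and `sim_copyout` are hypothetical names, and the simulated "user space" is a single fixed buffer rather than a process address space. The point it shows is that the range is validated before any byte is moved, so a bad user address produces an error return instead of a kernel fault.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: "user space" is one fixed buffer. A real kernel
 * validates against the calling process's address space instead. */
#define SIM_USER_SIZE 64
static unsigned char sim_user_mem[SIM_USER_SIZE];

/* Returns 1 if [uoff, uoff + count) lies inside the simulated user region.
 * Written so that uoff + count can never overflow. */
static int sim_user_range_ok(size_t uoff, size_t count)
{
    return uoff <= SIM_USER_SIZE && count <= SIM_USER_SIZE - uoff;
}

/* Copy count bytes from user offset uoff into the kernel buffer kaddr.
 * Returns 0 on success, or EFAULT if the user range is invalid. */
int sim_copyin(size_t uoff, void *kaddr, size_t count)
{
    assert(kaddr != NULL);
    if (!sim_user_range_ok(uoff, count))
        return EFAULT;
    memcpy(kaddr, sim_user_mem + uoff, count);
    return 0;
}

/* Copy count bytes from the kernel buffer kaddr to user offset uoff.
 * Returns 0 on success, or EFAULT if the user range is invalid. */
int sim_copyout(const void *kaddr, size_t uoff, size_t count)
{
    assert(kaddr != NULL);
    if (!sim_user_range_ok(uoff, count))
        return EFAULT;
    memcpy(sim_user_mem + uoff, kaddr, count);
    return 0;
}
```

In this model, a caller treats a nonzero return as a failed request and never touches the destination buffer's contents.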
Application programs can gain controlled access to kernel data by making system calls. Programming libraries typically provide the functions that directly or indirectly invoke system calls, giving applications access to operating system functions.
When a user program invokes a system call, a system call instruction is executed, which causes the processor to begin executing the system call handler in the kernel protection domain. This system call handler performs the following actions:
The system loader maintains a table of the functions that are used for each system call.
The system call runs within the calling thread, but with more privilege because system calls run in the kernel protection domain. After the function implementing the system call has performed the requested action, control returns to the system call handler. If the ut_error field in the uthread structure has a non-zero value, the value is copied to the application's thread-specific errno variable. If a signal is pending, signal processing takes place, which can result in an application's signal handler being invoked. If no signals are pending, the system call handler restores the state of the calling thread, which is resumed in the user protection domain. For more information on protection domains, see Understanding Protection Domains.
A system call can access data that the calling thread cannot access because system calls execute in the kernel protection domain. The following are the general categories of kernel data:
System calls should use the kernel services to read or modify data traditionally found in the ublock or uthread structures. For example, the system call handler uses the value of the thread's ut_error field to update the thread-specific errno variable before returning to user mode. This field can be read or set by using the getuerror and setuerror kernel services. The current process ID can be obtained by using the getpid kernel service, and the current thread ID can be obtained by using the thread_self kernel service.
System calls can also access global memory such as the kernel and kernel data regions. These regions contain the code and static data for the system call as well as the rest of the kernel.
A system call routine runs on a protected stack associated with the calling thread, which allows a system call to execute properly even when the stack pointer of the calling thread is invalid. In addition, privileged data can be saved on the stack without danger of exposing the data to the calling thread.
Parameters are passed to system calls in the same way that parameters are passed to other functions, but some additional calling conventions and limitations apply.
First, system calls cannot have floating-point parameters. In fact, the operating system does not preserve the contents of floating-point registers when a system call is preempted by another thread, so system calls cannot use any floating-point operations.
Second, a system call in the 32-bit kernel cannot return a long long value to a 32-bit application. In 32-bit mode, long long values are returned in a pair of general purpose registers, GPR3 and GPR4. Only GPR3 is preserved by the system call handler before it returns to the application. A system call in the 32-bit kernel can return a 64-bit value to a 64-bit application, but the saveretval64 kernel service must be used.
Third, since a system call runs on its own stack, the number of arguments that can be passed to a system call is limited. The operating system linkage conventions specify that up to eight general purpose registers are used for parameter passing. If more parameters exist than will fit in eight registers, the remaining parameters are passed in the stack. Because a system call does not have direct access to the application's stack, all parameters for system calls must fit in eight registers.
Some parameters are passed in multiple registers. For example, 32-bit applications pass long long parameters in two registers, and structures passed by value can require multiple registers, depending on the structure size. The writer of a system call should be familiar with the way parameters are passed by the compiler and ensure that the 8-register limit is not exceeded. For more information on parameter calling conventions, see Subroutine Linkage Convention in Assembler Language Reference.
Finally, because 32- and 64-bit applications are supported by both the 32- and 64-bit kernels, the data model used by the kernel does not always match the data model used by the application. When the data models do not match, the system call might have to perform extra processing before parameters can be used.
Regardless of whether the 32-bit or 64-bit kernel is running, the interface that is provided by the kernel to applications must be identical. This simplifies the development of applications and libraries, because their behavior does not depend on the mode of the kernel. On the other hand, system calls might need to know the mode of the calling process. The IS64U macro can be used to determine if the caller of a system call is a 64-bit process. For more information on the IS64U macro, see IS64U Kernel Service in AIX 5L Version 5.2 Technical Reference: Kernel and Subsystems Volume 1.
The ILP32 and LP64 data models differ in the way that pointers and long and long long parameters are treated when used in structures or passed as functional parameters. The following tables summarize the differences.
ILP32 data model (32-bit applications):

Type | Size | Used as Parameter |
---|---|---|
long | 32 bits | One register |
pointer | 32 bits | One register |
long long | 64 bits | Two registers |

LP64 data model (64-bit applications):

Type | Size | Used as Parameter |
---|---|---|
long | 64 bits | One register |
pointer | 64 bits | One register |
long long | 64 bits | One register |
System calls using these types must take the differing data models into account. The treatment of these types depends on whether they are used as parameters or in structures passed as parameters by value or by reference.
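The register counts in the tables combine with the 8-register limit described earlier, and a quick tally can show whether a proposed prototype fits. The following is a minimal sketch, not part of any kernel interface: the `sim_` names are hypothetical, only scalar parameters are modeled, and structures passed by value and any ABI-specific padding are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of scalar parameter types. */
enum sim_ptype { SIM_INT, SIM_LONG, SIM_POINTER, SIM_LONG_LONG };
enum sim_model { SIM_ILP32, SIM_LP64 };

#define SIM_MAX_PARAM_REGS 8

/* Registers one scalar parameter occupies under the given data model:
 * only long long in ILP32 needs two registers. */
static int sim_regs_for(enum sim_ptype t, enum sim_model m)
{
    return (t == SIM_LONG_LONG && m == SIM_ILP32) ? 2 : 1;
}

/* Total registers used by a parameter list as seen by the caller. */
int sim_count_param_regs(const enum sim_ptype *params, size_t n,
                         enum sim_model m)
{
    int total = 0;

    assert(params != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += sim_regs_for(params[i], m);
    return total;
}

/* Returns 1 if the whole list can be passed in parameter registers. */
int sim_fits_in_registers(const enum sim_ptype *params, size_t n,
                          enum sim_model m)
{
    return sim_count_param_regs(params, n, m) <= SIM_MAX_PARAM_REGS;
}
```

A prototype must be checked against the ILP32 count even when the kernel is 64-bit, because a 32-bit caller may use more registers for the same parameter list.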
Scalar parameters (pointers and integral values) are passed in registers. The combinations of kernel and application modes are:
When a 32-bit application makes a system call to the 64-bit kernel, the system call handler zeros the high-order word of each parameter register. This allows 64-bit system calls to use pointers and unsigned long parameters directly. Signed and unsigned integer parameters can also be used directly by 64-bit system calls. This is because in 64-bit mode, the compiler generates code that sign extends or zero fills integers passed as parameters. Similar processing is performed for char and short parameters, so these types do not require any special handling either. Only signed long and long long parameters need additional processing.
To convert a 32-bit signed long parameter to a 64-bit value, the 32-bit value must be sign extended. The LONG32TOLONG64 macro is provided for this operation. It converts a 32-bit signed value into a 64-bit signed value, as shown in this example:
    syscall1(long incr)
    {
        /* If the caller is a 32-bit process, convert
         * 'incr' to a signed, 64-bit value.
         */
        if (!IS64U)
            incr = LONG32TOLONG64(incr);
        . . .
    }
If a parameter can be either a pointer or a symbolic constant, special handling is needed. For example, if -1 is passed as a pointer argument to indicate a special case, comparing the pointer to -1 will fail, as will unconditionally sign-extending the parameter value. Code similar to the following should be used:
    syscall2(void *ptr)
    {
        /* If caller is a 32-bit process,
         * check for special parameter value.
         */
        if (!IS64U && (LONG32TOLONG64(ptr) == -1))
            ptr = (void *)-1;

        if (ptr == (void *)-1)
            special_handling();
        else {
            . . .
        }
    }
Similar treatment is required when an unsigned long parameter is interpreted as a signed value.
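The arithmetic behind these conversions can be checked in portable C. This is a sketch under stated assumptions: `sim_zero_high_word`, `sim_long32_to_long64`, and `sim_is_minus_one_ptr` are hypothetical stand-ins that model the behavior described above, not the kernel macros themselves, and registers are represented as `uint64_t` values.

```c
#include <stdint.h>

/* What the system call handler leaves in a 64-bit parameter register
 * for a 32-bit caller: the low word, with the high word zeroed. */
uint64_t sim_zero_high_word(int32_t value)
{
    return (uint64_t)(uint32_t)value;
}

/* Hypothetical stand-in for LONG32TOLONG64: reinterpret the low 32 bits
 * as a signed value and widen it. The uint32_t to int32_t conversion is
 * implementation-defined in C, but wraps on two's complement systems. */
int64_t sim_long32_to_long64(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}

/* Returns 1 if a pointer-or-constant parameter holds the sentinel -1.
 * For a 32-bit caller the register holds 0x00000000FFFFFFFF, so the
 * low word must be sign-extended before the comparison. */
int sim_is_minus_one_ptr(uint64_t reg, int caller_is_64bit)
{
    if (!caller_is_64bit)
        return sim_long32_to_long64(reg) == -1;
    return reg == UINT64_MAX;
}
```

For a 32-bit caller passing -5, the register reads as 4294967291 if used directly, and as -5 only after sign extension.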
A 32-bit application passes a long long parameter in two registers, while a 64-bit kernel system call uses a single register for a long long parameter value.
The system call function prototype cannot match the function prototype used by the application. Instead, each long long parameter should be replaced by a pair of uintptr_t parameters. Subsequent parameters should be replaced with uintptr_t parameters as well. When the caller is a 32-bit process, a single 64-bit value will be constructed from two consecutive parameters. This operation can be performed using the INTSTOLLONG macro. For a 64-bit caller, a single parameter is used directly.
For example, suppose the application function prototype is:
syscall3(void *ptr, long long len1, long long len2, int size);
The corresponding system call code should be similar to:
    syscall3(void *ptr, uintptr_t L1, uintptr_t L2, uintptr_t L3,
             uintptr_t L4, uintptr_t L5)
    {
        long len1;
        long len2;
        int size;

        /* If caller is a 32-bit application, len1
         * and len2 must be constructed from pairs of
         * parameters. Otherwise, a single parameter
         * can be used for each length.
         */
        if (!IS64U) {
            len1 = INTSTOLLONG(L1, L2);
            len2 = INTSTOLLONG(L3, L4);
            size = (int)L5;
        }
        else {
            len1 = (long)L1;
            len2 = (long)L2;
            size = (int)L3;
        }
        . . .
    }
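The way the trailing parameters shift position between the two callers can be modeled in portable C. This is a sketch only: `sim_ints_to_llong` is a hypothetical stand-in for INTSTOLLONG, and it assumes the first register of each pair carries the high-order word, which matches the GPR3/GPR4 convention described for return values but is an assumption for parameters.

```c
#include <stdint.h>

/* Hypothetical stand-in for INTSTOLLONG: build one 64-bit value from
 * the low words of two consecutive parameter registers.
 * Assumption: the first register holds the high-order word. */
int64_t sim_ints_to_llong(uint64_t high_reg, uint64_t low_reg)
{
    uint64_t bits = ((high_reg & 0xFFFFFFFFu) << 32)
                  | (low_reg & 0xFFFFFFFFu);
    return (int64_t)bits;
}

/* Decoded view of the len1, len2, and size parameters. */
struct sim_syscall3_args {
    int64_t len1;
    int64_t len2;
    int size;
};

/* Decode five generic register values according to the caller's mode.
 * A 32-bit caller uses L1..L5; a 64-bit caller uses only L1..L3. */
struct sim_syscall3_args sim_decode_syscall3(int caller_is_64bit,
                                             uint64_t L1, uint64_t L2,
                                             uint64_t L3, uint64_t L4,
                                             uint64_t L5)
{
    struct sim_syscall3_args a;

    if (!caller_is_64bit) {
        a.len1 = sim_ints_to_llong(L1, L2);
        a.len2 = sim_ints_to_llong(L3, L4);
        a.size = (int)L5;
    } else {
        a.len1 = (int64_t)L1;
        a.len2 = (int64_t)L2;
        a.size = (int)L3;
    }
    return a;
}
```

Both callers decode to the same logical arguments even though the 32-bit caller occupies two more registers.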
For the most part, system call parameters from a 64-bit application can be used directly by 64-bit system calls. The system call handler does not modify the parameter registers, so the system call sees the same values that were passed by the application. The only exceptions are the pid_t and key_t types, which are 32-bit signed types in 64-bit applications, but are 64-bit signed types in 64-bit system calls. Before these two types can be used, the 32-bit parameter values must be sign extended using the LONG32TOLONG64 macro.
No special parameter processing is required when 32-bit applications call 32-bit system calls. Application parameters can be used directly by system calls.
When 64-bit applications make system calls, 64-bit parameters are passed in registers. When 32-bit system calls are running, the high-order words of the parameter registers are not visible, so 64-bit parameters cannot be obtained directly. To allow 64-bit parameter values to be used by 32-bit system calls, the system call handler saves the high-order word of each 64-bit parameter register in a save area associated with the current thread. If a system call needs to obtain the full 64-bit value, use the get64bitparm kernel service.
If a 64-bit parameter is an address, the system call might not be able to use the address directly. Instead, it might be necessary to map the 64-bit address into a 32-bit address, which can be passed to various kernel services.
When a 32-bit system call function is called by the system call handler on behalf of a 64-bit process, the parameter registers are treated as 32-bit registers, and the system call function can only see the low-order word of each parameter. For integer, char, or short parameters, the parameter can be used directly. Otherwise, the get64bitparm kernel service must be called to obtain the full 64-bit parameter value. This kernel service takes two parameters: the value of the parameter as seen by the system call function, and the zero-based index of the parameter to be obtained. The value passed is the low-order word of the original 64-bit parameter, and it is combined with the high-order word that was saved by the system call handler, allowing the original 64-bit parameter to be returned as a long long value.
For example, suppose that the first and third parameters of a system call are 64-bit values. The full parameter values are obtained as shown:
    #include <sys/types.h>

    syscall4(char *str, int fd, long count)
    {
        ptr64 str64;
        int64 count64;

        if (IS64U) {
            /* get 64-bit address. */
            str64 = get64bitparm(str, 0);
            /* get 64-bit value */
            count64 = get64bitparm(count, 2);
        }
        . . .
    }
The get64bitparm kernel service must not be used when the caller is a 32-bit process, nor should it be used when the parameter type is an int or smaller. In these cases, the system call parameter can be used directly. For example, the fd parameter in the previous example can be used directly.
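The split-and-recombine mechanism can be modeled in a few lines of portable C. This is a sketch with hypothetical names (`sim_saved_high`, `sim_handler_save_parms`, `sim_get64bitparm`); the real save area is per-thread and managed by the system call handler, whereas the model uses a single static array.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NPARMS 8

/* Hypothetical stand-in for the per-thread save area that holds the
 * high-order word of each parameter register. */
static uint32_t sim_saved_high[SIM_NPARMS];

/* Handler side: save each high-order word and produce the low-order
 * words, which are all that a 32-bit system call function can see. */
void sim_handler_save_parms(const uint64_t regs[SIM_NPARMS],
                            uint32_t low[SIM_NPARMS])
{
    for (int i = 0; i < SIM_NPARMS; i++) {
        sim_saved_high[i] = (uint32_t)(regs[i] >> 32);
        low[i] = (uint32_t)regs[i];
    }
}

/* Model of get64bitparm, using the (value, index) argument order from
 * the example above: combine the visible low word with the saved high
 * word for the given zero-based parameter index. */
uint64_t sim_get64bitparm(uint32_t low32, int parmnum)
{
    assert(parmnum >= 0 && parmnum < SIM_NPARMS);
    return ((uint64_t)sim_saved_high[parmnum] << 32) | low32;
}
```

Passing the wrong index silently combines the low word with another parameter's high word, which is why the index must match the parameter's position in the prototype.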
When a system call parameter is a pointer passed from a 64-bit application, the full 64-bit address is obtained by calling the get64bitparm kernel service. Thereafter, consideration must be given as to how the address will be used.
A system call can use a 64-bit address to access user-space memory by calling one of the 64-bit data-movement kernel services, such as copyin64, copyout64, or copyinstr64. Alternatively, if the user address is to be passed to kernel services that expect 32-bit addresses, the 64-bit address should be mapped to a 32-bit address.
Mapping associates a 32-bit value with a 64-bit address. This 32-bit value can be passed to kernel services in the 32-bit kernel that expect pointer parameters. When the 32-bit value is passed to a data-movement kernel service, such as copyin or copyout, the original 64-bit address will be obtained and used. Address mapping allows common code to be used for many kernel services. Only the data-movement routines need to be aware of the address mapping.
Consider a system call that takes a path name and a buffer pointer as parameters. This system call will use the path name to obtain information about the file, and use the buffer pointer to return the information. Because pathname is passed to the lookupname kernel service, which takes a 32-bit pointer, the pathname parameter must be mapped. The buffer address can be used directly. For example:
    int syscall5 (
        char *pathname,
        char *buffer)
    {
        ptr64 upathname;
        ptr64 ubuffer;
        struct vnode *vp;
        struct cred *crp;

        /* If 64-bit application, obtain 64-bit parameter
         * values and map "pathname".
         */
        if (IS64U) {
            upathname = get64bitparm(pathname, 0);
            /* The as_remap64() call modifies pathname. */
            as_remap64(upathname, MAXPATH, &pathname);
            ubuffer = get64bitparm(buffer, 1);
        }
        else {
            /* For 32-bit process, convert 32-bit address
             * to 64-bit address.
             */
            ubuffer = (ptr64)buffer;
        }
        crp = crref();
        rc = lookupname(pathname, USR, L_SEARCH, NULL, &vp, crp);
        getinfo(vp, &local_buffer);
        /* Copy information to user space,
         * for both 32-bit and 64-bit applications.
         */
        rc = copyout64(&local_buffer, ubuffer, strlen(local_buffer));
        . . .
    }
The function prototype for the get64bitparm kernel service is found in the sys/remap.h header file. To allow common code to be written, the get64bitparm kernel service is defined as a macro when compiling in 64-bit mode. The macro simply returns the specified parameter value, as this value is already a full 64-bit value.
In some cases, a system call or kernel service will need to obtain the original 64-bit address from the 32-bit mapped address. The as_unremap64 kernel service is used for this purpose.
For some system calls, it is necessary to return a 64-bit value to 64-bit applications. The 64-bit application expects the 64-bit value to be contained in a single register. A 32-bit system call, however, has no way to set the high-order word of a 64-bit register.
The saveretval64 kernel service allows a 32-bit system call to return a 64-bit value to a 64-bit application. This kernel service takes a single long long parameter, saves the low-order word (passed in GPR4) in a save area for the current thread, and returns the original parameter. Depending on the return type of the system call function, this value can be returned to the system call handler, or the high-order word of the full 64-bit return value can be returned.
After the system call function returns to the system call handler, the original 64-bit return value will be reconstructed in GPR3, and returned to the application. If the saveretval64 kernel service is not called by the system call, the high-order word of GPR3 is zeroed before returning to the application. For example:
    void *
    syscall6 (
        int arg)
    {
        if (IS64U) {
            ptr64 rc = f(arg);
            saveretval64(rc);            /* Save low-order word */
            return (void *)(rc >> 32);   /* Return high-order word as
                                          * 32-bit address */
        }
        else {
            return (void *)f(arg);
        }
    }
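The return path can be modeled the same way as the parameter path. In this sketch, `sim_saveretval64` and `sim_handler_return` are hypothetical stand-ins for the kernel service and for the handler's reconstruction step, and a pair of static variables replaces the per-thread save area.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-thread save area. */
static uint32_t sim_saved_low;
static int sim_low_saved;

/* Model of saveretval64: remember the low-order word and return the
 * original value unchanged. */
uint64_t sim_saveretval64(uint64_t value)
{
    sim_saved_low = (uint32_t)value;
    sim_low_saved = 1;
    return value;
}

/* Handler side: rebuild the 64-bit result from the 32-bit value that
 * the system call function returned. If nothing was saved, the high
 * word of the result is zero. */
uint64_t sim_handler_return(uint32_t returned_word)
{
    uint64_t result;

    if (sim_low_saved)
        result = ((uint64_t)returned_word << 32) | sim_saved_low;
    else
        result = returned_word;
    sim_low_saved = 0;  /* the save area is consumed by one return */
    return result;
}
```

The system call function in this model returns the high-order word, as in the example above, and the handler supplies the low-order word from the save area.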
When structures are passed to or from system calls, whether by value or by reference, the layout of the structure in the application might not match the layout of the same structure in the system call. There are two ways that system calls can process structures passed from or to applications: structure reshaping and dual implementation.
Structure reshaping allows system calls to support both 32- and 64-bit applications using a single system call interface and using code that is predominately common to both application types.
Structure reshaping requires defining more than one version of a structure. One version of the structure is used internally by the system call to process the request. The other version should use size-invariant types, so that the layout of the structure fields matches the application's view of the structures. When a structure is copied in from user space, the application-view structure definition is used. The structure is reshaped by copying each field of the application's structure to the kernel's structure, converting the fields as required. A similar conversion is performed on structures that are being returned to the caller.
Structure reshaping is used for structures whose size and layout as seen by an application differ from the size and layout as seen by the system call. If the system call uses a structure definition with fields big enough for both 32- and 64-bit applications, the system call can use this structure, independent of the mode of the caller.
While reshaping requires two versions of a structure, only one version is public and visible to the end user. This version is the natural structure, which can also be used by the system call if reshaping is not needed. The private version should only be defined in the source file that performs the reshaping. The following example demonstrates the techniques for passing structures to system calls that are running in the 64-bit kernel and how a structure can be reshaped:
    /* Public definition */
    struct foo {
        int a;
        long b;
    };

    /* Private definition--matches 32-bit
     * application's view of the data structure. */
    struct foo32 {
        int a;
        int b;
    };

    syscall7(struct foo *f)
    {
        struct foo f1;
        struct foo32 f2;

        /* copyin takes the user-space source address first,
         * then the kernel destination and the byte count. */
        if (IS64U()) {
            copyin(f, &f1, sizeof(f1));
        }
        else {
            copyin(f, &f2, sizeof(f2));
            f1.a = f2.a;
            f1.b = f2.b;
        }
        /* Common structure f1 used from now on. */
        . . .
    }
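Reshaping in both directions can be sketched in portable C with fixed-width types, so the layouts do not depend on the compiler's own data model. The names below are hypothetical, `memcpy` stands in for the data-movement kernel services, and the range check on the outbound path is an added assumption rather than something the text prescribes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct sim_foo   { int32_t a; int64_t b; };  /* kernel (LP64-style) view   */
struct sim_foo32 { int32_t a; int32_t b; };  /* 32-bit application's view */

/* Copy a structure in from the caller's buffer, reshaping it when the
 * caller is a 32-bit process. */
void sim_foo_reshape_in(const void *ubuf, int caller_is_64bit,
                        struct sim_foo *out)
{
    assert(ubuf != NULL && out != NULL);
    if (caller_is_64bit) {
        memcpy(out, ubuf, sizeof *out);
    } else {
        struct sim_foo32 f32;

        memcpy(&f32, ubuf, sizeof f32);
        out->a = f32.a;
        out->b = f32.b;  /* widening conversion sign-extends */
    }
}

/* Copy a structure out to the caller's buffer. Returns 0 on success,
 * or -1 if field b cannot be represented for a 32-bit caller. */
int sim_foo_reshape_out(const struct sim_foo *in, int caller_is_64bit,
                        void *ubuf)
{
    struct sim_foo32 f32;

    assert(in != NULL && ubuf != NULL);
    if (caller_is_64bit) {
        memcpy(ubuf, in, sizeof *in);
        return 0;
    }
    if (in->b < INT32_MIN || in->b > INT32_MAX)
        return -1;
    f32.a = in->a;
    f32.b = (int32_t)in->b;
    memcpy(ubuf, &f32, sizeof f32);
    return 0;
}
```

The inbound direction can always widen, but the outbound direction narrows, so a system call that returns structures needs a policy for values that do not fit the 32-bit layout.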
The dual implementation approach involves separate code paths for calls from 32-bit applications and calls from 64-bit applications. Similar to reshaping, the system call code defines a private view of the application's structure. With dual implementations, the function syscall7 could be rewritten as:
    syscall8(struct foo *f)
    {
        struct foo f1;
        struct foo32 f2;

        /* copyin takes the user-space source address first,
         * then the kernel destination and the byte count. */
        if (IS64U()) {
            copyin(f, &f1, sizeof(f1));
            /* Code for 64-bit process uses f1 */
            . . .
        }
        else {
            copyin(f, &f2, sizeof(f2));
            /* Code for 32-bit process uses f2 */
            . . .
        }
    }
Dual implementation is most appropriate when the structures are so large that the overhead of reshaping would affect the performance of the system call.
When structures are passed by value, the structure is loaded into as many parameter registers as are needed. When the data model of an application and the data model of the kernel extension differ, the values in the registers cannot be used directly. Instead, the registers must be stored in a temporary variable. For example:
    /* Application prototype: syscall9(struct foo f); */

    syscall9(unsigned long a1, unsigned long a2)
    {
        union {
            struct foo f1;          /* Structure for 64-bit caller. */
            struct foo32 f2;        /* Structure for 32-bit caller. */
            unsigned long p64[2];   /* Overlay for parameter registers
                                     * when caller is 64-bit program */
            unsigned int p32[2];    /* Overlay for parameter registers
                                     * when caller is 32-bit program */
        } uarg;

        if (IS64U()) {
            uarg.p64[0] = a1;
            uarg.p64[1] = a2;
            /* Now uarg.f1 can be used */
            . . .
        }
        else {
            uarg.p32[0] = a1;
            uarg.p32[1] = a2;
            /* Now uarg.f2 can be used */
            . . .
        }
    }
In AIX 4.3, the conventions for passing parameters from a 64-bit application to a system call required user-space library code to perform some of the parameter reshaping and address mapping. In AIX 5.1 and later, all parameter reshaping and address mapping should be performed by the system call, eliminating the need for kernel-specific library code. In fact, user-space address mapping is no longer supported. In most cases, system calls can be implemented without any application-specific library code.
The kernel allows a thread to be preempted by a more favored thread, even when a system call is executing. This capability provides better system responsiveness for large multi-user systems.
Because system calls can be preempted, access to global data must be serialized. Kernel locking services, such as simple_lock and simple_unlock, are frequently used to serialize access to kernel data. A thread can be preempted even when it owns a lock. If multiple locks are obtained by system calls, a technique must be used to prevent multiple threads from deadlocking. One technique is to define a lock hierarchy. A system call must never return while holding a lock. For more information on locking, see Understanding Locking.
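A lock hierarchy amounts to assigning each lock a rank and only ever acquiring locks in increasing rank order. The following single-threaded sketch checks that rule; it does not provide mutual exclusion and does not use simple_lock or simple_unlock. The `sim_` names and the rank field are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_HELD 8

/* A lock with a fixed position in the hierarchy. */
struct sim_lock {
    int rank;  /* higher rank must be acquired later */
    int held;
};

static int sim_held_ranks[SIM_MAX_HELD];
static int sim_nheld;

/* Acquire the lock only if its rank is above every lock already held.
 * Returns 0 on success, or -1 on a hierarchy violation. */
int sim_lock_ordered(struct sim_lock *l)
{
    assert(l != NULL);
    if (l->held || sim_nheld == SIM_MAX_HELD)
        return -1;
    if (sim_nheld > 0 && l->rank <= sim_held_ranks[sim_nheld - 1])
        return -1;
    sim_held_ranks[sim_nheld++] = l->rank;
    l->held = 1;
    return 0;
}

/* Release the most recently acquired lock (last in, first out).
 * Returns 0 on success, or -1 if l is not the newest held lock. */
int sim_unlock_ordered(struct sim_lock *l)
{
    assert(l != NULL);
    if (!l->held || sim_nheld == 0 ||
        sim_held_ranks[sim_nheld - 1] != l->rank)
        return -1;
    sim_nheld--;
    l->held = 0;
    return 0;
}

/* Number of locks currently held; this must be 0 when a system call
 * returns. */
int sim_locks_held(void)
{
    return sim_nheld;
}
```

Because every thread takes locks in the same order, no cycle of threads each waiting for a lock held by the next can form.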
Signals can be generated asynchronously or synchronously with respect to the thread that receives the signal. An asynchronously generated signal is one that results from some action external to a thread. It is not directly related to the current instruction stream of that thread. Generally these are generated by other threads or by device drivers.
A synchronously generated signal is one that results from the current instruction stream of the thread. These signals cause interrupts. Examples of such cases are the execution of an illegal instruction, or an attempted data access to nonexistent address space.
Delivery of signals to a thread only takes place when a user application is about to be resumed in the user protection domain. Signals cannot be delivered to a thread if the thread is in the middle of a system call. For more information on signal delivery for kernel processes, see Using Kernel Processes.
An asynchronous signal can alter the operation of a system call or kernel extension by terminating a long wait. Kernel services such as e_block_thread, e_sleep_thread, and et_wait are affected by signals. The following options are provided when a signal is posted to a thread:
The sleep kernel service, provided for compatibility, also supports the PCATCH and SWAKEONSIG options to control the response to a signal during the sleep function.
Previously, the kernel automatically saved context on entry to the system call handler. As a result, any long (interruptible) sleep not specifying the PCATCH option returned control to the saved context when a signal interrupted the wait. The system call handler then set the errno global variable to EINTR and returned a return code of -1 from the system call.
The kernel, however, requires each system call that can directly or indirectly issue a sleep call without the PCATCH option to set up a saved context using the setjmpx kernel service. This is done to avoid overhead for system calls that handle waits terminated by signals. Using the setjmpx service, the system can set up a saved context, which sets the system call return code to a -1 and the ut_error field to EINTR, if a signal interrupts a long wait not specifying return-from-signal.
It is probably faster and more robust to specify return-from-signal on all long waits and use the return code to control the system call return.
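The two ways an interrupted wait can end are easy to compare in a small model. This sketch uses the standard C `setjmp` and `longjmp` functions as a stand-in for setjmpx and longjmpx, and a flag argument in place of a real signal; `SIM_PCATCH`, `sim_sleep`, and `sim_syscall_wait` are hypothetical names.

```c
#include <errno.h>
#include <setjmp.h>

#define SIM_PCATCH 0x1  /* hypothetical return-from-signal flag */

static jmp_buf sim_saved_context;
static int sim_ut_error;
static int sim_context_resumed;  /* counts resumptions of the saved context */

/* Model of a long sleep. When a signal arrives, either return 1 to the
 * caller (SIM_PCATCH set) or resume the saved context. */
static int sim_sleep(int flags, int signal_arrives)
{
    if (!signal_arrives)
        return 0;
    if (flags & SIM_PCATCH)
        return 1;
    longjmp(sim_saved_context, 1);
    return 0;  /* not reached */
}

/* Model of a system call that performs one long wait.
 * Returns 0 on success, or -1 with sim_ut_error set to EINTR. */
int sim_syscall_wait(int flags, int signal_arrives)
{
    sim_ut_error = 0;
    if (setjmp(sim_saved_context) != 0) {
        /* Reached only when the sleep resumed the saved context. */
        sim_context_resumed++;
        sim_ut_error = EINTR;
        return -1;
    }
    if (sim_sleep(flags, signal_arrives) != 0) {
        /* Return-from-signal path: ordinary return code handling. */
        sim_ut_error = EINTR;
        return -1;
    }
    return 0;
}
```

Both paths report EINTR, but the return-from-signal path unwinds through normal returns, so any cleanup code after the wait still runs.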
The kernel supports nested calls to the setjmpx kernel service. It implements the stack of saved contexts by maintaining a linked list of context information anchored in the machine state save area. This area is in the user block structure for a process. Interrupt handlers have special machine state save areas.
An initial context is set up for each process by the initp kernel service for kernel processes and by the fork subroutine for user processes. The process terminates if that context is resumed.
Exceptions are interrupts detected by the processor as a result of the current instruction stream. They therefore take effect synchronously with respect to the current thread.
The default exception handler generates a signal if the process is in a state where signals can be delivered immediately. Otherwise, the default exception handler generates a system dump.
For certain types of exceptions, a system call can specify unique exception-handler routines through calls to the setjmpx service. The exception handler routine is saved as part of the stacked saved context. Each exception handler is passed the exception type as a parameter.
The exception handler returns a value that can specify any of the following:
If the exception handler did not handle the exception, then the next exception handler in the stack of contexts is called. If none of the stacked exception handlers handle the exception, the kernel performs default exception handling. The setjmpx and longjmpx kernel services help implement exception handlers.
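The search through the stacked contexts can be modeled as a walk over a linked list, newest context first. This is a sketch with hypothetical names and a simplified two-value handler result; it does not reproduce the setjmpx interface or the actual return values an exception handler can use.

```c
#include <stddef.h>

enum sim_exc_result { SIM_EXC_HANDLED, SIM_EXC_NOT_HANDLED };

typedef enum sim_exc_result (*sim_exc_handler)(int exc_type);

/* One saved context; each points to the next-older context. */
struct sim_context {
    sim_exc_handler handler;
    struct sim_context *next;
};

static struct sim_context *sim_context_top;
static int sim_default_runs;  /* times default handling was reached */

/* Push a context with its exception handler onto the stack. */
void sim_push_context(struct sim_context *c, sim_exc_handler h)
{
    c->handler = h;
    c->next = sim_context_top;
    sim_context_top = c;
}

/* Pop the newest context, if any. */
void sim_pop_context(void)
{
    if (sim_context_top != NULL)
        sim_context_top = sim_context_top->next;
}

/* Offer the exception to each stacked handler, newest first.
 * Returns 1 if a handler took it, or 0 if default handling ran. */
int sim_raise_exception(int exc_type)
{
    struct sim_context *c;

    for (c = sim_context_top; c != NULL; c = c->next)
        if (c->handler != NULL &&
            c->handler(exc_type) == SIM_EXC_HANDLED)
            return 1;
    sim_default_runs++;
    return 0;
}

/* Example handlers: one accepts only the hypothetical exception
 * type 1, the other declines everything. */
enum sim_exc_result sim_handle_type1(int exc_type)
{
    return exc_type == 1 ? SIM_EXC_HANDLED : SIM_EXC_NOT_HANDLED;
}

enum sim_exc_result sim_decline_all(int exc_type)
{
    (void)exc_type;
    return SIM_EXC_NOT_HANDLED;
}
```

An inner handler that declines an exception does not hide it: the outer handler still gets a chance before default handling runs.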
The operating system supports nested system calls with some restrictions. System calls (and any other kernel-mode routines running under the process environment of a user-mode process) can use system calls that pass all parameters by value. System calls and other kernel-mode routines must not start system calls that have one or more parameters passed by reference. Doing so can result in a system crash. This is because system calls with reference parameters assume that the referenced data area is in the user protection domain. As a result, these system calls must use special kernel services to access the data. However, these services are unsuccessful if the data area they are trying to access is not in the user protection domain.
This restriction does not apply to kernel processes. User-mode data access services can distinguish between kernel processes and user-mode processes in kernel mode. As a result, these services can access the referenced data areas correctly when the caller is a kernel process.
Kernel processes cannot call the fork or exec system calls, among others. A list of the base operating system calls available to system calls or other routines in kernel mode is provided in System Calls Available to Kernel Extensions.
Most data accessed by system calls is pageable by default. This includes the system call code, static data, dynamically allocated data, and stack. As a result, a system call can be preempted in two ways:
In the latter case, even less-favored processes can run while the system call is waiting for the paging I/O to complete.
Error information returned by system calls differs from that returned by kernel services that are not system calls. System calls typically return a special value, such as -1 or NULL, to indicate that an error has occurred. When an error condition is to be returned, the ut_error field should be updated by the system call before returning from the system call function. The ut_error field is written using the setuerror kernel service.
Before actually calling the system call function, the system call handler sets the ut_error field to 0. Upon return from the system call function, the system call handler copies the value found in ut_error into the thread-specific errno variable if ut_error was nonzero. After setting the errno variable, the system call handler returns to user mode with the return code provided by the system call function.
Kernel-mode callers of system calls must be aware of this return code convention and use the getuerror kernel service to obtain the error value when an error indication is returned by the system call. When system calls are nested, the system call function called by the system call handler can return the error value provided by the nested system call function or can replace this value with a new one by using the setuerror kernel service.
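The error convention, including the nested case, can be traced with a small model. The sketch below uses hypothetical stand-ins (`sim_getuerror`, `sim_setuerror`, `sim_syscall_handler`) and plain static variables in place of the thread's ut_error field and errno variable.

```c
#include <errno.h>

static int sim_ut_error;  /* stand-in for the thread's ut_error field */
static int sim_errno;     /* stand-in for the thread-specific errno   */

int sim_getuerror(void)   { return sim_ut_error; }
void sim_setuerror(int e) { sim_ut_error = e; }

typedef int (*sim_syscall_fn)(int arg);

/* Handler model: clear ut_error, call the function, then copy a
 * nonzero ut_error to errno. errno is left alone on success. */
int sim_syscall_handler(sim_syscall_fn fn, int arg)
{
    int rc;

    sim_ut_error = 0;
    rc = fn(arg);
    if (sim_ut_error != 0)
        sim_errno = sim_ut_error;
    return rc;
}

/* Inner system call function: rejects negative arguments. */
int sim_inner(int arg)
{
    if (arg < 0) {
        sim_setuerror(EINVAL);
        return -1;
    }
    return arg * 2;
}

/* Outer system call function: calls the inner function directly
 * (a nested call) and replaces its error value with EIO. */
int sim_outer(int arg)
{
    int rc = sim_inner(arg);

    if (rc == -1 && sim_getuerror() != 0)
        sim_setuerror(EIO);
    return rc;
}
```

Only the outermost call passes through the handler, so the error value that reaches errno is whatever ut_error holds when the outer function returns.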