The three main areas of source-code tuning are as follows:
In addition to these source-code tuning techniques, the fdpr program restructures object code. The fdpr program is described in Restructuring Executable Programs with the fdpr Program.
To produce a program that achieves good performance, the first step is to take advantage of the basic optimization features built into the compiler. Doing so can increase the speedup that comes from tuning your program and can remove the need to perform some kinds of tuning.
Follow these guidelines for optimization:
The -qipa option activates or customizes a class of optimizations known as interprocedural analysis. The -qipa option has several suboptions that are detailed in the compiler manual. It can be used in two ways:
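Whichever way you use it, applying interprocedural analysis can be as simple as adding the option to an optimized compile and link step, as in the following sketch (the program name is hypothetical; the available suboptions are detailed in the compiler manual):

xlf -O3 -qipa prog.f -o prog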
You gain the following benefits when you use compiler optimization:
Do not use the -O option for programs that you intend to debug with a symbolic debugger, regardless of whether you use the -g option. However, because optimization is so important to HPF programs, use -O3 -qhot for them even during debugging.
The optimizer rearranges assembler-language instructions, making it difficult to map individual instructions to a line of source code. If you compile with the -g option, this rearrangement may give the appearance that the source-level statements are executed in the wrong order when you use a symbolic debugger.
If your program produces incorrect results when it is compiled with any of the -O options, check your program for unintentionally aliased variables in procedure references.
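For example, the following sketch (hypothetical code, not taken from any real application) shows the kind of unintentional aliasing to look for: the same array is passed for two different dummy arguments, so the optimizer, which is entitled to assume the arguments do not overlap, may keep a value in a register and produce a different result than the unoptimized program.

      program sample
      real a(10)
      data a /10*1.0/
c     Passing A for both X and Y aliases the two dummy arguments,
c     which violates the FORTRAN argument-association rules.
      call update(a, a, 10)
      print *, a(10)
      end
      subroutine update(x, y, n)
      integer n, i
      real x(n), y(n)
c     The compiler may legally hoist the load of x(1) out of the loop;
c     if X and Y are really the same array, the optimized and
c     unoptimized versions then store different values into Y.
      do 10 i = 1, n
         y(i) = x(1) + 1.0
   10 continue
      end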
Systems can use several types of processors. By using the -qarch and -qtune options, you can optimize programs for the special instructions and particular strengths of these processors.
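For example, a build aimed at a specific processor might look like the following sketch (the -qarch and -qtune values shown are only illustrative; consult the compiler manual for the values that match your hardware):

xlf -O3 -qarch=ppc -qtune=604 prog.f -o prog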
Follow these guidelines for compiling for specific hardware platforms:
You can change some default floating-point options to enhance performance of floating-point intensive programs. Some of these options can affect conformance to floating-point standards. Using these options can change the results of computations, but in many cases the result is an increase in accuracy.
Follow these guidelines:
-qfloat=fltint:rsqrt:hssngl
If your single-precision program is not memory-intensive (for example, if it does not access more data than the available cache space), you can obtain equal or better performance, and greater precision, by using:
-qfloat=fltint:rsqrt -qautodbl=dblpad4
For programs that do not contain single-precision variables, use -qfloat=rsqrt:fltint only. Note that -O3 without -qstrict automatically sets -qfloat=rsqrt:fltint.
-qfloat=hssngl:fltint:rsqrt
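Whichever combination applies, these floating-point options are simply added to the normal optimized compile. For instance, the single-precision combination for programs that are not memory-intensive would be applied as in this sketch (the program name is hypothetical):

xlf -O3 -qfloat=fltint:rsqrt -qautodbl=dblpad4 prog.f -o prog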
If your program is intended to run exclusively on a single machine or configuration, you can help the compiler tune your program to the memory layout of that machine by using the FORTRAN -qcache option. You must also specify the -qhot option for -qcache to have any effect. The -qhot option uses the -qcache information to determine appropriate memory-management optimizations.
There are three types of cache: data, instruction, and combined. Models generally fall into two categories: those with both data and instruction caches, and those with a single, combined data/instruction cache. The TYPE suboption lets you identify which type of cache the -qcache option refers to.
The -qcache option can also be used to identify the size and set associativity of a model's level-2 cache and the Translation Lookaside Buffer (TLB), which is a table used to locate recently referenced pages of memory. In most cases, you do not need to specify the -qcache entry for a TLB unless your program uses more than 512 KB of data space.
There may be cases where a lower setting for the SIZE attribute gives enhanced performance, depending on the system load at the time of a run.
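As an illustration, a machine-specific build might combine the two options as in the following sketch (the cache geometry shown is invented for the example; substitute the values for your target machine, as given in the compiler manual):

xlf -O3 -qhot -qcache=type=d:level=1:size=64:line=128:assoc=4 prog.f -o prog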
Inlining involves copying referenced procedures into the code from which they are referenced. This eliminates the calling overhead for inlined routines and enables the optimizer to perform other optimizations in the inlined routines.
For FORTRAN and C programs, you can specify the -Q option (along with -O2 or -O3) to have procedures inlined into their reference points.
Inlining enhances performance in some programs, while it degrades performance in others. A program with inlining might slow down because of larger code size, resulting in more cache misses and page faults, or because there are not enough registers to hold all the local variables in some combined routines.
If you use the -Q option, always compare the performance of the version of your program compiled with -O3 and -Q against the version compiled with -O3 alone. Performance of programs compiled with -Q might improve dramatically, deteriorate dramatically, or change little or not at all.
The compiler decides whether to inline procedures based on their size. You might be able to enhance your application's performance by using other criteria for inlining. For procedures that are unlikely to be referenced in a typical execution (for example, error-handling and debugging procedures), disable inlining selectively by using the -Q-names option. For procedures that are referenced within hot spots, specify the -Q+names option to ensure that those procedures are always inlined.
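For example, a compilation that enables inlining but excludes rarely used routines and forces inlining of a hot-spot routine might look like the following sketch (the procedure names errproc, dbgdmp, and hotmul are hypothetical):

xlf -O3 -Q -Q-errproc:dbgdmp -Q+hotmul prog.f -o prog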
The operating system provides facilities for creating and using dynamically linked shared libraries. With dynamic linking, external symbols referenced in user code and defined in a shared library are resolved by the loader at load time. When you compile a program that uses shared libraries, they are dynamically linked to your program by default.
The idea behind shared libraries is to have only one copy of commonly used routines and to maintain this common copy in a unique shared-library segment. These common routines can significantly reduce the size of executable programs, thereby saving disk space.
You can reduce the size of your programs by using dynamic linking, but there is usually a tradeoff in performance. The shared library code is not present in the executable image on disk, but is kept in a separate library file. Shared code is loaded into memory once in the shared library segment and shared by all processes that reference it. Dynamically linked libraries therefore reduce the amount of virtual storage used by your program, provided that several concurrently running applications (or copies of the same application) use the procedures provided in the shared library. They also reduce the amount of disk space required for your program provided that several different applications stored on a given system share a library. Other advantages of shared libraries are as follows:
Disadvantages of dynamic linking include the following:
In statically-linked programs, all code is contained in a single executable module. Library references are more efficient because the library procedures are statically linked into the program. Static linking increases the file size of your program, and it may increase the code size in memory if other applications, or other copies of your application, are running on the system.
The cc command defaults to the shared-library option. To override the default when you compile your programs to create statically linked executable programs, use the -bnso option as follows:
cc xxx.c -o xxx.noshr -O -bnso -bI:/lib/syscalls.exp
This option forces the linker to place the library procedures your program references into the program's object file. The /lib/syscalls.exp file contains the names of system routines that must be imported to your program from the system. This file must be specified for static linking. The routines that it names are imported automatically by libc.a for dynamic linking, so you do not need to specify this file during dynamic linking. For further details on these options, see Appendix B. Efficient Use of the ld Command and the ld command.
One method of determining whether your application is sensitive to the shared-library approach is to recompile your executable program using the nonshared option. If the performance is significantly better, you may want to consider trading off the other advantages of shared libraries for the performance gain. Be sure to measure performance in an authentic environment, however. A program that had been bound nonshared might run faster as a single instance on a lightly loaded machine. That same program, when used by a number of users simultaneously, might increase real memory usage enough to slow down the whole workload.
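One simple way to make that comparison is to time both versions of the executable program under a representative workload, along these lines (a sketch reusing the xxx example above; xxx.shr here is the default dynamically linked build):

cc xxx.c -o xxx.shr -O
time xxx.shr
time xxx.noshr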
During the linkage phase of program compilation, the linker relocates program units in an attempt to improve locality of reference. For example, if a procedure references another procedure, the linker may make the procedures adjacent in the load module, so that both procedures fit into the same page of virtual memory. This can reduce paging overhead. When the first procedure is referenced for the first time and the page containing it is brought into real memory, the second procedure is ready for use without additional paging overhead.
In very large programs where paging occurs excessively for pages of your program's code, you may decide to impose a particular link order on the linker. You can do this by arranging control sections in the order you want them linked, and by using the -bnoobjreorder option to prevent the linker from reordering. A control section or CSECT is the smallest replaceable unit of code or data in an XCOFF object module. For further details, see the AIX 5L Version 5.1 Files Reference.
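For example, to keep two frequently executed procedures on the same page, you might list their object files adjacently and suppress reordering, as in this sketch (the object file names are hypothetical):

cc -o prog main.o hot1.o hot2.o cold_error.o -bnoobjreorder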
However, there are a number of risks involved in specifying a link order. Any link reordering should always be followed by thorough performance testing to demonstrate that your link order gives superior results for your program over the link order that the linker chooses. Take the following points into account before you decide to establish your own link order:
If you attempt to tune the link order of your programs, always test performance on a system where total real storage and memory utilization by other programs are similar to the anticipated working environment. A link order that works on a quiet system with few tasks running can cause page thrashing on a busier system.
The Basic Linear Algebra Subroutines (BLAS) provide a high level of performance for linear algebraic equations in matrix-matrix, matrix-vector, and vector-vector operations. The Engineering and Scientific Subroutine Library (ESSL) contains a more comprehensive set of subroutines, all of which are tuned for the POWER family, POWER2, and PowerPC architectures. The BLAS and ESSL subroutines can save you considerable effort in tuning many arithmetic operations, and still provide performance that is often better than that obtained by hand-tuning or by automatic optimization of hand-coded arithmetic operations. You can call functions from both libraries from FORTRAN, C, and C++ programs.
The BLAS library is a collection of Basic Linear Algebra Subroutines that have been highly tuned for the underlying architecture. The BLAS subset is provided with the operating system (/lib/libblas.a).
Use this library for your matrix and vector operations, because its routines are tuned to a degree that you are unlikely to achieve on your own.
The BLAS routines are designed to be called from FORTRAN programs, but they can also be used from C programs. Because of language differences, take care when referencing matrices; for example, FORTRAN stores arrays in column-major order, while C uses row-major order.
To include the BLAS library, which exists in /lib/libblas.a, use the -lblas option on the compiler statement (xlf -O prog.f -lblas). If calling BLAS from a C program, also include the -lxlf option for the FORTRAN library (cc -O prog.c -lblas -lxlf).
ESSL is a more advanced library that includes a variety of mathematical functions used in the areas of engineering, chemistry and physics.
Advantages to using the BLAS or ESSL subroutines are as follows:
In an example program, the following nine lines of FORTRAN code:
do i=1,control
  do j=1,control
    xmult=0.d0
    do k=1,control
      xmult=xmult+a(i,k)*a(k,j)
    end do
    b(i,j)=xmult
  end do
end do
were replaced by the following line of FORTRAN that calls a BLAS routine:
call dgemm ('n','n',control,control,control,1.d0,a,control,a,control,1.d0,b,control)
The following performance enhancement was observed:
Array Dimension | MULT Elapsed | BLAS Elapsed | Ratio (MULT/BLAS) |
101 x 101 | .1200 | .0500 | 2.40 |
201 x 201 | .8900 | .3700 | 2.41 |
301 x 301 | 16.4400 | 1.2300 | 13.37 |
401 x 401 | 65.3500 | 2.8700 | 22.77 |
501 x 501 | 170.4700 | 5.4100 | 31.51 |
This example demonstrates how a program that performs matrix multiplication can use a Level 3 BLAS routine to obtain enhanced performance. Note that the improvement increases as the array size increases.
Profile-directed feedback (PDF) is a compiler option that performs further procedure-level optimization, such as directing register allocation, instruction scheduling, and basic-block rearrangement. To use PDF, do the following:
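For illustration, the usual instrument-run-recompile cycle looks something like the following sketch (the program and data file names are hypothetical; the -qpdf1 and -qpdf2 options and the PDFDIR environment variable are described in the compiler manual):

xlf -O3 -qpdf1 prog.f -o prog
prog < typical.data
xlf -O3 -qpdf2 prog.f -o prog

The first compilation builds an instrumented program, the run with representative data records an execution profile in the PDFDIR directory, and the second compilation uses that profile to guide its optimizations.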
Two commands are available for managing the PDFDIR directory:
The fdpr command can rearrange the code within a compiled executable program to improve branching performance, move rarely used code away from program hot spots, and do other global optimizations. It works best for large programs with many conditional tests, or highly structured programs with multiple, sparsely placed procedures. The fdpr command is described in Restructuring Executable Programs with the fdpr Program.
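For reference, an fdpr run typically names the executable program and a training workload, along the lines of the following sketch (prog and test.sh are hypothetical; verify the exact flags against the section referenced above):

fdpr -p prog -R3 -x test.sh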