Performance Management Guide

Compiler Optimization Techniques

The three main areas of source-code tuning are as follows:

Programming techniques that take advantage of the optimizing compilers and the system srchitecture.
BLAS, a library of Basic Linear Algebra Subroutines. If you have a numerically intensive program, these subroutines can provide considerable performance enhancement. An extension of BLAS is ESSL, the Engineering Scientific Subroutine Library. In addition to a subset of the BLAS library, ESSL includes other high-performance mathematical routines for chemistry, engineering, and physics. A Parallel ESSL (PESSL) exists for SMP machines.
Compiler options and the use of preprocessors like KAP and VAST, available from third-party vendors.

In addition to these source-code tuning techniques, the fdpr program restructures object code. The fdpr program is described in Restructuring Executable Programs with the fdpr Program.

Compiling with Optimization (-O, -O2, -O3, -O4, -O5, qstrict, -qhot, -qipa)

To produce a program that achieves good performance, the first step is to take advantage of the basic optimization features built into the compiler. Doing so can increase the speedup that comes from tuning your program and can remove the need to perform some kinds of tuning.

Recommendations

Follow these guidelines for optimization:

Use -O2 or -O3 -qstrict for any production-level FORTRAN, C, or C++ program you compile. For High Performance FORTRAN (HPF) programs, do not use the -qstrict option.
Use the -qhot option for programs where the hotspots are loops or array language. Always use the -qhot option for HPF programs.
Use the -qipa option near the end of the development cycle if compilation time is not a major consideration.

The -qipa option activates or customizes a class of optimizations known as interprocedural analysis. The -qipa option has several suboptions that are detailed in the compiler manual. It can be used in two ways:

The first method is to compile with the -qipa option during both the compile and link steps. During compilation, the compiler stores interprocedural analysis information in the .o file. During linking, the -qipa option causes a complete recompilation of the entire application.
The second method is to compile the program for profiling with the -p/-pg option (with or without -qipa), and run it on a typical set of data. The resulting data can then be fed into subsequent compilations with -qipa so that the compiler concentrates optimization in the seconds of the program that are most frequently used.

Using -O4 is equivalent to using -O3 -qipa with automatic generation of architecture and tuning option ideal for that platform. Using the -O5 flag is similar to -O4 except that -qipa= level = 2.

You gain the following benefits when you use compiler optimization:

Branch optimization: Rearranges the program code to minimize branching logic and to combine physically separate blocks of code.
Code motion: If variables used in a computation within a loop are not altered within the loop, the calculation can be performed outside of the loop and the results used within the loop.
Common subexpression elimination: In common expressions, the same value is recalculated in a subsequent expression. The duplicate expression can be eliminated by using the previous value.
Constant propagation: Constants used in an expression are combined, and new ones are generated. Some implicit conversions between integers and floating-point types are done.
Dead code elimination: Eliminates code that cannot be reached or where the results are not subsequently used.
Dead store elimination: Eliminates stores when the value stored is never referenced again. For example, if two stores to the same location have no intervening load, the first store is unnecessary and is removed.
Global register allocation: Allocates variables and expressions to available hardware registers using a "graph coloring" algorithm.
Inlining: Replaces function calls with actual program code
Instruction scheduling: Reorders instructions to minimize execution time
Interprocedural analysis: Uncovers relationships across function calls, and eliminates loads, stores, and computations that cannot be eliminated with more straightforward optimizations.
Invariant IF code floating (Unswitching): Removes invariant branching code from loops to make more opportunity for other optimizations.
Profile driven feedback: Results from sample program execution are used to improve optimization near conditional branches and in frequently executed code sections.
Reassociation: Rearranges the sequence of calculations in an array subscript expression, producing more candidates for common expression elimination.
Store motion: Moves store instructions out of loops.
Strength Reduction: Replaces less efficient instructions with more efficient ones. For example, in array subscripting, an add instruction replaces a multiply instruction.
Value numbering: Involves constant propagation, expression elimination, and folding of several instructions into a single instruction.

When to Compile without Optimization

Do not use the -O option for programs that you intend to debug with a symbolic debugger, regardless of whether you use the -g option. However, because optimization is so important to HPF programs, use -O3 -qhot for them even during debugging.

The optimizer rearranges assembler-language instructions, making it difficult to map individual instructions to a line of source code. If you compile with the -g option, this rearrangement may give the appearance that the source-level statements are executed in the wrong order when you use a symbolic debugger.

If your program produces incorrect results when it is compiled with any of the -O options, check your program for unintentionally aliased variables in procedure references.

Compiling for Specific Hardware Platforms (-qarch, -qtune)

Systems can use several type of processors. By using the -qarch and -qtune options, you can optimize programs for the special instructions and particular strengths of these processors.

Recommendations

Follow these guidelines for compiling for specific hardware platforms:

If your program will be run only on a single system, or on a group of systems with the same processor type, use the -qarch option to specify the processor type.
If your program will be run on systems with different processor types, and you can identify one processor type as the most important, use the appropriate -qarch and -qtune settings. XL FORTRAN and XL HPF users can use the xxlf and xxlhpf commands to select these settings interactively.
If your program is intended to run on the full range of processor implementations, and is not intended primarily for one processor type, do not use either -qarch or -qtune.

Compiling for Floating-Point Performance (-qfloat)

You can change some default floating-point options to enhance performance of floating-point intensive programs. Some of these options can affect conformance to floating-point standards. Using these options can change the results of computations, but in many cases the result is an increase in accuracy.

Recommendations

Follow these guidelines:

For single-precision programs on POWER family and POWER2 platforms, you can enhance performance while preserving accuracy by using these floating-point options:
```
-qfloat=fltint:rsqrt:hssngl
```
If your single-precision program is not memory-intensive (for example, if it does not access more data than the available cache space), you can obtain equal or better performance, and greater precision, by using:
```
-qfloat=fltint:rsqrt -qautodbl=dblpad4
```
For programs that do not contain single-precision variables, use -qfloat=rsqrt:fltint only. Note that -O3 without -qstrict automatically sets -qfloat=rsqrt:fltint.
Single-precision programs are generally more efficient than double-precision programs, so promoting default REAL values to REAL(8) can reduce performance. Use the following -qfloat suboptions:

Specifying Cache Sizes (-qcache)

If your program is intended to run exclusively on a single machine or configuration, you can help the compiler tune your program to the memory layout of that machine by using the FORTRAN -qcache option. You must also specify the -qhot option for -qcache to have any effect. The -qhot option uses the -qcache information to determine appropriate memory-management optimizations.

There are three types of cache: data, instruction, and combined. Models generally fall into two categories: those with both data and instruction caches, and those with a single, combined data/instruction cache. The TYPE suboption lets you identify which type of cache the -qcache option refers to.

The -qcache option can also be used to identify the size and set associativity of a model's level-2 cache and the Translation Lookaside Buffer (TLB), which is a table used to locate recently referenced pages of memory. In most cases, you do not need to specify the -qcache entry for a TLB unless your program uses more than 512 KB of data space.

There may be cases where a lower setting for the SIZE attribute gives enhanced performance, depending on the system load at the time of a run.

Expanding Procedure Calls Inline (-Q)

Inlining involves copying referenced procedures into the code from which they are referenced. This eliminates the calling overhead for inlined routines and enables the optimizer to perform other optimizations in the inlined routines.

For FORTRAN and C programs, you can specify the -Q option (along with -O2 or -O3) to have procedures inlined into their reference points.

Inlining enhances performance in some programs, while it degrades performance in others. A program with inlining might slow down because of larger code size, resulting in more cache misses and page faults, or because there are not enough registers to hold all the local variables in some combined routines.

If you use the -Q option, always check the performance of the version of your program compiled with -O3 and -Q to that compiled only with -O3. Performance of programs compiled with -Q might improve dramatically, deteriorate dramatically, or change little or not at all.

The compiler decides whether to inline procedures based on their size. You might be able to enhance your application's performance by using other criteria for inlining. For procedures that are unlikely to be referenced in a typical execution (for example, error-handling and debugging procedures), disable inlining selectively by using the -Q-names option. For procedures that are referenced within hot spots, specify the -Q+names option to ensure that those procedures are always inlined.

When to Use Dynamic Linking and Static Linking

The operating system provides facilities for creating and using dynamically linked shared libraries. With dynamic linking, external symbols referenced in user code and defined in a shared library are resolved by the loader at load time. When you compile a program that uses shared libraries, they are dynamically linked to your program by default.

The idea behind shared libraries is to have only one copy of commonly used routines and to maintain this common copy in a unique shared-library segment. These common routines can significantly reduce the size of executable programs, thereby saving disk space.

You can reduce the size of your programs by using dynamic linking, but there is usually a tradeoff in performance. The shared library code is not present in the executable image on disk, but is kept in a separate library file. Shared code is loaded into memory once in the shared library segment and shared by all processes that reference it. Dynamically linked libraries therefore reduce the amount of virtual storage used by your program, provided that several concurrently running applications (or copies of the same application) use the procedures provided in the shared library. They also reduce the amount of disk space required for your program provided that several different applications stored on a given system share a library. Other advantages of shared libraries are as follows:

Load time might be reduced because the shared library code might already be in memory.
Run-time performance can be enhanced because the operating system is less likely to page out shared library code that is being used by several applications, or copies of an application, rather than code that is only being used by a single application. As a result, fewer page faults occur.
The routines are not statically bound to the application but are dynamically bound when the application is loaded. This permits applications to automatically inherit changes to the shared libraries, without recompiling or rebinding.

Disadvantages of dynamic linking include the following:

From a performance viewpoint, there is "glue code" that is required in the executable program to access the shared segment. There is a performance cost in references to shared library routines of about eight machine cycles per reference. Programs that use shared libraries are usually slower than those that use statically-linked libraries.
A more subtle effect is a reduction in "locality of reference." You may be interested in only a few of the routines in a library, and these routines may be scattered widely in the virtual address space of the library. Thus, the total number of pages you need to touch to access all of your routines is significantly higher than if these routines were all bound directly into your executable program. One impact of this situation is that, if you are the only user of these routines, you experience more page faults to get them all into real memory. In addition, because more pages are touched, there is a greater likelihood of causing an instruction translation lookaside buffer (TLB) miss.
When a program references a limited number of procedures in a library, each page of the library that contains a referenced procedure must be individually paged into real memory. If the procedures are small enough that using static linking might have linked several procedures that are in different library pages into a single page, then dynamic linking may increase paging thus decreasing performance.
Dynamically linked programs are dependent on having a compatible library. If a library is changed (for example, a new compiler release may change a library), applications might have to be reworked to be made compatible with the new version of the library. If a library is removed from the system, programs using that library will no longer work.

In statically-linked programs, all code is contained in a single executable module. Library references are more efficient because the library procedures are statically linked into the program. Static linking increases the file size of your program, and it may increase the code size in memory if other applications, or other copies of your application, are running on the system.

The cc command defaults to the shared-library option. To override the default, when you compile your programs to create statically-linked object files, use the -bnso option as follows:

cc xxx.c -o xxx.noshr -O -bnso -bI:/lib/syscalls.exp

This option forces the linker to place the library procedures your program references into the program's object file. The /lib/syscalIs.exp file contains the names of system routines that must be imported to your program from the system. This file must be specified for static linking. The routines that it names are imported automatically by libc.a for dynamic linking, so you do not need to specify this file during dynamic linking. For further details on these options, see Appendix B. Efficient Use of the ld Command and the Id command.

Determining If Nonshared Libraries Help Performance

One method of determining whether your application is sensitive to the shared-library approach is to recompile your executable program using the nonshare option. If the performance is significantly better, you may want to consider trading off the other advantages of shared libraries for the performance gain. Be sure to measure performance in an authentic environment, however. A program that had been bound nonshared might run faster as a single instance in a lightly loaded machine. That same program, when used by a number of users simultaneously, might increase real memory usage enough to slow down the whole workload.

Specifying the Link Order to Reduce Paging for Large Programs

During the linkage phase of program compilation, the linker relocates program units in an attempt to improve locality of reference. For example, if a procedure references another procedure, the linker may make the procedures adjacent in the load module, so that both procedures fit into the same page of virtual memory. This can reduce paging overhead. When the first procedure is referenced for the first time and the page containing it is brought into real memory, the second procedure is ready for use without additional paging overhead.

In very large programs where paging occurs excessively for pages of your program's code, you may decide to impose a particular link order on the linker. You can do this by arranging control sections in the order you want them linked, and by using the -bnoobjreorder option to prevent the linker from reordering. A control section or CSECT is the smallest replaceable unit of code or data in an XCOFF object module. For further details, see the AIX 5L Version 5.2 Files Reference.

However, there are a number of risks involved in specifying a link order. Any link reordering should always be followed by thorough performance testing to demonstrate that your link order gives superior results for your program over the link order that the linker chooses. Take the following points into account before you decide to establish your own link order:

You must determine the link order for all CSECTs in your program. The CSECTs must be presented to the linker in the order in which you want to link them. In a large program, such an ordering effort is considerable and prone to errors.
A performance benefit observed during development of a program can become a performance loss later on, because the changing code size can cause CSECTs that were previously located together in a page to be split into separate pages.
Reordering can change the frequency of instruction cache-line collisions. On implementations with an instruction cache or combined data and instruction cache that is two-way set-associative, any line of program code can only be stored in one of two lines of the cache. If three or more short, interdependent procedures have the same cache-congruence class, instruction-cache thrashing can reduce performance. Reordering can cause cache-line collisions where none occurred before. It can also eliminate cache-line collisions that occur when -bnoobjreorder is not specified.

If you attempt to tune the link order of your programs, always test performance on a system where total real storage and memory utilization by other programs are similar to the anticipated working environment. A link order that works on a quiet system with few tasks running can cause page thrashing on a busier system.

Calling the BLAS and ESSL Libraries

The Basic Linear Algebra Subroutines (BLAS) provide a high level of performance for linear algebraic equations in matrix-matrix, matrix-vector, and vector-vector operations. The Engineering and Scientific Subroutine Library (ESSL), contains a more comprehensive set of subroutines, all of which are tuned for the POWER family, POWER2, and PowerPC architecture. The BLAS and ESSL subroutines can save you considerable effort in tuning many arithmetic operations, and still provide performance that is often better than that obtained by hand-tuning or by automatic optimization of hand-coded arithmetic operations. You can call functions from both libraries from FORTRAN, C, and C++ programs.

The BLAS library is a collection of Basic Linear Algebra Subroutines that have been highly tuned for the underlying architecture. The BLAS subset is provided with the operating system (/lib/libblas.a).

Users should use this library for their matrix and vector operations, because they are tuned to a degree that users are unlikely to achieve on their own.

The BLAS routines are designed to be called from FORTRAN programs, but can be used with C programs. Care must be taken due to the language difference when referencing matrixes. For example, FORTRAN stores arrays in column major order, while C uses row major order.

To include the BLAS library, which exists in /lib/libblas.a, use the -lblas option on the compiler statement (xlf -O prog.f -lblas). If calling BLAS from a C program, also include the -lxlf option for the FORTRAN library (cc -O prog.c -lblas -lxlf).

ESSL is a more advanced library that includes a variety of mathematical functions used in the areas of engineering, chemistry and physics.

Advantages to using the BLAS or ESSL subroutines are as follows:

BLAS and ESSL subroutine calls are easier to code than the operations they replace.
BLAS and ESSL subroutines are portable across different platforms. The subroutine names and calling sequences are standardized.
BLAS code is likely to perform well on all platforms. The internal coding of the routines is usually platform-specific so that the code is closely tied to the architecture's performance characteristics.

In an example program, the following nine lines of FORTRAN code:

do l=1,control
do j=1,control
        xmult=0.d0
        do k=1,control
                xmult=xmult+a(i,k)*a(k,j)
        end do
        b(i,j)=xmult
end do
end do

were replaced by the following line of FORTRAN that calls a BLAS routine:

call dgemm (`n','n',control,control,control,1,d0,a, control,a,1control,1.d0,b,control)

The following performance enhancement was observed:

Array Dimension	MULT Elapsed	BLAS Elapsed	Ratio
101 x 101	.1200	.0500	2.40
201 x 201	.8900	.3700	2.41
301 x 301	16.4400	1.2300	13.37
401 x 401	65.3500	2.8700	22.77
501 x 501	170.4700	5.4100	31.51

This example demonstrates how a program using matrix multiplication operations could better use a level 3 BLAS routine for enhanced performance. Note that the improvement increases as the array size increases.

Profile Directed Feedback (PDF)

PDF is a compiler option to do further procedural level optimization such as directing register allocations, instruction scheduling, and basic block rearrangement. To use PDF, do the following:

Compile the source files in a program with -qpdf1 (the function main() must be compiled also). The -lpdf option is required during the link step. All the other compilation options used must also be used during step 3.
Run the program all the way through a typical data set. The program records profiling information when it exits into a file called .__BLOCKS in the directory specified by the PDFDIR environment variable or in the current working directory if that variable is not set. You can run the program multiple times with different data sets, and the profiling information is accumulated to provide an accurate count of how often branches are taken and blocks of code are executed. It is important to use data that is representative of the data used during a typical run of your finished program.
Recompile the program using the same compiler options as in step 1, but change -qpdf1 to -qpdf2. Remember that -L and -l are linker options, and you can change them at this point; in particular, omit the -lpdf option. In this second compilation, the accumulated profiling information is used to fine-tune the optimizations. The resulting program contains no profiling overhead and runs at full speed.

Two commands are available for managing the PDFDIR directory:

resetpdf pathname: Clears all profiling information (but does not remove the data files) from the pathname directory. If pathname is not specified, from the PDFDIR directory; or if PDFDIR is not set, from the current directory. When you make changes to the application and recompile some files, the profiling information for these files is automatically reset. Run the resetpdf command to reset the profiling information for the entire application, after making significant changes that may affect execution counts for parts of the program that were not recompiled.
cleanpdf pathname: Removes all profiling information from the pathname or PDFDIR or current directory. Removing the profile information reduces the run-time overhead if you change the program and then go through the PDF process again. Run this program after compiling with -qpdf2.

The fdpr Command

The fdpr command can rearrange the code within a compiled executable program to improve branching performance, move rarely used code away from program hot spots, and do other global optimizations. It works best for large programs with many conditional tests, or highly structured programs with multiple, sparsely placed procedures. The fdpr command is described in Restructuring Executable Programs with the fdpr Program.