
Performance Management Guide


Compiler Optimization Techniques

The three main areas of source-code tuning are as follows:

In addition to these source-code tuning techniques, the fdpr program restructures object code. The fdpr program is described in Restructuring Executable Programs with the fdpr Program.

Compiling with Optimization (-O, -O2, -O3, -qstrict, -qhot, -qipa)

To produce a program that achieves good performance, the first step is to take advantage of the basic optimization features built into the compiler. Doing so can increase the speedup that comes from tuning your program and can remove the need to perform some kinds of tuning.

Recommendations

Follow these guidelines for optimization:

The -qipa option activates or customizes a class of optimizations known as interprocedural analysis. The -qipa option has several suboptions that are detailed in the compiler manual. It can be used in two ways:

You gain the following benefits when you use compiler optimization:

Branch optimization
Rearranges the program code to minimize branching logic and to combine physically separate blocks of code.

Code motion
If variables used in a computation within a loop are not altered within the loop, the calculation can be performed outside of the loop and the results used within the loop.

Common subexpression elimination
In common expressions, the same value is recalculated in a subsequent expression. The duplicate expression can be eliminated by using the previous value.

Constant propagation
Constants used in an expression are combined, and new ones are generated. Some implicit conversions between integers and floating-point types are done.

Dead code elimination
Eliminates code that cannot be reached or where the results are not subsequently used.

Dead store elimination
Eliminates stores when the value stored is never referenced again. For example, if two stores to the same location have no intervening load, the first store is unnecessary and is removed.

Global register allocation
Allocates variables and expressions to available hardware registers using a "graph coloring" algorithm.

Inlining
Replaces function calls with actual program code.

Instruction scheduling
Reorders instructions to minimize execution time.

Interprocedural analysis
Uncovers relationships across function calls, and eliminates loads, stores, and computations that cannot be eliminated with more straightforward optimizations.

Invariant IF code floating (Unswitching)
Removes invariant branching code from loops to make more opportunity for other optimizations.

Profile driven feedback
Results from sample program execution are used to improve optimization near conditional branches and in frequently executed code sections.

Reassociation
Rearranges the sequence of calculations in an array subscript expression, producing more candidates for common expression elimination.

Store motion
Moves store instructions out of loops.

Strength reduction
Replaces less efficient instructions with more efficient ones. For example, in array subscripting, an add instruction replaces a multiply instruction.

Value numbering
Involves constant propagation, expression elimination, and folding of several instructions into a single instruction.
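
As a rough illustration of three of the transformations listed above (code motion, common subexpression elimination, and strength reduction), the following C sketch shows a loop as it might be written by hand and an equivalent version transformed by hand. The function names are hypothetical, and what a particular compiler actually generates depends on the options and optimization level used.

```c
#include <assert.h>
#include <stddef.h>

/* As written: scale * offset is recomputed twice per iteration, and the
   subscript i * stride needs a multiply on every iteration. */
long sum_naive(const long *a, size_t n, size_t stride, long scale, long offset)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += a[i * stride] * (scale * offset) + (scale * offset);
    }
    return total;
}

/* Hand-transformed equivalent of what an optimizer might produce. */
long sum_transformed(const long *a, size_t n, size_t stride, long scale, long offset)
{
    assert(a != NULL || n == 0);
    /* Code motion and common subexpression elimination: the invariant
       product is computed once, outside the loop. */
    const long k = scale * offset;
    long total = 0;
    size_t idx = 0;          /* strength reduction: running subscript */
    for (size_t i = 0; i < n; i++) {
        total += a[idx] * k + k;
        idx += stride;       /* an add replaces the multiply i * stride */
    }
    return total;
}
```

Both functions return the same value for the same arguments; the second simply makes explicit the work that the optimizer can remove from the loop.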

When to Compile without Optimization

Do not use the -O option for programs that you intend to debug with a symbolic debugger, regardless of whether you use the -g option. However, because optimization is so important to HPF programs, use -O3 -qhot for them even during debugging.

The optimizer rearranges assembler-language instructions, making it difficult to map individual instructions to a line of source code. If you compile with the -g option, this rearrangement may give the appearance that the source-level statements are executed in the wrong order when you use a symbolic debugger.

If your program produces incorrect results when it is compiled with any of the -O options, check your program for unintentionally aliased variables in procedure references.
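
The following C sketch suggests why unintended aliasing matters. The function name add_twice is hypothetical, and the sketch is not output captured from any compiler. In C this kind of aliasing is legal, so the compiler must reload *b after the first store. FORTRAN rules allow the optimizer to assume that a dummy argument that is modified does not overlap another argument, so an optimized build might keep the first-loaded value in a register and produce a different result for the aliased call than an unoptimized build.

```c
#include <assert.h>
#include <stddef.h>

/* Adds *b to *a twice. If a and b refer to different variables, the
   result is a + 2*b. If they refer to the same variable, the second
   add sees the updated value and the result is 4 times the original. */
void add_twice(int *a, const int *b)
{
    assert(a != NULL && b != NULL);
    *a += *b;
    *a += *b;   /* *b must be reloaded here because a and b may alias */
}
```

Calling add_twice(&x, &y) with x and y both 1 leaves x equal to 3, while add_twice(&z, &z) with z equal to 1 leaves z equal to 4. An optimizer that assumed no overlap and loaded the second operand only once would compute 3 in both cases.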

Compiling for Specific Hardware Platforms (-qarch, -qtune)

Systems can use several types of processors. By using the -qarch and -qtune options, you can optimize programs for the special instructions and particular strengths of these processors.

Recommendations

Follow these guidelines for compiling for specific hardware platforms:

Compiling for Floating-Point Performance (-qfloat)

You can change some default floating-point options to enhance performance of floating-point intensive programs. Some of these options can affect conformance to floating-point standards. Using these options can change the results of computations, but in many cases the result is an increase in accuracy.
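
One reason results can change is that floating-point addition is not associative, so any option that lets the compiler reorder operations can alter the rounding. The C sketch below is only an illustration of that effect; the function names are hypothetical, and it does not model any specific -qfloat suboption.

```c
#include <assert.h>
#include <stddef.h>

/* Sums v[0..n-1] from the first element to the last. */
double sum_forward(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Sums the same elements from the last to the first. */
double sum_backward(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double s = 0.0;
    for (size_t i = n; i > 0; i--)
        s += v[i - 1];
    return s;
}
```

For the three values 1e20, -1e20, and 1.0, the forward sum cancels the large terms first and returns 1.0, while the backward sum absorbs the 1.0 into -1e20 and returns 0.0. Neither order is wrong under the rules of floating-point arithmetic; they simply round differently.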

Recommendations

Follow these guidelines:

Specifying Cache Sizes (-qcache)

If your program is intended to run exclusively on a single machine or configuration, you can help the compiler tune your program to the memory layout of that machine by using the FORTRAN -qcache option. You must also specify the -qhot option for -qcache to have any effect. The -qhot option uses the -qcache information to determine appropriate memory-management optimizations.

There are three types of cache: data, instruction, and combined. Models generally fall into two categories: those with both data and instruction caches, and those with a single, combined data/instruction cache. The TYPE suboption lets you identify which type of cache the -qcache option refers to.

The -qcache option can also be used to identify the size and set associativity of a model's level-2 cache and the Translation Lookaside Buffer (TLB), which is a table used to locate recently referenced pages of memory. In most cases, you do not need to specify the -qcache entry for a TLB unless your program uses more than 512 KB of data space.
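
One way to see where a threshold of this kind can come from is the amount of memory that a TLB can map at one time, which is the number of TLB entries multiplied by the page size. The C sketch below computes that figure. The function name is hypothetical, and the 128-entry, 4 KB-page values in the usage note are illustrative assumptions, not figures taken from this guide.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of bytes that a TLB with the given number of
   entries can map at once, assuming one page per entry. */
size_t tlb_reach_bytes(size_t entries, size_t page_size)
{
    assert(page_size > 0);
    return entries * page_size;
}
```

With the assumed values, tlb_reach_bytes(128, 4096) is 524288 bytes, or 512 KB. A program whose active data exceeds the reach of the TLB cannot keep all of its pages mapped at once, which is when describing the TLB to the compiler is more likely to matter.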

There may be cases where a lower setting for the SIZE attribute gives enhanced performance, depending on the system load at the time of a run.

Expanding Procedure Calls Inline (-Q)

Inlining involves copying referenced procedures into the code from which they are referenced. This eliminates the calling overhead for inlined routines and enables the optimizer to perform other optimizations in the inlined routines.
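
The effect can be pictured with a small C sketch. The function names are hypothetical, and the second function shows by hand what the compiler might produce internally; it is not a required coding style.

```c
#include <assert.h>
#include <stddef.h>

/* Small routine that is a typical candidate for inlining. */
static double scale(double x, double factor)
{
    return x * factor;
}

/* Before inlining: one procedure call per element. */
double total_with_calls(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += scale(v[i], 2.0);
    return total;
}

/* After inlining by hand: the call overhead is gone, and the constant
   factor is visible to the optimizer inside the loop. */
double total_inlined(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += v[i] * 2.0;
    return total;
}
```

Both functions return the same value. The inlined form is larger if scale is referenced from many places, which is the code-size cost that can offset the gain.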

For FORTRAN and C programs, you can specify the -Q option (along with -O2 or -O3) to have procedures inlined into their reference points.

Inlining enhances performance in some programs, while it degrades performance in others. A program with inlining might slow down because of larger code size, resulting in more cache misses and page faults, or because there are not enough registers to hold all the local variables in some combined routines.

If you use the -Q option, always compare the performance of the version of your program compiled with -O3 and -Q with that of the version compiled only with -O3. Performance of programs compiled with -Q might improve dramatically, deteriorate dramatically, or change little or not at all.

The compiler decides whether to inline procedures based on their size. You might be able to enhance your application's performance by using other criteria for inlining. For procedures that are unlikely to be referenced in a typical execution (for example, error-handling and debugging procedures), disable inlining selectively by using the -Q-names option. For procedures that are referenced within hot spots, specify the -Q+names option to ensure that those procedures are always inlined.

When to Use Dynamic Linking and Static Linking

The operating system provides facilities for creating and using dynamically linked shared libraries. With dynamic linking, external symbols referenced in user code and defined in a shared library are resolved by the loader at load time. When you compile a program that uses shared libraries, they are dynamically linked to your program by default.

The idea behind shared libraries is to have only one copy of commonly used routines and to maintain this common copy in a unique shared-library segment. These common routines can significantly reduce the size of executable programs, thereby saving disk space.

You can reduce the size of your programs by using dynamic linking, but there is usually a tradeoff in performance. The shared library code is not present in the executable image on disk, but is kept in a separate library file. Shared code is loaded into memory once in the shared library segment and shared by all processes that reference it. Dynamically linked libraries therefore reduce the amount of virtual storage used by your program, provided that several concurrently running applications (or copies of the same application) use the procedures provided in the shared library. They also reduce the amount of disk space required for your program provided that several different applications stored on a given system share a library. Other advantages of shared libraries are as follows:

Disadvantages of dynamic linking include the following:

In statically-linked programs, all code is contained in a single executable module. Library references are more efficient because the library procedures are statically linked into the program. Static linking increases the file size of your program, and it may increase the code size in memory if other applications, or other copies of your application, are running on the system.

The cc command defaults to the shared-library option. To override the default, when you compile your programs to create statically-linked object files, use the -bnso option as follows:

cc xxx.c -o xxx.noshr -O -bnso -bI:/lib/syscalls.exp

This option forces the linker to place the library procedures your program references into the program's object file. The /lib/syscalls.exp file contains the names of system routines that must be imported to your program from the system. This file must be specified for static linking. The routines that it names are imported automatically by libc.a for dynamic linking, so you do not need to specify this file during dynamic linking. For further details on these options, see Appendix B. Efficient Use of the ld Command and the ld command.

Determining If Nonshared Libraries Help Performance

One method of determining whether your application is sensitive to the shared-library approach is to recompile your executable program using the nonshared (-bnso) option. If the performance is significantly better, you may want to consider trading off the other advantages of shared libraries for the performance gain. Be sure to measure performance in an authentic environment, however. A program that has been bound nonshared might run faster as a single instance on a lightly loaded machine. That same program, when used by a number of users simultaneously, might increase real memory usage enough to slow down the whole workload.

Specifying the Link Order to Reduce Paging for Large Programs

During the linkage phase of program compilation, the linker relocates program units in an attempt to improve locality of reference. For example, if a procedure references another procedure, the linker may make the procedures adjacent in the load module, so that both procedures fit into the same page of virtual memory. This can reduce paging overhead. When the first procedure is referenced for the first time and the page containing it is brought into real memory, the second procedure is ready for use without additional paging overhead.

In very large programs where paging occurs excessively for pages of your program's code, you may decide to impose a particular link order on the linker. You can do this by arranging control sections in the order you want them linked, and by using the -bnoobjreorder option to prevent the linker from reordering. A control section or CSECT is the smallest replaceable unit of code or data in an XCOFF object module. For further details, see the AIX 5L Version 5.1 Files Reference.

However, there are a number of risks involved in specifying a link order. Any link reordering should always be followed by thorough performance testing to demonstrate that your link order gives superior results for your program over the link order that the linker chooses. Take the following points into account before you decide to establish your own link order:

If you attempt to tune the link order of your programs, always test performance on a system where total real storage and memory utilization by other programs are similar to the anticipated working environment. A link order that works on a quiet system with few tasks running can cause page thrashing on a busier system.

Calling the BLAS and ESSL Libraries

The Basic Linear Algebra Subroutines (BLAS) provide a high level of performance for linear algebraic equations in matrix-matrix, matrix-vector, and vector-vector operations. The Engineering and Scientific Subroutine Library (ESSL) contains a more comprehensive set of subroutines, all of which are tuned for the POWER family, POWER2, and PowerPC architectures. The BLAS and ESSL subroutines can save you considerable effort in tuning many arithmetic operations, and still provide performance that is often better than that obtained by hand-tuning or by automatic optimization of hand-coded arithmetic operations. You can call functions from both libraries from FORTRAN, C, and C++ programs.

The BLAS library is a collection of Basic Linear Algebra Subroutines that have been highly tuned for the underlying architecture. The BLAS subset is provided with the operating system (/lib/libblas.a).

Use this library for matrix and vector operations, because its routines are tuned to a degree that users are unlikely to achieve on their own.

The BLAS routines are designed to be called from FORTRAN programs, but can be used with C programs. Take care when referencing matrices, because the two languages lay out arrays differently: FORTRAN stores arrays in column-major order, while C uses row-major order.
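
The difference in layout can be made concrete with a small C helper that reads an element the way a FORTRAN routine would. This is a minimal sketch: the name fortran_elem is hypothetical and is not part of BLAS or ESSL.

```c
#include <assert.h>
#include <stddef.h>

/* Returns FORTRAN element a(i,j), with 1-based subscripts, from a flat
   array stored in column-major order. lda is the leading dimension,
   that is, the number of rows in the FORTRAN declaration. */
double fortran_elem(const double *flat, int lda, int i, int j)
{
    assert(flat != NULL && i >= 1 && i <= lda && j >= 1);
    return flat[(i - 1) + (j - 1) * lda];
}
```

A C array declared as double c[2][3] = {{1,2,3},{4,5,6}} is laid out in memory as 1, 2, 3, 4, 5, 6. A FORTRAN routine that receives that storage as a(3,2) sees a(2,1) equal to 2 and a(1,2) equal to 4, so a(i,j) corresponds to c[j-1][i-1]. In other words, the FORTRAN routine sees the transpose of the C matrix.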

To include the BLAS library, which exists in /lib/libblas.a, use the -lblas option on the compiler statement (xlf -O prog.f -lblas). If calling BLAS from a C program, also include the -lxlf option for the FORTRAN library (cc -O prog.c -lblas -lxlf).

ESSL is a more advanced library that includes a variety of mathematical functions used in the areas of engineering, chemistry and physics.

Advantages to using the BLAS or ESSL subroutines are as follows:

In an example program, the following nine lines of FORTRAN code:

do i=1,control
do j=1,control
        xmult=0.d0
        do k=1,control
                xmult=xmult+a(i,k)*a(k,j)
        end do
        b(i,j)=xmult
end do
end do

were replaced by the following line of FORTRAN that calls a BLAS routine:

call dgemm ('n','n',control,control,control,1.d0,a,control,a,control,1.d0,b,control)

The following performance enhancement was observed:

Array Dimension    MULT Elapsed    BLAS Elapsed    Ratio
101 x 101                0.1200          0.0500     2.40
201 x 201                0.8900          0.3700     2.41
301 x 301               16.4400          1.2300    13.37
401 x 401               65.3500          2.8700    22.77
501 x 501              170.4700          5.4100    31.51

This example demonstrates how a program using matrix multiplication operations could better use a level 3 BLAS routine for enhanced performance. Note that the improvement increases as the array size increases.

Profile Directed Feedback (PDF)

Profile-directed feedback (PDF) is a compiler option that uses profiling data gathered from sample runs to guide further procedure-level optimizations, such as register allocation, instruction scheduling, and basic-block rearrangement. To use PDF, do the following:

  1. Compile the source files in a program with -qpdf1 (the function main() must be compiled also). The -lpdf option is required during the link step. All the other compilation options used must also be used during step 3.
  2. Run the program all the way through a typical data set. The program records profiling information when it exits into a file called .__BLOCKS in the directory specified by the PDFDIR environment variable or in the current working directory if that variable is not set. You can run the program multiple times with different data sets, and the profiling information is accumulated to provide an accurate count of how often branches are taken and blocks of code are executed. It is important to use data that is representative of the data used during a typical run of your finished program.
  3. Recompile the program using the same compiler options as in step 1, but change -qpdf1 to -qpdf2. Remember that -L and -l are linker options, and you can change them at this point; in particular, omit the -lpdf option. In this second compilation, the accumulated profiling information is used to fine-tune the optimizations. The resulting program contains no profiling overhead and runs at full speed.

Two commands are available for managing the PDFDIR directory:

resetpdf pathname
Clears all profiling information (but does not remove the data files) from the pathname directory. If pathname is not specified, the command clears the PDFDIR directory, or the current directory if PDFDIR is not set. When you make changes to the application and recompile some files, the profiling information for these files is automatically reset. Run the resetpdf command to reset the profiling information for the entire application after making significant changes that may affect execution counts for parts of the program that were not recompiled.

cleanpdf pathname
Removes all profiling information from the pathname directory. If pathname is not specified, the command uses the PDFDIR directory, or the current directory if PDFDIR is not set. Removing the profile information reduces the run-time overhead if you change the program and then go through the PDF process again. Run this command after compiling with -qpdf2.

The fdpr Command

The fdpr command can rearrange the code within a compiled executable program to improve branching performance, move rarely used code away from program hot spots, and do other global optimizations. It works best for large programs with many conditional tests, or highly structured programs with multiple, sparsely placed procedures. The fdpr command is described in Restructuring Executable Programs with the fdpr Program.

