C++ OPTIMIZATION WITH AIX 4.1.X
ITEM: RTA000052828
QUESTION:
A question came up from a developer on C Set ++ for AIX 4.1.X (5765-
421). I know that the compiler will now optimize the C for a per-
formance improvement, but on average, what percentage improvement can
you expect (what have we seen)?
Another question is, can you give me an idea of how well C++ assists
in decomposing the application into multi-threads (to take advantage
of our SMP)? The fdpr tool that's packaged with IBM Performance Aide
2.1 (5696-899) has a performance optimizer that can greatly improve
things, but does it address SMP optimization and programming structure
assistance? Also, how does C++ do without the toolbox and
optimizer? What have you seen to give the best results,
both uniprocessor and SMP?
---------- ---------- ---------- --------- ---------- ----------
A: 1) Q: I know that the compiler will now optimize the C for a
performance improvement...
A: This has always been the case. The RISC System/6000 CPU
development team worked closely with the compiler
development team; thus, the compiler developers were
able to communicate what they needed to the people
developing the hardware. Normally, companies work in
the reverse order: they develop compiler backends to
optimize code for a given architecture. In contrast,
IBM, by and large, developed a highly optimizing compiler
first and then developed hardware that implemented the
resulting machine code. The result is one of the best
optimizing compilers in the world.
2) Q: ...on average, what percentage improvement can you expect?
A: Percentage improvement over what? If you are asking what
percentage improvement does the new compiler offer over
the AIX 3.2.5 compiler, then I must inform you that this
information is not available.
If you would like to know how much faster optimized C
code runs on AIX 4.1 as compared to unoptimized C code,
then this is highly dependent on your application.
Some applications may see little benefit from
optimization, while others may run 5 times faster with
optimization.
IBM has not published benchmarks comparing the performance
of C programs with and without optimization, but
I executed a small test on an AIX 4.1 machine to measure
the effect of the optimizer on numerical applications.
I timed the calculation of the dot product of two
500,000-element double precision vectors. This operation
is one of the cornerstones of numerical computation
(numerical computation is one of the primary clients of
high performance computing). I wrote the application with
no by-hand optimization and compiled it with and without
the -O3 optimization flag. The optimized version ran
nearly twice as fast as the unoptimized version, cutting
the run time by about 45 percent.
I then added some simple optimizations by hand to the
dot product operation. Even so, the code produced by the
-O3 option was 33 percent faster than my hand-optimized
code.
I would like to repeat the following points for
emphasis:
- It is not sensible to speak of an average performance
improvement from optimization across all programs. One
could compute average improvements for classes of
programs, such as numerically intensive programs or
database programs, but an average across all of these
application classes would not be a useful number, since
a typical customer runs a large proportion of
applications from a single class.
- IBM has not published the information you need. The
test that I ran may not even prove to be typical of
the class of numerical applications. If you must
have more reliable information, you can either find
copies of benchmark source code (for database-class
benchmarking I would recommend the TPC benchmarks,
and for numerical-class benchmarking the LINPACK or
SPECfp benchmarks) or contact IBM's benchmarking
center to set up a contract to perform these
benchmarks for you. If you would like to set up such
a contract, then please respond back.
3) Q: How well does C++ assist in decomposing applications
into multi-threads (to take advantage of our SMP)?
A: The C++ compiler does not assist in decomposing
applications into multiple threads. The compiler will not
generate threaded machine code from a nonthreaded C++
program; any decomposition into threads must be coded
explicitly by the programmer.
4) Q: Does the FDPR tool address SMP optimization and
programming structure assistance?
A: No, FDPR does not address SMP optimization. All
optimizers provide programming structure assistance in
the sense that they restructure your code. Optimizers
attempt to separate instructions that depend on each
other so that independent instructions can execute in
parallel within a single CPU. They will also unroll
loops, instantiate small functions inline, delete
unnecessary code, and combine multiple operations into
a single instruction (for example, RISC System compilers
can combine a multiply operation and an add operation
into a single multiply-add machine instruction).
FDPR's advantage over regular optimization is, among
other things, the ability to reorder instructions
globally. Normal optimizers may reorder the execution
of instructions within a single function; FDPR can
determine whether it is beneficial to reorder
instructions which lie on opposite sides of function
boundaries. This is a full order of magnitude more
difficult than normal optimization.
5) Q: How does C++ do without the toolbox and optimizer?
A: I am not certain of what information you require here.
I assume that you would like a comparison between code
compiled with and without the FDPR optimizer.
IBM's published comparison is as follows:
- FDPR improves performance up to 73 percent (typically
10 - 20 percent).
- FDPR reduces text memory requirements up to 61
percent (typically 20 - 30 percent).
I realize that I may have misinterpreted this question.
If so, please respond back with a clarification. If I
have indeed misinterpreted this question, then I am
sorry for the resulting delay.
6) Q: What have you seen to give the best results, both
uniprocessor and SMP?
A: An SMP will run a single process faster if its code is
threaded, because the SMP allocates work at thread
granularity. Different processes, and different threads
of the same process, can run on different processors
simultaneously. Thus, threading a process will make it
run faster on an SMP machine (so long as the threads
spend most of their time executing simultaneously).
Similarly, splitting a single-process application into
multiple processes will make it run faster on an SMP
machine (so long as the processes spend most of their
time executing simultaneously).
Since our thread support only goes so far as to provide
the ability to develop threaded code, all other issues
relate both to uniprocessors and multiprocessors.
The "-O3" option is outstanding. I have found that I
cannot develop code by hand that is as efficient as the
code output by the compiler with this option.
The "-qarch=" option also helps significantly in many
situations, especially on PowerPC architectures. Use
"-qarch=ppc" for PowerPCs, "-qarch=pwr" for POWER
architecture CPUs, and "-qarch=pwr2" for POWER2s. Code
compiled with this option will not run on all RISC
Systems - it will only run on a machine with the target
architecture (actually, "-qarch=pwr" will run on both
POWER and POWER2).
I have not used the FDPR tool. The documentation claims
that significant performance enhancements are possible,
so if you are already considering it, I would recommend
it for applications that must have maximal performance.
---------- ---------- ---------- --------- ---------- ----------
This item was created from library item Q676519 FFJFB
Additional search words:
AIX ALTERNATE COMPILERS FFJFB INDEX IX JAN95 OPTIMIZATION OPTIMIZE
OZNEW RISCL RISCSYSTEM SOFTWARE 4.1.X
WWQA: ITEM: RTA000052828
Dated: 06/1996 Category: RISCL