C++ OPTIMIZATION WITH AIX 4.1.X

ITEM: RTA000052828



QUESTION:                                                                       
A question came up from a developer on C Set ++ for AIX 4.1.X
(5765-421).  I know that the compiler will now optimize the C for a
performance improvement, but on average, what percentage improvement
can you expect (what have we seen)?
                                                                                
Another question is, can you give me an idea of how well C++ assists            
in decomposing the application into multi-threads (to take advantage            
of our SMP)?  The fdpr tool that's packaged with IBM Performance Aide           
2.1 (5696-899) has a performance optimizer that can greatly improve             
things, but does it address SMP optimization and programming structure          
assistance?  Also, how does C++ do without the toolbox and optimizer?

What have you seen to give the best results, both uni and SMP?
                                                                               
---------- ---------- ---------- --------- ---------- ----------                
A: 1) Q: I know that the compiler will now optimize the C for a                 
         performance improvement...                                             
      A: This has always been the case.  The RISC System/6000 CPU               
         development team worked closely with the compiler                      
         development team; thus, the compiler developers were                   
         able to communicate what they needed to the people                     
         developing the hardware.  Normally, companies work in
         the reverse order: they develop compiler back ends to
         optimize code for a given architecture.  In contrast,
         IBM, by and large, developed a highly optimizing compiler
         and then developed hardware that implemented the
         resulting machine code.  The result is one of the best
         optimizing compilers in the world.                                     
                                                                               
   2) Q: ...on average, what percentage improvement can you expect?             
      A: Percentage improvement over what?  If you are asking what              
         percentage improvement does the new compiler offer over                
         the AIX 3.2.5 compiler, then I must inform you that this               
         information is not available.                                          
                                                                                
         If you would like to know how much faster optimized C                  
         code runs on AIX 4.1 as compared to unoptimized C code,                
         then this is highly dependent on your application.                     
         Some applications may see little benefit from                          
         optimization, while others may run 5 times faster with                 
         optimization.                                                          
                                                                                
         IBM has not published benchmarks comparing performance                 
         of C programs with and without optimization, but                      
         I executed a small test on an AIX 4.1 machine to reflect               
         the effect of the optimizer on numerical applications.                 
         I tested the amount of time it takes to calculate the                  
         dot product of two 500,000-element double-precision
         vectors.  This operation is one of the cornerstones of                 
         numerical computation (numerical computation is one of                 
         the primary clients of high performance computing).  I                 
         wrote the application with no by-hand optimization, and                
         I compiled it with and without the -O3 optimization                    
         flag.  The optimized version took about 45 percent less
         time than the unoptimized version (that is, it ran
         nearly twice as fast).
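
         As an illustration only (the original test program is
         not reproduced in this item), a dot product written
         with no by-hand optimization might look like the C
         sketch below.  The names dot_plain and time_dot_plain
         are hypothetical, and the timing helper simply uses the
         standard clock() function.

```c
#include <stddef.h>
#include <time.h>

/* Straightforward dot product with no hand optimization. */
double dot_plain(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/*
 * Hypothetical timing helper: calls dot_plain() `reps` times and
 * returns the CPU seconds used.  The accumulated result is stored
 * through `result` so the calls cannot be discarded as unused.
 */
double time_dot_plain(const double *x, const double *y, size_t n,
                      int reps, double *result)
{
    clock_t start = clock();
    double acc = 0.0;
    int k;

    for (k = 0; k < reps; k++)
        acc += dot_plain(x, y, n);
    *result = acc;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

         Building the same source once with and once without the
         -O3 flag and comparing the reported times gives a rough
         measurement of the kind described above; the figures
         will vary with the machine and the data.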
                                                                                
         I then added some simple optimizations by hand to the
         dot product operation.  The code produced by the -O3
         option was 33 percent faster than my hand-optimized code.
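
         This item does not say which hand optimizations were
         applied.  One common choice, assumed here purely for
         illustration, is to unroll the loop by four and keep
         four independent partial sums so that consecutive
         multiply and add operations do not depend on each
         other.  The name dot_unrolled is hypothetical.

```c
#include <stddef.h>

/*
 * Dot product unrolled by four with independent partial sums
 * (an assumed example of a "simple optimization by hand").
 */
double dot_unrolled(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    /* Remainder loop: the last n % 4 elements. */
    for (; i < n; i++)
        s0 += x[i] * y[i];

    return (s0 + s1) + (s2 + s3);
}
```

         Because the additions are performed in a different
         order, the result can differ from the simple loop in
         the last few bits for general floating-point data.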
                                                                                
         I would like to repeat the following points for                        
         emphasis:                                                              
                                                                                
           - It is not sensible to speak of average performance                 
             improvements resulting from optimized code.  We can                
             develop some sort of average improvements to be                    
             expected by classes of programs, like numerically                  
             intensive programs or database programs.  An average               
             of all these application classes would not provide a               
             useful number since a typical customer runs a large                
             proportion of applications from a single class.                    
                                                                                
           - IBM has not published the information you need.  The
             test that I ran may not even be typical of the class
             of numerical applications.  If you must have more
             reliable information, you can either obtain copies of
             benchmark source code (for database-class
             benchmarking I would recommend the TPC benchmarks,
             and for numerical-class benchmarking I would
             recommend the LINPACK or SPECfp benchmarks) or
             contact IBM's benchmarking center to set up a
             contract to perform these benchmarks for you.  If
             you would like to set up such a contract, then
             please reply.
                                                                                
   3) Q: How well does C++ assist in decomposing applications                   
         into multi-threads (to take advantage of our SMP)?                     
                                                                                
      A: The C++ compiler does not assist in decomposing
         applications into multiple threads.  The compiler will
         not generate a threaded machine code program from a
         nonthreaded C++ program.
                                                                                
   4) Q: Does the FDPR tool address SMP optimization and                        
         programming structure assistance?                                      
                                                                                
      A: No, FDPR does not address SMP optimization.  All
         optimizers provide programming structure assistance
         only in the sense that they restructure your code.
         Optimizers will attempt to move apart instructions that
         depend on each other, thereby allowing parallel
         execution within a single CPU.  They will also unroll
         loops, expand small functions inline, delete
         unnecessary code, and combine
         multiple operations into a single instruction (for                    
         example, RISC System compilers can combine a multiply                  
         operation and an add operation into a single                           
         multiply-add machine instruction).                                     
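
         As a small source-level illustration, the C sketch
         below evaluates a polynomial by Horner's rule.  Each
         step is one multiply feeding one add, which is the
         pattern a compiler may turn into a single multiply-add
         instruction, and the small helper is a natural
         candidate for inline expansion.  The names mul_add and
         poly_eval are hypothetical, and whether a given
         compiler actually applies either transformation depends
         on the compiler and the options used.

```c
#include <stddef.h>

/* Small helper: a candidate for inline expansion at the call site. */
static double mul_add(double a, double b, double c)
{
    return a * b + c;   /* one multiply feeding one add */
}

/*
 * Horner evaluation of a polynomial in x.
 * coef[0] is the highest-order coefficient; n is the number of
 * coefficients.
 */
double poly_eval(const double *coef, size_t n, double x)
{
    double acc = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        acc = mul_add(acc, x, coef[i]);
    return acc;
}
```

         For example, with coefficients 1, 2, 3 and x = 2 the
         function computes 1*x*x + 2*x + 3.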
                                                                                
         FDPR's advantage over regular optimization is, among
         other things, the ability to reorder instructions
         globally.  Normal optimizers may reorder the execution
         of instructions within a single function.  FDPR can
         determine whether it is beneficial to reorder
         instructions that lie on opposite sides of function
         boundaries.  This is a full order of magnitude more
         difficult than normal optimization.
                                                                                
   5) Q: How does C++ do without the toolbox and optimizer?

      A: I am not certain what information you require here.
         I assume that you would like a comparison between code
         compiled with and without the FDPR optimizer.

         IBM's published comparison is as follows:
           - FDPR improves performance by up to 73 percent
             (typically 10 to 20 percent).
           - FDPR reduces text memory requirements by up to 61
             percent (typically 20 to 30 percent).

         If I have misinterpreted this question, please reply
         with a clarification; I am sorry for any resulting
         delay.

   6) Q: What have you seen to give the best results, both
         uniprocessor and SMP?
                                                                                
      A: An SMP machine will run a single process faster if its
         code is threaded, because it schedules work at thread
         granularity.  Different processes, and different threads
         of the same process, can run on different processors
         simultaneously.  Thus, threading a process will make it
         run faster on an SMP machine (so long as the threads
         spend most of their time executing simultaneously).
         Likewise, splitting a single-process application into
         multiple processes will make it run faster on an SMP
         machine (so long as the processes spend most of their
         time executing simultaneously).
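
         To make the idea of threading by hand concrete, here is
         a minimal C sketch that splits a dot product across a
         fixed number of threads.  It assumes the POSIX threads
         interface (pthread_create and pthread_join); the thread
         library shipped with a particular AIX release may
         differ in detail.  The names dot_task, dot_worker,
         dot_threaded, and NTHREADS are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4   /* assumed thread count, e.g. one per processor */

/* Work description for one thread: a slice [begin, end) of the vectors. */
struct dot_task {
    const double *x;
    const double *y;
    size_t begin;
    size_t end;
    double partial;   /* written only by the owning thread */
};

/* Thread entry point: computes the partial sum for one slice. */
static void *dot_worker(void *arg)
{
    struct dot_task *t = (struct dot_task *)arg;
    double sum = 0.0;
    size_t i;

    for (i = t->begin; i < t->end; i++)
        sum += t->x[i] * t->y[i];
    t->partial = sum;
    return NULL;
}

/*
 * Computes the dot product using NTHREADS threads.
 * Returns 0 on success and stores the result through `result`;
 * returns -1 if a thread could not be created or joined.
 */
int dot_threaded(const double *x, const double *y, size_t n,
                 double *result)
{
    pthread_t tid[NTHREADS];
    struct dot_task task[NTHREADS];
    size_t chunk = n / NTHREADS;
    double total = 0.0;
    int started = 0;
    int status = 0;
    int k;

    for (k = 0; k < NTHREADS; k++) {
        task[k].x = x;
        task[k].y = y;
        task[k].begin = (size_t)k * chunk;
        /* The last thread also takes the leftover elements. */
        task[k].end = (k == NTHREADS - 1) ? n : task[k].begin + chunk;
        task[k].partial = 0.0;
        if (pthread_create(&tid[k], NULL, dot_worker, &task[k]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* Wait for every thread that was started, then combine the sums. */
    for (k = 0; k < started; k++) {
        if (pthread_join(tid[k], NULL) != 0)
            status = -1;
        total += task[k].partial;
    }

    if (status == 0)
        *result = total;
    return status;
}
```

         Each thread writes only its own partial sum, so no
         locking is needed; the sums are combined after the
         threads have been joined.  Any speedup depends on the
         slices being large enough to outweigh the cost of
         creating the threads.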
                                                                               
         Since our thread support goes only so far as to provide
         the ability to develop threaded code, all other advice
         applies to both uniprocessors and multiprocessors.
                                                                                
         The "-O3" option is outstanding.  I have found that I                  
         cannot develop code by hand that is as efficient as the                
         code output by the compiler with this option.                          
                                                                                
         The "-qarch=" option also helps significantly in many                  
         situations, especially on PowerPC architectures.  Use                  
         "-qarch=ppc" for PowerPCs, "-qarch=pwr" for POWER                      
         architecture CPUs, and "-qarch=pwr2" for POWER2s.  Code                
         compiled with this option will not run on all RISC
         Systems; it will run only on a machine with the target
         architecture (although code compiled with "-qarch=pwr"
         will run on both POWER and POWER2).
                                                                                
         I have not used the FDPR tool.  The documentation claims
         that significant performance enhancements are possible;
         if you are already considering it and the applications
         you develop must have maximal performance, then I would
         recommend it.
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
                                                                                
                                                                                
This item was created from library item Q676519      FFJFB                      
                                                                                
Additional search words:                                                        
AIX ALTERNATE COMPILERS FFJFB INDEX IX JAN95 OPTIMIZATION OPTIMIZE             
OZNEW RISCL RISCSYSTEM SOFTWARE 4.1.X                                           
                                                                                

Dated: 06/1996 Category: RISCL