Compiling SOPALE

SOPALE uses Fortran 77 source formatting, but it requires a Fortran 90 compiler because it uses Fortran 90 modules. SOPALE has been compiled on an IBM p690 system (using IBM XL Fortran for AIX, version 8.1) and on an Opteron system (using Intel Fortran Compiler for Linux, version 9.1).

Our priorities when choosing compiler options for SOPALE are numerical consistency and speed, in that order. The compiler options we have used are listed here; see the Dec 2007 testing further down for more recent results.

Pre-Dec 2007

Compiler: IBM XL Fortran for AIX (version 8.1)
Options:  -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nomaf:nofold
Comments:
  -O3
    Performs optimizations that are intended to offer improved performance; optimizations may be memory intensive, compile-time intensive, and may change the semantics of the program slightly. This level of optimization also affects the setting of the -qfloat option, turning on the fltint and rsqrt suboptions by default, and sets -qmaxmem=-1.
  -qstrict
    Ensures that optimizations done by the -O3, -qhot, and -qipa options do not alter the semantics of a Fortran 90 or Fortran 95 program.
  -Q
    Inlines all appropriate procedures, subject to limits on the number of inlined calls and the amount of code size increase as a result.
  -qarch=auto
    Automatically detects the specific architecture of the compiling machine and controls which instructions the compiler can generate. NOTE: on the p690, this will turn on the -qfloat=rndsngl option.
  -qtune=auto
    Automatically detects the specific processor type of the compiling machine and tunes instruction selection, scheduling, and other implementation-dependent performance enhancements for that hardware architecture.
  -qfloat=nomaf:nofold
    Floating point options: nomaf = do not generate multiply-add instructions for floating-point calculations; nofold = evaluate constant floating-point expressions at run time, not at compile time.

Compiler: Intel Fortran Compiler for Linux (version 9.1.039)
Options:  -O2 -fp-model strict -convert big_endian -assume byterecl
Comments:
  -O2
    Optimize for speed (Intel recommended level of optimization). On Intel EM64T Windows systems, this turns on /Og (global optimizations), /Ot (optimize for code speed), /Ob2 (enables inlining of any function at the compiler's discretion), and /Gs (stack checking is disabled for routines with more than 4KB of stack space allocated).
  -fp-model strict
    Strict floating-point model: tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations and enables floating-point exception semantics.
  -convert big_endian
    Specifies that the format of unformatted files containing numeric data will be big endian for integer data and big endian IEEE floating-point for real and complex data. (Required for consistency in output file format.)
  -assume byterecl
    Units for OPEN statement RECL values with unformatted files are in byte units.
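
As an illustration of why -convert big_endian is needed for a consistent output file format: the p690 (POWER) is a big-endian machine, while the Opteron (x86-64) is natively little-endian, so the same double-precision value is stored with its bytes in opposite order unless conversion is requested. The Python sketch below is not part of SOPALE. It only shows the two byte layouts for one value, and it ignores the record markers that Fortran unformatted files also contain.

```python
import struct

value = 1.0

# The same IEEE double, serialized in each byte order.
big = struct.pack(">d", value)      # big endian (p690 native layout)
little = struct.pack("<d", value)   # little endian (Opteron native layout)

print("big endian   :", big.hex())
print("little endian:", little.hex())

# Reading big-endian bytes as if they were little-endian gives a wrong value.
misread = struct.unpack("<d", big)[0]
print("big-endian bytes read as little-endian:", misread)
```

With the conversion option in place, files written on the Opteron should use the same byte order for numeric data as files written on the p690.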

Floating Point Settings for SOPALE

On the p690 with the XL Fortran compiler, these are the floating point settings corresponding to the compiler options above:

nofltint: Turns off the optimization which speeds up floating-point-to-integer conversions by using inline code instead of a call to a library function.
nofold: Evaluates constant floating-point expressions at run time instead of compile time.
nohsflt: Turns off optimization which prevents rounding for single-precision expressions and which replaces floating-point division by multiplication with the reciprocal of the divisor.
nohssngl: Turns off optimization which rounds single-precision expressions only when the results are stored into REAL(4) memory locations.
nomaf: Turns off optimization which uses multiply-add instructions for floating-point calculations.
nonans: Turns off optimization that detects (at run time) operations that involve signaling NaN values (NaNS).
rndsngl: Rounds the result of each single-precision (REAL(4)) operation to single-precision, rather than waiting until the full expression is evaluated. It sacrifices speed for consistency with results from similar calculations on other types of computers.
rrm: Turns off compiler optimizations that require the rounding mode to be the default, round-to-nearest, at run time.
norsqrt: Turns off optimizations that replace division by the result of a square root with multiplication by the reciprocal of the square root.
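
The nomaf setting affects consistency because a fused multiply-add rounds once, whereas a separate multiply followed by an add rounds twice, so the two can give different answers for the same expression. The Python sketch below is an illustration only and is not SOPALE code. It emulates the single rounding of a fused multiply-add with exact rational arithmetic (the helper name fused_multiply_add is made up for this example) and compares it with ordinary double-precision arithmetic.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate a*b + c with a single final rounding (hypothetical helper)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding so far
    return float(exact)  # one rounding to the nearest double

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate operations: the product is rounded to a double before the add.
separate = a * b + c

# Fused operation: only the final result is rounded.
fused = fused_multiply_add(a, b, c)

print("separate multiply and add:", separate)
print("fused multiply-add       :", fused)
```

The inputs are chosen so that the exact product, 1 - 2**-60, is not representable as a double. Differences of this kind are small, but they can accumulate over many iterations of a solver.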

Testing, Dec 2007

In Dec 2007, a model was run for 10 timesteps, using the weighted-density code. Both the sopale code and blkfct were compiled in various ways, using the xlf 8.1 and xlf 10.1 compilers on the p690. The number that is used to evaluate convergence (otherwise known as the "e=" number) was extracted from resulting output files. Here are the numbers for time step 10, iteration 6. All models converged.

Sopale binaries and the "e=" number
Binary name                   e=
SOPALE1_32_c081_0a.out        0.152828643883666793E-01
SOPALE1_32_c081_0.out         0.152828643883666793E-01
SOPALE1_32_c081_2a.out        0.152828643883666793E-01
SOPALE1_32_c081_3as.out       0.152828643883666793E-01
SOPALE1_32_c081_3Qatsf.out    0.152826349772674161E-01
SOPALE1_32_c081_3Qatsnof.out  0.152826596586396902E-01
SOPALE1_32_c081_3Qatsnom.out  0.152826349772674161E-01
SOPALE1_32_c081_3Qats.out     0.152828643883666793E-01
SOPALE1_32_c101_0a.out        0.152828643883666793E-01
SOPALE1_32_c101_0.out         0.152828643883666793E-01
SOPALE1_32_c101_2a.out        0.152828643883666793E-01
SOPALE1_32_c101_3as.out       0.152825680737751421E-01
SOPALE1_32_c101_3Qats.out     0.152825680737751421E-01
sopale_std.out                0.152828643883666793E-01
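
One way to read these numbers is to group the binaries by their "e=" value and measure how far each is from a reference. The Python sketch below does that with the values copied from the table. Treating sopale_std as the reference is an assumption made for this example, not something the test procedure prescribes.

```python
from collections import defaultdict
from decimal import Decimal

# "e=" values for time step 10, iteration 6, copied from the table.
E_VALUES = {
    "SOPALE1_32_c081_0a": "0.152828643883666793E-01",
    "SOPALE1_32_c081_0": "0.152828643883666793E-01",
    "SOPALE1_32_c081_2a": "0.152828643883666793E-01",
    "SOPALE1_32_c081_3as": "0.152828643883666793E-01",
    "SOPALE1_32_c081_3Qatsf": "0.152826349772674161E-01",
    "SOPALE1_32_c081_3Qatsnof": "0.152826596586396902E-01",
    "SOPALE1_32_c081_3Qatsnom": "0.152826349772674161E-01",
    "SOPALE1_32_c081_3Qats": "0.152828643883666793E-01",
    "SOPALE1_32_c101_0a": "0.152828643883666793E-01",
    "SOPALE1_32_c101_0": "0.152828643883666793E-01",
    "SOPALE1_32_c101_2a": "0.152828643883666793E-01",
    "SOPALE1_32_c101_3as": "0.152825680737751421E-01",
    "SOPALE1_32_c101_3Qats": "0.152825680737751421E-01",
    "sopale_std": "0.152828643883666793E-01",
}
REFERENCE = "sopale_std"  # assumed reference binary for this sketch

# Group binaries that produced exactly the same printed value.
groups = defaultdict(list)
for name, e_text in E_VALUES.items():
    groups[e_text].append(name)

# Relative difference of each binary from the reference value.
ref = Decimal(E_VALUES[REFERENCE])
rel_diff = {name: abs(Decimal(e) - ref) / ref for name, e in E_VALUES.items()}

for e_text, names in groups.items():
    print(f"{e_text}  rel. diff {float(rel_diff[names[0]]):.2e}  {', '.join(names)}")
```

Decimal is used so that all 18 printed digits are compared, rather than only the digits that survive conversion to a double.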

File Naming

The naming convention for both blkfct and SOPALE1_32_c* is the same. The blkfct libraries were compiled with one additional flag, -qhalt=e; for example, libblkfct_32_c081_0a.a was compiled with -O0 -qarch=auto -qhalt=e. This flag should (IMHO) be used on the sopale source code as well; however, at present it cannot be, because sopale main has compile errors.

Sopale Binaries Used

SOPALE1_32_c081_0a    		compiled with xlf 8.1, options -O0 -qarch=auto
SOPALE1_32_c081_0     		compiled with xlf 8.1, options -O0
SOPALE1_32_c081_2a    		compiled with xlf 8.1, options -O2 -qarch=auto
SOPALE1_32_c081_3as   		compiled with xlf 8.1, options -O3 -qstrict -qarch=auto
SOPALE1_32_c081_3Qatsf		compiled with xlf 8.1, options -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nomaf:nofold
SOPALE1_32_c081_3Qatsnof	compiled with xlf 8.1, options -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nofold
SOPALE1_32_c081_3Qatsnom	compiled with xlf 8.1, options -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nomaf
SOPALE1_32_c081_3Qats 		compiled with xlf 8.1, options -O3 -qstrict -Q -qarch=auto -qtune=auto
SOPALE1_32_c101_0a    		compiled with xlf10.1, options -O0 -qarch=auto
SOPALE1_32_c101_0     		compiled with xlf10.1, options -O0 
SOPALE1_32_c101_2a    		compiled with xlf10.1, options -O2 -qarch=auto
SOPALE1_32_c101_3as   		compiled with xlf10.1, options -O3 -qstrict -qarch=auto
SOPALE1_32_c101_3Qats 		compiled with xlf10.1, options -O3 -qstrict -Q -qarch=auto -qtune=auto
sopale_std			compiled with xlf 8.1, options -O3 -qstrict -Q -qarch=auto -qtune=auto
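
The binary names appear to encode the compiler version and the options. The Python sketch below decodes a name into its option string. The letter-to-flag mapping is inferred from the listing above and is an assumption, and the function name decode_binary_name is made up for this example. It does not handle sopale_std or the blkfct libraries, which carry the extra -qhalt=e flag.

```python
import re

# Mapping inferred from the listing of binaries; treat it as an assumption.
COMPILERS = {"c081": "xlf 8.1", "c101": "xlf 10.1"}
FLOAT_SUFFIXES = {
    "f": "-qfloat=nomaf:nofold",
    "nof": "-qfloat=nofold",
    "nom": "-qfloat=nomaf",
}
NAME_PATTERN = re.compile(
    r"SOPALE1_32_(c\d{3})_(\d)(Q?)(a?)(t?)(s?)(f|nof|nom)?"
)

def decode_binary_name(name):
    """Return (compiler, option string) for a SOPALE1_32_c* binary name."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized binary name: {name}")
    compiler, level, inline, arch, tune, strict, float_suffix = match.groups()
    # Emit the flags in the same order as the listing.
    flags = [f"-O{level}"]
    if strict:
        flags.append("-qstrict")
    if inline:
        flags.append("-Q")
    if arch:
        flags.append("-qarch=auto")
    if tune:
        flags.append("-qtune=auto")
    if float_suffix:
        flags.append(FLOAT_SUFFIXES[float_suffix])
    return COMPILERS.get(compiler, compiler), " ".join(flags)

for example in ("SOPALE1_32_c081_0a", "SOPALE1_32_c081_3Qatsnof", "SOPALE1_32_c101_3as"):
    print(example, "->", decode_binary_name(example))
```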

Blkfct Used

Each SOPALE1_32_c* was linked with a corresponding blkfct.  
  e.g.              
                       SOPALE1_32_c081_0a 
  was linked with
    ~beaumnt1/blkfct/libblkfct_32_c081_0a.a
sopale_std was linked with ~beaumnt1/blkfct/libblkfct_std.a

Elapsed Execution Times

The elapsed times were recorded for some of the runs. The times seem to fall into one of two categories: 3 minutes and 13 minutes. At optimization level 0, the execution times are all about 13 minutes. At optimization level 2 and above, the times are about 3 minutes. This behaviour is similar for both compilers.

Elapsed Execution Times
Binary Name                Start Time                   End Time
SOPALE1_32_c081_0a.out     Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:24:28 AST 2007
SOPALE1_32_c081_0.out      Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:24:36 AST 2007
SOPALE1_32_c081_2a.out     Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:14:36 AST 2007
SOPALE1_32_c081_3as.out    Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:14:54 AST 2007
SOPALE1_32_c081_3Qats.out  Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:14:55 AST 2007
SOPALE1_32_c101_0a.out     Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:24:28 AST 2007
SOPALE1_32_c101_0.out      Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:24:34 AST 2007
SOPALE1_32_c101_2a.out     Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:14:41 AST 2007
SOPALE1_32_c101_3as.out    Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:14:21 AST 2007
SOPALE1_32_c101_3Qats.out  Thu Dec 6 11:11:30 AST 2007  Thu Dec 6 11:14:22 AST 2007
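
The elapsed time for each run follows from subtracting the start time from the end time. The Python sketch below does this for the timestamps in the table. The parsing helper parse_stamp is written for this example only, and it ignores the weekday and the time zone label, which is safe here because every timestamp is in AST.

```python
from datetime import datetime

MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}

def parse_stamp(text):
    """Parse e.g. 'Thu Dec 6 11:24:28 AST 2007', ignoring weekday and zone."""
    _weekday, month, day, clock, _zone, year = text.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)

START = "Thu Dec 6 11:11:30 AST 2007"  # every run has the same start time
END_TIMES = {
    "SOPALE1_32_c081_0a": "Thu Dec 6 11:24:28 AST 2007",
    "SOPALE1_32_c081_0": "Thu Dec 6 11:24:36 AST 2007",
    "SOPALE1_32_c081_2a": "Thu Dec 6 11:14:36 AST 2007",
    "SOPALE1_32_c081_3as": "Thu Dec 6 11:14:54 AST 2007",
    "SOPALE1_32_c081_3Qats": "Thu Dec 6 11:14:55 AST 2007",
    "SOPALE1_32_c101_0a": "Thu Dec 6 11:24:28 AST 2007",
    "SOPALE1_32_c101_0": "Thu Dec 6 11:24:34 AST 2007",
    "SOPALE1_32_c101_2a": "Thu Dec 6 11:14:41 AST 2007",
    "SOPALE1_32_c101_3as": "Thu Dec 6 11:14:21 AST 2007",
    "SOPALE1_32_c101_3Qats": "Thu Dec 6 11:14:22 AST 2007",
}

# Elapsed wall-clock time in whole seconds for each binary.
start = parse_stamp(START)
elapsed = {
    name: int((parse_stamp(end) - start).total_seconds())
    for name, end in END_TIMES.items()
}

for name, seconds in elapsed.items():
    print(f"{name:24s} {seconds // 60:2d} min {seconds % 60:02d} s")
```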

Conclusions

Both compilers produce a large increase in execution speed, roughly fourfold (from about 13 minutes down to about 3 minutes), when changing from optimization level 0 to level 2 and above. Levels above 3 were not tried.

The xlf 8.1 compiler produces consistent results up to optimization level 3, provided the indicated compiler options are used. The xlf 10.1 compiler produces the same results at optimization levels 0 and 2, but at level 3 the "e=" number starts to differ.

For p690.ucis.dal.ca, Dec 2007

Compiler: xlf 8.1
Options:  -O3 -qstrict -Q -qarch=auto -qtune=auto
Comments: Produces consistent results.

Compiler: xlf 10.1
Comments: Not recommended. Initial tests indicate that it may be OK up to optimization level 2. It has not been sufficiently tested for us to be confident that it will produce results that are consistent with xlf 8.1. Those using xlf 10.1 are on their own.