SOPALE uses Fortran 77 source formatting, but it requires a Fortran 90 compiler because it uses Fortran 90 modules. SOPALE has been compiled on an IBM p690 system (using IBM XL Fortran for AIX, version 8.1) and on an Opteron system (using the Intel Fortran Compiler for Linux, version 9.1).
Our priorities when choosing compiler options for SOPALE are numerical consistency first and speed second. The compiler options we have used are listed below; see also the more recent (December 2007) results further down this page.
Compiler | Options |
---|---|
IBM XL Fortran for AIX (version 8.1) | -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nomaf:nofold |
Intel Fortran Compiler for Linux (version 9.1.039) | -O2 -fp-model strict -convert big_endian -assume byterecl |

IBM XL Fortran option | Comments |
---|---|
-O3 | Performs optimizations that are intended to improve performance. These optimizations may be memory intensive and compile-time intensive, and may change the semantics of the program slightly. This level of optimization also affects the setting of the -qfloat option, turning on the fltint and rsqrt suboptions by default, and sets -qmaxmem=-1. |
-qstrict | Ensures that optimizations done by the -O3, -qhot, and -qipa options do not alter the semantics of a Fortran 90 or Fortran 95 program. |
-Q | Inlines all appropriate procedures, subject to limits on the number of inlined calls and on the resulting increase in code size. |
-qarch=auto | Automatically detects the specific architecture of the compiling machine and controls which instructions the compiler can generate. NOTE: on the p690, this turns on the -qfloat=rndsngl option. |
-qtune=auto | Automatically detects the specific processor type of the compiling machine and tunes instruction selection, scheduling, and other implementation-dependent performance enhancements for that hardware architecture. |
-qfloat=nomaf:nofold | Floating-point suboptions: nomaf = do not generate multiply-add instructions for floating-point calculations; nofold = evaluate constant floating-point expressions at run time, not at compile time. |

Intel Fortran option | Comments |
---|---|
-O2 | Optimize for speed (Intel's recommended level of optimization). On Intel EM64T Windows systems, this turns on /Og (global optimizations), /Ot (optimize for code speed), /Ob2 (inline any function at the compiler's discretion), and /Gs (stack checking disabled for routines with more than 4 KB of stack space allocated). |
-fp-model strict | Strict floating-point model: tells the compiler to adhere strictly to value-safe optimizations when implementing floating-point calculations, and enables floating-point exception semantics. |
-convert big_endian | Specifies that unformatted files containing numeric data are big endian: big-endian integers and big-endian IEEE floating-point for real and complex data. Required for consistency in output file format (the IBM p690 is big endian; the Opteron is little endian). |
-assume byterecl | RECL values in OPEN statements for unformatted files are given in bytes. |
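To make the effect of -convert big_endian concrete, here is a minimal Python sketch (not SOPALE code) that writes and reads Fortran sequential unformatted records. It assumes the common layout of a 4-byte record-length marker before and after each record's data, which this page does not state for SOPALE's output files; write_record and read_records are hypothetical helper names.

```python
import struct

# Hypothetical helpers (not SOPALE code). They assume the common layout of a
# Fortran sequential unformatted file: a 4-byte length marker, the record's
# bytes, then the same 4-byte marker again.


def write_record(values, byteorder=">"):
    """Encode a list of REAL(8) values as one record (">" = big endian)."""
    payload = struct.pack(f"{byteorder}{len(values)}d", *values)
    marker = struct.pack(f"{byteorder}i", len(payload))
    return marker + payload + marker


def read_records(data, byteorder=">"):
    """Split raw file contents into a list of record payloads."""
    records = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from(f"{byteorder}i", data, pos)
        start, end = pos + 4, pos + 4 + length
        (trailer,) = struct.unpack_from(f"{byteorder}i", data, end)
        if trailer != length:
            raise ValueError(f"record markers disagree at byte {pos}")
        records.append(data[start:end])
        pos = end + 4
    return records


data = write_record([1.5, -2.0]) + write_record([3.25])
first, second = read_records(data)
print(struct.unpack(">2d", first), struct.unpack(">d", second))
# Reading the same bytes as little endian turns the 16-byte marker into
# 268435456, so the reader runs past the end of the data.
```

If the writing and reading sides disagree about byte order, even the record lengths come out as nonsense, which is why both systems must use the same convention.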
On the p690 with the XL Fortran compiler, these are the floating-point (-qfloat) settings corresponding to the compiler options above:
Setting | Description |
---|---|
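nofltint | Turns off the optimization which converts floating-point values to integers with an inline code sequence instead of a call to a library function (the library function handles values outside the representable integer range). |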
nofold | Evaluates constant floating-point expressions at run time instead of compile time. |
nohsflt | Turns off optimization which prevents rounding for single-precision expressions and which replaces floating-point division by multiplication with the reciprocal of the divisor. |
nohssngl | Turns off optimization which rounds single-precision expressions only when the results are stored into REAL(4) memory locations. |
nomaf | Turns off optimization which uses multiply-add instructions for floating-point calculations. |
nonans | Turns off optimization that detects (at run time) operations that involve signaling NaN values (NaNS). |
rndsngl | Rounds the result of each single-precision (REAL(4)) operation to single-precision, rather than waiting until the full expression is evaluated. It sacrifices speed for consistency with results from similar calculations on other types of computers. |
rrm | Turns off compiler optimizations that require the rounding mode to be the default, round-to-nearest, at run time. |
norsqrt | Turns off optimizations that replace division by the result of a square root with multiplication by the reciprocal of the square root. |
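Two of these settings, nomaf and rndsngl, are easy to see in a small experiment. The Python sketch below is illustrative only (not SOPALE code): Python floats are IEEE doubles, the single rounding of a multiply-add instruction is emulated with exact rational arithmetic (fractions.Fraction), and REAL(4) rounding is emulated by packing to a 4-byte float with struct. The input values are chosen to expose the effect and do not come from SOPALE.

```python
from fractions import Fraction
import struct


def to_single(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 1) Multiply-add (what -qfloat=nomaf turns off).
a, b, c = 0.1, 10.0, -1.0
separate = a * b + c  # product rounded to double, then the sum rounded again
# Exact product and sum, rounded once at the end, as a fused multiply-add does.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

# 2) Single-precision rounding (what -qfloat=rndsngl enforces).
x, y, z = 1.0, 2.0**-24, 2.0**-24
per_op = to_single(to_single(x + y) + z)  # round to single after every operation
deferred = to_single(x + y + z)  # evaluate in double, round to single once

print(separate, fused)
print(per_op, deferred)
```

In both cases the two evaluation strategies give different answers from the same inputs, which is why these settings matter for bit-for-bit consistency between builds.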
In December 2007, a model using the weighted-density code was run for 10 time steps. Both the SOPALE code and blkfct were compiled in various ways with the xlf 8.1 and xlf 10.1 compilers on the p690. The number used to evaluate convergence (the "e=" number) was extracted from each resulting output file. Here are the values for time step 10, iteration 6. All models converged.
binary name | e= |
---|---|
SOPALE1_32_c081_0a.out | 0.152828643883666793E-01 |
SOPALE1_32_c081_0.out | 0.152828643883666793E-01 |
SOPALE1_32_c081_2a.out | 0.152828643883666793E-01 |
SOPALE1_32_c081_3as.out | 0.152828643883666793E-01 |
SOPALE1_32_c081_3Qatsf.out | 0.152826349772674161E-01 |
SOPALE1_32_c081_3Qatsnof.out | 0.152826596586396902E-01 |
SOPALE1_32_c081_3Qatsnom.out | 0.152826349772674161E-01 |
SOPALE1_32_c081_3Qats.out | 0.152828643883666793E-01 |
SOPALE1_32_c101_0a.out | 0.152828643883666793E-01 |
SOPALE1_32_c101_0.out | 0.152828643883666793E-01 |
SOPALE1_32_c101_2a.out | 0.152828643883666793E-01 |
SOPALE1_32_c101_3as.out | 0.152825680737751421E-01 |
SOPALE1_32_c101_3Qats.out | 0.152825680737751421E-01 |
sopale_std.out | 0.152828643883666793E-01 |
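The comparison can be automated. This minimal Python sketch uses the values copied from the table above, treats a run as consistent only if its printed e= string matches sopale_std exactly, and otherwise reports the relative difference; E_VALUES and relative_difference are illustrative names, not part of SOPALE.

```python
# e= values (time step 10, iteration 6) copied from the table above.
E_VALUES = {
    "SOPALE1_32_c081_0a": "0.152828643883666793E-01",
    "SOPALE1_32_c081_0": "0.152828643883666793E-01",
    "SOPALE1_32_c081_2a": "0.152828643883666793E-01",
    "SOPALE1_32_c081_3as": "0.152828643883666793E-01",
    "SOPALE1_32_c081_3Qatsf": "0.152826349772674161E-01",
    "SOPALE1_32_c081_3Qatsnof": "0.152826596586396902E-01",
    "SOPALE1_32_c081_3Qatsnom": "0.152826349772674161E-01",
    "SOPALE1_32_c081_3Qats": "0.152828643883666793E-01",
    "SOPALE1_32_c101_0a": "0.152828643883666793E-01",
    "SOPALE1_32_c101_0": "0.152828643883666793E-01",
    "SOPALE1_32_c101_2a": "0.152828643883666793E-01",
    "SOPALE1_32_c101_3as": "0.152825680737751421E-01",
    "SOPALE1_32_c101_3Qats": "0.152825680737751421E-01",
    "sopale_std": "0.152828643883666793E-01",
}


def relative_difference(name, reference="sopale_std"):
    """|e(name) - e(reference)| / |e(reference)|, from the printed values."""
    ref = float(E_VALUES[reference])
    return abs(float(E_VALUES[name]) - ref) / abs(ref)


for name, text in E_VALUES.items():
    if text == E_VALUES["sopale_std"]:
        print(f"{name:26s} identical to sopale_std")
    else:
        print(f"{name:26s} relative difference {relative_difference(name):.1e}")
```

Comparing the printed strings rather than the parsed floats keeps the test at "identical to all printed digits", which is the criterion used on this page.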
The blkfct libraries follow the same naming convention as the SOPALE1_32_c* binaries, but were compiled with one additional flag, -qhalt=e (which makes the compiler stop when it reports errors of severity E or higher); e.g. libblkfct_32_c081_0a.a was compiled with -O0 -qarch=auto -qhalt=e. This flag should (IMHO) be used on the SOPALE source code as well; at present, however, we cannot, because sopale main has compile errors.
Binary | Compiler | Options |
---|---|---|
SOPALE1_32_c081_0a | xlf 8.1 | -O0 -qarch=auto |
SOPALE1_32_c081_0 | xlf 8.1 | -O0 |
SOPALE1_32_c081_2a | xlf 8.1 | -O2 -qarch=auto |
SOPALE1_32_c081_3as | xlf 8.1 | -O3 -qstrict -qarch=auto |
SOPALE1_32_c081_3Qatsf | xlf 8.1 | -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nomaf:nofold |
SOPALE1_32_c081_3Qatsnof | xlf 8.1 | -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nofold |
SOPALE1_32_c081_3Qatsnom | xlf 8.1 | -O3 -qstrict -Q -qarch=auto -qtune=auto -qfloat=nomaf |
SOPALE1_32_c081_3Qats | xlf 8.1 | -O3 -qstrict -Q -qarch=auto -qtune=auto |
SOPALE1_32_c101_0a | xlf 10.1 | -O0 -qarch=auto |
SOPALE1_32_c101_0 | xlf 10.1 | -O0 |
SOPALE1_32_c101_2a | xlf 10.1 | -O2 -qarch=auto |
SOPALE1_32_c101_3as | xlf 10.1 | -O3 -qstrict -qarch=auto |
SOPALE1_32_c101_3Qats | xlf 10.1 | -O3 -qstrict -Q -qarch=auto -qtune=auto |
sopale_std | xlf 8.1 | -O3 -qstrict -Q -qarch=auto -qtune=auto |
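The binary names listed above appear to encode the compiler and options: c081/c101 for xlf 8.1/10.1, a digit for the -O level, and letters for the other flags (s = -qstrict, Q = -Q, a = -qarch=auto, t = -qtune=auto, f/nof/nom for the -qfloat settings). This mapping is inferred from the list rather than documented, so the Python sketch below is only a convenience; decode_name is a hypothetical helper. It does not add -qhalt=e, which the blkfct libraries also used.

```python
import re

# Inferred (not documented) meaning of the letters after the -O level digit.
SIMPLE_FLAGS = {"s": "-qstrict", "Q": "-Q", "a": "-qarch=auto", "t": "-qtune=auto"}
FLOAT_FLAGS = {"f": "-qfloat=nomaf:nofold", "nof": "-qfloat=nofold", "nom": "-qfloat=nomaf"}
# Order in which the options are written in the list of binaries.
FLAG_ORDER = ["-qstrict", "-Q", "-qarch=auto", "-qtune=auto"]

NAME_PATTERN = re.compile(r"_c(\d\d)(\d)_(\d)([A-Za-z]*)")


def decode_name(name):
    """Return (compiler, options) for e.g. 'SOPALE1_32_c081_3Qatsnof.out'."""
    match = NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name}")
    major, minor, level, letters = match.groups()
    flags, float_flag, i = set(), None, 0
    while i < len(letters):
        if letters.startswith(("nof", "nom"), i):  # multi-letter -qfloat codes first
            float_flag = FLOAT_FLAGS[letters[i:i + 3]]
            i += 3
        elif letters[i] == "f":
            float_flag = FLOAT_FLAGS["f"]
            i += 1
        elif letters[i] in SIMPLE_FLAGS:
            flags.add(SIMPLE_FLAGS[letters[i]])
            i += 1
        else:
            raise ValueError(f"unknown flag letter {letters[i]!r} in {name}")
    options = [f"-O{level}"] + [flag for flag in FLAG_ORDER if flag in flags]
    if float_flag is not None:
        options.append(float_flag)
    return f"xlf {int(major)}.{minor}", " ".join(options)


print(decode_name("SOPALE1_32_c081_3Qatsnof.out"))
```

Names that do not follow the pattern (such as sopale_std) raise ValueError instead of being guessed.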
Each SOPALE1_32_c* binary was linked with the corresponding blkfct library; e.g. SOPALE1_32_c081_0a was linked with ~beaumnt1/blkfct/libblkfct_32_c081_0a.a, and sopale_std was linked with ~beaumnt1/blkfct/libblkfct_std.a.
Elapsed times were recorded for some of the runs. They fall into two groups: about 3 minutes and about 13 minutes. At optimization level 0, every run took about 13 minutes; at optimization level 2 and above, about 3 minutes. Both compilers behave the same way.
Binary Name | Start Time | End Time |
---|---|---|
SOPALE1_32_c081_0a.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:24:28 AST 2007 |
SOPALE1_32_c081_0.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:24:36 AST 2007 |
SOPALE1_32_c081_2a.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:14:36 AST 2007 |
SOPALE1_32_c081_3as.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:14:54 AST 2007 |
SOPALE1_32_c081_3Qats.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:14:55 AST 2007 |
SOPALE1_32_c101_0a.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:24:28 AST 2007 |
SOPALE1_32_c101_0.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:24:34 AST 2007 |
SOPALE1_32_c101_2a.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:14:41 AST 2007 |
SOPALE1_32_c101_3as.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:14:21 AST 2007 |
SOPALE1_32_c101_3Qats.out | Thu Dec 6 11:11:30 AST 2007 | Thu Dec 6 11:14:22 AST 2007 |
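The elapsed times can be computed from the timestamps in the table above. This is a minimal Python sketch; it matches "AST" literally (all runs were logged in the same time zone) and assumes an English (C) locale for the day and month names. START, END_TIMES and elapsed_seconds are illustrative names.

```python
from datetime import datetime

# Timestamps copied from the table above; every run started at the same time.
FMT = "%a %b %d %H:%M:%S AST %Y"
START = "Thu Dec 6 11:11:30 AST 2007"
END_TIMES = {
    "SOPALE1_32_c081_0a": "Thu Dec 6 11:24:28 AST 2007",
    "SOPALE1_32_c081_0": "Thu Dec 6 11:24:36 AST 2007",
    "SOPALE1_32_c081_2a": "Thu Dec 6 11:14:36 AST 2007",
    "SOPALE1_32_c081_3as": "Thu Dec 6 11:14:54 AST 2007",
    "SOPALE1_32_c081_3Qats": "Thu Dec 6 11:14:55 AST 2007",
    "SOPALE1_32_c101_0a": "Thu Dec 6 11:24:28 AST 2007",
    "SOPALE1_32_c101_0": "Thu Dec 6 11:24:34 AST 2007",
    "SOPALE1_32_c101_2a": "Thu Dec 6 11:14:41 AST 2007",
    "SOPALE1_32_c101_3as": "Thu Dec 6 11:14:21 AST 2007",
    "SOPALE1_32_c101_3Qats": "Thu Dec 6 11:14:22 AST 2007",
}


def elapsed_seconds(start, end):
    """Wall-clock seconds between two timestamps in the format above."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


for name, end in END_TIMES.items():
    secs = elapsed_seconds(START, end)
    print(f"{name:24s} {secs // 60:2d} min {secs % 60:02d} s")
```

The -O0 runs take just under or just over 13 minutes, and the -O2 and -O3 runs take roughly 3 minutes.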
With both compilers, going from optimization level 0 to level 2 or above cuts the run time from about 13 minutes to about 3 minutes, roughly a fourfold speedup. Levels above 3 were not tried.
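The xlf 8.1 compiler produces results identical (to all printed digits) to its -O0 results up to optimization level 3, provided the options recommended below are used; adding -qfloat=nomaf, -qfloat=nofold, or both at -O3 changes the e= value in the sixth significant digit. The xlf 10.1 compiler produces the same results as xlf 8.1 at optimization levels 0 and 2, but at level 3 the value changes, even with -qstrict.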
Compiler | Options | Comments |
---|---|---|
xlf 8.1 | -O3 -qstrict -Q -qarch=auto -qtune=auto | Produces consistent results. |
xlf 10.1 | Not recommended | Initial tests indicate that it may be OK up to optimization level 2, but it has not been tested sufficiently for us to be confident that it will produce results consistent with xlf 8.1. Those using xlf 10.1 are on their own. |