1. Overview and rationale

Standard FORTRAN 77 defines the real and double precision types in a way that does not allow portable scientific software to be produced from a single source with comparable numerical performance [bib1], [bib2]. FORTRAN 90 adds the notion of length to type declarations, but unfortunately this gives no guarantee as to the precision actually used, which depends on the compiler implementation.

The IEEE P754 standard [3] defines precision limits for binary formats, but it is not universally applied:

type	significant decimal digits (always)	significant decimal digits (sometimes)
single precision	6	7 or 8
double precision	15	16
single extended	> 9	
double extended	> 18	

It is therefore necessary to make up for the shortcomings of standard FORTRAN 77, which does not set rules for representing numbers, although there are appropriate algorithms that make it possible to determine certain parameters dynamically.
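
As an illustration of such a dynamic determination, the following C sketch estimates the base and the mantissa length of the double type with Malcolm's classical probing technique. The names detect_base and detect_mantissa_digits are hypothetical and are not ENVIMA functions; the volatile temporaries are an assumption made here so that intermediate results are rounded to double precision.

```c
#include <float.h>

/* Hypothetical helper (not an ENVIMA function): find the base B of the
   double type by probing the arithmetic. */
int detect_base(void)
{
    volatile double a = 1.0, b = 1.0, t;

    /* Grow a until adding 1.0 is no longer exact: a has then used up
       all the mantissa digits. */
    for (;;) {
        t = a + 1.0;
        t = t - a;
        if (t - 1.0 != 0.0)
            break;
        a = a * 2.0;
    }
    /* The smallest b that changes a reveals the spacing, i.e. the base. */
    for (;;) {
        t = a + b;
        t = t - a;
        if (t != 0.0)
            break;
        b = b + 1.0;
    }
    return (int)t;
}

/* Hypothetical helper: count how many base-B digits the mantissa holds,
   i.e. the number of multiplications by B before adding 1.0 stops
   being exact. */
int detect_mantissa_digits(int base)
{
    volatile double a = 1.0, t;
    int n = 0;

    for (;;) {
        t = a + 1.0;
        t = t - a;
        if (t - 1.0 != 0.0)
            break;
        a = a * base;
        n++;
    }
    return n;
}
```

On a platform where the values are known statically, the results can be compared with FLT_RADIX and DBL_MANT_DIG from float.h.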

This version of ENVIMA derives from an earlier implementation dating from 1990, which proved to be oversized: many of its functions were never used in Code_Aster. The FORTRAN implementation has been replaced by code written in C, so that these functions can be grouped with all the other functions that depend on the machine and/or the operating system. On this occasion, the complex-type functions, whose use was marginal and which pose a real portability problem (the complex type exists in FORTRAN but not in C), were simply removed.

The ENVIMA software package includes several functions without arguments, with names of the form --EM, which give any FORTRAN routine access to the parameters characterizing the machine on which the processing is performed.

The parameters are set statically in each version of the software package for:

integers: length, extreme values;
logicals: length;
floats: base of the representation system, length of the mantissa, relative precision, representable extreme values;
constants: particular constants (\(\pi\), NaN, ...);
files: size limits (linked to machine operating constraints).

Four function groups are available:

arithmetic parameters (definition of numbers),
parameters for the use of the central (main) memory,
parameters for the use of auxiliary memories (files),
constant values.

Some definitions:

Addressing unit: Each manufacturer defines, for a given machine, how information is addressed in memory; the unit in which addresses are measured is the addressing unit. In the past it was the word on some platforms; it is now the byte on most x86 processor-based workstations.
Length: Each type of variable is characterized by the length of its machine representation, which can be measured in bits, bytes, or addressing units. The JEVEUX memory manager used in Code_Aster requires this information when defining the type attribute of the objects it creates.

1.1. Arithmetic parameters

1.1.1. Representation of whole numbers

Four parameters are available for standard integer variables:

the length of an integer measured in bits, bytes, or addressing units;
the maximum number of significant figures to represent the number in decimal;
the maximum representable value, i.e. the largest positive integer \(i\) such that all the integers in the interval \([-i,+i]\) are representable by the integer type;
the range, defined by the largest integer \(i\) such that:
- \(-i\) is exact,
- for \(-i<{i}_{a},{i}_{b}<+i\), the operation \(\mid {i}_{a}\oplus {i}_{b}\mid\) with \(\oplus \in \{+,-,\ast \}\) is exact and does not exceed \(i\) in absolute value.
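
A minimal C sketch of how such integer parameters can be obtained from limits.h is given here. All function names are hypothetical (they are not the ENVIMA functions), and the C type long is assumed to stand in for the standard integer type.

```c
#include <limits.h>

/* Hypothetical helpers, not ENVIMA functions. The C type long stands in
   for the standard integer type here (an assumption). */

/* Length of the integer type in bytes and in bits. */
int integer_length_bytes(void) { return (int)sizeof(long); }
int integer_length_bits(void)  { return (int)sizeof(long) * CHAR_BIT; }

/* Number of decimal digits needed to write a non-negative value. */
int decimal_digits(long v)
{
    int n = 1;
    while (v >= 10) {
        v /= 10;
        n++;
    }
    return n;
}

/* Largest i such that every integer in [-i, +i] is representable. */
long integer_max(void)
{
    /* -LONG_MAX is always >= LONG_MIN, so [-LONG_MAX, +LONG_MAX] fits. */
    return LONG_MAX;
}

/* True when a + b can be computed exactly, i.e. without overflow.
   The test itself never overflows. */
int add_is_exact(long a, long b)
{
    if (b > 0)
        return a <= LONG_MAX - b;
    return a >= LONG_MIN - b;
}
```

The maximum number of significant decimal figures is then decimal_digits(integer_max()), and add_is_exact shows the kind of check that underlies the notion of range for the addition.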

1.1.2. Representation of real numbers

The definition of the standard real type in standard FORTRAN 77 does not make it possible to create portable software from a single source with comparable numerical performance. To achieve this objective, the non-standard type REAL*8 is used throughout Code_Aster: it is accepted by a large number of compilers and leads to the most similar representations (64 bits on any platform).

The machine representation of floating-point numbers can be pictured as follows:

\(x=\sigma {B}^{E}\sum _{k=1}^{N}x(k){B}^{-k}\) where \(x\) refers to a real number,

\(\sigma\) the sign,

\(B\) the representation base (2 most of the time),

\(E\) the exponent (\({E}_{\mathit{min}}\le E\le {E}_{\mathit{max}}\)),

\(N\) the number of digits allocated to the mantissa.

This representation imposes \(0<x(1)<B\) and \(0\le x(i)<B\) for \(1<i\le N\). Two distinct real numbers \({x}_{1}\) and \({x}_{2}\) whose representation above is written with the same exponent \(E\) therefore differ by at least \({B}^{E-N}\), that is \({B}^{-N}\) relative to \({B}^{E}\). When the exponent differs by one unit, the difference between the two real numbers is at least \({B}^{1-N}\).

The values \(B,N,{E}_{\mathit{min}}\) and \({E}_{\mathit{max}}\) have been entered into the software and can be retrieved by the appropriate function.

It is then easy to define the following characteristic values:

the smallest positive real: \({B}^{{E}_{\mathit{min}}-1}\),
the largest positive real: \({B}^{{E}_{\mathit{max}}}(1-{B}^{-N})\),
the smallest relative increment: \({B}^{-N}\),
the largest relative increment: \({B}^{1-N}\).
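
The characteristic values can be checked numerically. The sketch below, in C, evaluates them for the double type, assuming that \(B\), \(N\), \({E}_{\mathit{min}}\) and \({E}_{\mathit{max}}\) correspond to FLT_RADIX, DBL_MANT_DIG, DBL_MIN_EXP and DBL_MAX_EXP from float.h (the C standard uses the same normalization of the mantissa). The function names are hypothetical and are not ENVIMA functions.

```c
#include <float.h>

/* Multiply x by B^e using only multiplications or divisions by B
   (hypothetical helper; avoids any dependency on the math library). */
static double scale_by_base(double x, int e)
{
    const double base = (double)FLT_RADIX;
    for (; e > 0; e--) x *= base;
    for (; e < 0; e++) x /= base;
    return x;
}

/* Smallest positive (normalized) real: B^(Emin-1). */
double smallest_positive(void)
{
    return scale_by_base(1.0, DBL_MIN_EXP - 1);
}

/* Largest positive real: B^Emax * (1 - B^(-N)). The factor (1 - B^(-N))
   is formed first so that no intermediate value overflows. */
double largest_positive(void)
{
    double mantissa = 1.0 - scale_by_base(1.0, -DBL_MANT_DIG);
    return scale_by_base(mantissa, DBL_MAX_EXP);
}

/* Smallest relative increment: B^(-N). */
double smallest_relative_increment(void)
{
    return scale_by_base(1.0, -DBL_MANT_DIG);
}

/* Largest relative increment: B^(1-N). */
double largest_relative_increment(void)
{
    return scale_by_base(1.0, 1 - DBL_MANT_DIG);
}
```

With IEEE 754 double precision these values coincide with DBL_MIN, DBL_MAX, DBL_EPSILON / FLT_RADIX and DBL_EPSILON respectively.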

The available settings are:

the length in bits, bytes, or addressing units;
the maximum number of significant figures to represent the number in decimal;
the representation base B for floating numbers;
the length of the mantissa;
the relative precision, such that no real other than 1.0 is representable in the interval \(1.0-{\varepsilon }_{1}<1.0<1.0+{\varepsilon }_{2}\), with \({\varepsilon }_{1}=(1/B){\varepsilon }_{2}\);
the representable positive extreme values: maximum (overflow) and minimum (underflow);
the range, defined by the greatest real \(x\) such that, if \({\varepsilon }_{1},{\varepsilon }_{2},{\varepsilon }_{3}\) are of the order of the relative precision \({\varepsilon }_{1}\):
- \(-x\) is correctly represented by \(-x(1\pm {\varepsilon }_{1})\);
- for \(1/x<\mid a\mid ,\mid b\mid <x\), the operation \(a\oplus b\) with \(\oplus \in \{+,-,\ast ,/\}\) such that \(1/x<\mid a\oplus b\mid <x\) is correctly represented by \(a(1\pm {\varepsilon }_{2})\oplus b(1\pm {\varepsilon }_{3})\).
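
The relation between \({\varepsilon }_{1}\) and \({\varepsilon }_{2}\) can be observed directly around 1.0. The following C sketch assumes the double type, the default round-to-nearest mode, and that \({\varepsilon }_{2}\) is DBL_EPSILON from float.h; the function names are hypothetical and are not ENVIMA functions.

```c
#include <float.h>

/* Hypothetical helpers, not ENVIMA functions. */

/* eps2: distance from 1.0 to the next representable double above it. */
double eps_above(void) { return DBL_EPSILON; }

/* eps1 = eps2 / B: distance from 1.0 to the next double below it. */
double eps_below(void) { return DBL_EPSILON / FLT_RADIX; }

/* Returns 1 if adding d to 1.0 yields a double different from 1.0. */
int changes_one(double d)
{
    volatile double r = 1.0 + d;   /* volatile: force rounding to double */
    return r != 1.0;
}
```

Adding eps_above() or subtracting eps_below() moves 1.0 to its neighbour, whereas half of either increment is rounded back to 1.0, which is what the absence of any representable real in the interval means in practice.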

1.1.3. Representation of logicals

Only one parameter is required:

the length in bytes.

1.2. Special values

The real NaN (Not a Number) value can be obtained on machines that support IEEE arithmetic. Historically, on CRAY servers, the value used was UNDEF; it could be applied to floats but also to integers. This value is used in some cases (debugging) to reset memory areas associated with objects managed by JEVEUX.

Assigning the NaN value to variables then makes it possible to detect their use in floating-point operations, because such a use immediately stops the code by raising a signal that a handler can catch.
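
The principle can be sketched in C as follows. The helper names are hypothetical and are not ENVIMA or JEVEUX functions. Note the assumption: the NAN macro of math.h yields a quiet NaN, which by default propagates through operations without raising a signal; stopping the code on first use requires enabling floating-point traps, which is platform-specific and not shown here. The sketch therefore only shows how a NaN-filled area reveals its use.

```c
#include <math.h>    /* NAN macro (C99) */
#include <stddef.h>

/* Hypothetical helpers, not ENVIMA or JEVEUX functions. */

/* Fill a memory area with NaN so that any later use is recognizable. */
void fill_with_nan(double *area, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        area[i] = NAN;
}

/* NaN is the only value that compares unequal to itself. */
int is_unset(double v)
{
    volatile double x = v;
    return x != x;
}

/* Sum of an area: the result is NaN as soon as one entry is NaN. */
double sum_area(const double *area, size_t n)
{
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        s += area[i];
    return s;
}
```

A sum over an area in which one entry was never assigned returns NaN, so the missing initialization shows up in the result instead of going unnoticed.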

1.4. File usage settings

Two settings are available:

the maximum size in bytes of a file,
the maximum size in bytes of all open files.

They were introduced only because of the constraints associated with operating shared resources (on the centralized Aster server, the temporary file space associated with a batch job is limited) and are used to manage the memory manager's binary direct access files.

1.5. Mathematical constants

A set of universal constants (optimal for the requested type) is provided to the user.

These constants are (currently):

the values of \(\pi\) and \(2\pi\),
the absolute zero value for the temperature,
radian/degree and degree/radian conversion parameters.
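
By way of illustration, such constants could be exposed in C as argument-free functions, as sketched below. The function names are hypothetical (they are not the actual ENVIMA names), and the absolute zero is assumed here to be expressed in degrees Celsius.

```c
/* Hypothetical helpers, not the actual ENVIMA functions. */

/* pi and 2*pi written with more digits than a double can hold, so that
   the compiler rounds them to the nearest representable value. */
double const_pi(void)     { return 3.14159265358979323846; }
double const_two_pi(void) { return 6.28318530717958647692; }

/* Absolute zero, assumed here to be expressed in degrees Celsius. */
double const_absolute_zero_celsius(void) { return -273.15; }

/* Conversion factors: multiply an angle by the factor to convert it. */
double deg_to_rad_factor(void) { return const_pi() / 180.0; }
double rad_to_deg_factor(void) { return 180.0 / const_pi(); }
```

An angle of 90 degrees is converted to radians with 90.0 * deg_to_rad_factor(), and back with rad_to_deg_factor().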