6. A few tips for optimizing performance#

Here we provide some tips to help the user take advantage of the diagnostics traced in the message file. It is important, however, to be aware that there is no universal recipe for optimizing the overall performance of a calculation: it depends on the type of study, on the software and hardware characteristics of the machine, and even on its load!

The default settings and the code's displays/alarms offer balanced and calibrated operation. But, to make the best use of the machine's capabilities, the user should remain attentive to the elements described in this document as well as to the advice found throughout the command documentation.

We list below, in a non-exhaustive way, several questions that are worth asking when trying to optimize the performance of a calculation. Of course, some questions (and answers) are cumulative and can therefore be applied simultaneously.

6.1. Regarding the characteristics of the problem#

In view of the elements of §3, a few rules of thumb can be formulated:

        • The larger the **problem size** (\(N\)) and/or the **matrix filling** (\(\mathit{NNZ}\)), the more expensive it is to build and, above all, to solve the linear system (CPU/RAM).

        • Increasing the **proportion of Lagrange multipliers** (\({N}_{L}\mathrm{/}N\)) may make the linear system more difficult to solve (execution time, quality of the solution).

        • The size of the problem **dictates the maximum number of processors** that it is relevant to devote to its parallel calculation: a granularity of at least 20,000 degrees of freedom per MPI process is required (see the sketch just after this list).
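As a quick illustration of this last rule of thumb, here is a small Python sketch (purely illustrative, not a Code_Aster command) that estimates the number of MPI processes worth allocating for a given problem size, using the 20,000 degrees-of-freedom granularity floor quoted above:

```python
# Illustrative helper (not part of Code_Aster): upper bound on the number of
# MPI processes worth allocating, based on the ~20,000 dof/process granularity rule.
def max_useful_mpi_processes(n_dof: int, dof_floor: int = 20_000) -> int:
    """Largest process count that still keeps at least `dof_floor` dof per process."""
    return max(1, n_dof // dof_floor)

# Example: a 1.2 million dof problem
print(max_useful_mpi_processes(1_200_000))  # -> 60 MPI processes at most
```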

6.2. Regarding the time consumed#

To reduce CPU times, the Aster user has various tools:

    • If **most of the cost lies in elementary calculations/assemblies and/or in the resolution of linear systems** (see §4.1), we recommend running *Aster* in parallel mode. It is then preferable to use the MUMPS linear solver in distributed MPI mode or the MULT_FRONT solver in OpenMP. The first strategy also reduces RAM consumption per processor.

    • If you **already use the MUMPS linear solver**, you can deactivate its OOC [12]_ features (GESTION_MEMOIRE='IN_CORE') and improve the quality of the solution (RESI_RELA=-1.d0). If the matrix is well conditioned and/or not symmetric, it is also possible to try the relaxation parameters of the linear solver (FILTRAGE_MATRICE, MIXER_PRECISION, SYME); a sketch is given after this list.

    • If you perform **a nonlinear calculation**, you can test various relaxation parameters of the nonlinear solver (REAC_INCR, REAC_ITER, SYME).

    • If a **modal calculation** is performed, it is recommended to use the IRAM method (METHODE='SORENSEN') and to split the required spectrum into several frequency bands (*via* the CALC_MODES operator with OPTION='BANDE' divided into several sub-bands).

    • In general, the more **the operating mode of JEVEUX** (and MUMPS) stays in IC (see §5.1), the faster the calculation. However, these gains are modest compared to those provided by parallelism and by the choice of a suitable linear solver (and its parameterization).
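To fix ideas, here is a hedged sketch of a command-file fragment combining the MUMPS settings mentioned in the bullets above (IN_CORE management and deactivated quality check). The surrounding STAT_NON_LINE arguments (model, material, loading) are placeholders for an actual study, several mandatory keywords are omitted, and the keyword spellings should be checked against doc. U4.50.01:

```python
# Sketch of a .comm fragment: MUMPS kept fully in core, with the automatic
# post-solve quality check deactivated, as suggested above (placeholder study).
resu = STAT_NON_LINE(
    MODELE=model,                    # placeholders from earlier commands
    CHAM_MATER=mater,
    EXCIT=_F(CHARGE=load),
    SOLVEUR=_F(
        METHODE='MUMPS',             # MPI-distributed direct solver
        GESTION_MEMOIRE='IN_CORE',   # deactivate the OOC unloading of MUMPS
        RESI_RELA=-1.0,              # skip the automatic solution-quality check
    ),
    # ... other mandatory keywords of the study (behaviour, time stepping, ...)
)
```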

For each step of the calculation, we should normally see a low system time (SYST) and a « CPU time + system time » total (CPU + SYST) very close to the real elapsed time (ELAPS). If this is not the case, two classic scenarios may occur:

      • The **USER + SYST** time is much greater than the **ELAPS** time. The calculation most likely uses the OpenMP parallelism of parallel strategies 1c or 2a described in the doc. U2 on parallelism and in §2.2 of doc. U4.50.01. This situation is not worrying (it is even intended), the main thing being that the elapsed time of the calculation is as low as possible!

      • The ELAPS time is much greater than the CPU time and/or the SYST time is significant. The calculation is probably affected by memory constraints (system swap, RAM/disk I/O…).

  • Track 1: this additional cost may come from global unloadings of JEVEUX. To check this, simply read, at the end of the .mess file, the statistics concerning dynamic allocation (see §5.2) or the time consumed by the various unloads (see §4.1). The more calls to the release mechanisms and/or the larger the objects released, the worse the SYST and ELAPS times become. **One solution is then to increase the memory size allocated to JEVEUX**.

  • Track 2: as a corollary of the previous observation, MPI parallel mode can multiply the additional costs due to unloading. The data distribution induced by parallelism does reduce the size of JEVEUX objects (if SOLVEUR/MATR_DISTRIBUEE is activated) and therefore limits the impact of each unload; on the other hand, these unloads may take place at the same time and on neighbouring processors. **A palliative solution** can then consist in « wasting » processors, **by interleaving active MPI processes with dormant ones** (e.g. ncpus value of *Astk* set to 2).

  • Track 3: if we use the MUMPS linear solver in OOC, the problem may come from a large number of forward/backward substitutions during the solve step (cf. step 1.4 of §4.1). They can be limited by deactivating the automatic refinement option (RESI_RELA=-1.d0) or by going back to IC mode (if memory resources allow it).

6.3. Regarding the RAM memory consumed#

  • If the \({\mathit{JE}}_{\mathit{IC}}\) and \({\mathit{JE}}_{\mathit{OOC}}\) consumptions are close (within a few tens of percent), it is because JEVEUX often had to operate in OOC mode: there were probably numerous memory unloads (cf. §4.1/5.2). Computing time can suffer (especially in parallel). You should try to allocate more memory or to increase the number of processors (if option MATR_DISTRIBUEE is activated).

The **memory footprint of the calculation is often set**, on the most memory-intensive operator, by the maximum between the floor value of JEVEUX (\({\mathit{JE}}_{\mathit{OOC}}\)) and the value required by the MUMPS linear solver (if it is used). Several levers can be used to reduce this figure:

  • If you **use MUMPS and it is the dominant consumer** (this is often the case): use more MPI processes (*Astk* parameters mpi_nbcpu/mpi_nbnoeud) or switch to OOC (keyword SOLVEUR/GESTION_MEMOIRE). If the matrix is well conditioned and/or not symmetric, it is also possible to try the relaxation parameters of the linear solver (FILTRAGE_MATRICE, MIXER_PRECISION, SYME).

  • If you **use MUMPS and it is not the dominant consumer**: use the JEVEUX object distribution provided by option MATR_DISTRIBUEE in parallel mode (see the sketch after this list).

  • If you use **another linear solver**: switch to MUMPS in parallel, or even to GCPC/PETSC (in sequential/parallel).
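Conversely, when RAM is the limiting factor, the two MUMPS levers above can be combined in the SOLVEUR keyword of the resolution operator. The sketch below uses MECA_STATIQUE with placeholder arguments; as before, the keyword values should be checked against doc. U4.50.01:

```python
# Sketch of a .comm fragment: parallel MUMPS solve with its factors unloaded
# to disk (OOC) and the JEVEUX matrix objects distributed over MPI processes.
resu = MECA_STATIQUE(
    MODELE=model,                       # placeholders from earlier commands
    CHAM_MATER=mater,
    EXCIT=_F(CHARGE=load),
    SOLVEUR=_F(
        METHODE='MUMPS',
        GESTION_MEMOIRE='OUT_OF_CORE',  # accept disk I/O to lower the RAM peak
        MATR_DISTRIBUEE='OUI',          # distribute JEVEUX matrix objects in MPI
    ),
)
```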

6.4. Regarding parallelism#

If **most of the cost lies in elementary calculations/assemblies and/or in the resolution of linear systems** (see §4.1), we recommend running *Aster* in parallel mode. It is then preferable to use the MUMPS linear solver in distributed MPI mode (cf. doc. U2 on parallelism or U4.50.01) or the MULT_FRONT solver in OpenMP.

  • It is preferable to limit parallelism to the few expensive operators (in time/memory): STAT/DYNA_NON_LINE, CALC_MODES… and therefore, if possible, to divide the study into a succession of sequential pre/post-processing steps and parallel calculations. For long calculations (a few days), this strategy also provides better protection against possible error stops, by saving the base associated with each major stage of the calculation.

It is worthwhile to **validate the parallel calculation** beforehand by comparing a few iterations in sequential mode and in parallel mode. This approach also makes it possible to **calibrate the maximum achievable gains** (theoretical speed-up) and therefore to avoid « wasting too many processors ». Thus, if we denote by \(f\) the parallel portion of the code (determined for example *via* a prior sequential run), the maximum theoretical speed-up \({\mathrm{S}}_{\mathrm{p}}\) accessible on \(p\) processors is given by Amdahl's formula (cf. [R6.01.03] §2.4):

\({\mathrm{S}}_{\mathrm{p}}\mathrm{=}\frac{1}{1\mathrm{-}\mathrm{f}+\mathrm{f}\mathrm{/}\mathrm{p}}\).

For example, if we use the default distributed parallelism (scenarios 1b+2b, cf. doc. U2 on parallelism) and steps 1.3/1.4 and 2 (see §4.1) represent 90% of the sequential time (\(f\mathrm{=}0.90\)), the theoretical speed-up is limited to \({\mathrm{S}}_{\mathrm{\infty }}\mathrm{=}\frac{1}{1\mathrm{-}0.9+0.9\mathrm{/}\mathrm{\infty }}\mathrm{=}10\), regardless of the number of MPI processes allocated!
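This saturation effect can be checked with a few lines of plain Python (illustrative only, not part of Code_Aster):

```python
# Amdahl's law: maximum theoretical speed-up for a parallel fraction f on p processes.
def amdahl_speedup(f: float, p: int) -> float:
    """S_p = 1 / (1 - f + f / p)."""
    return 1.0 / (1.0 - f + f / p)

# Example from the text: f = 0.90 (steps 1.3/1.4 and 2 of §4.1)
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.90, p), 2))
# 4 -> 3.08, 16 -> 6.4, 64 -> 8.77, 1024 -> 9.91: the speed-up saturates near 10.
```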

To optimize parallel computing, it is necessary to monitor any **load imbalances** in the data flow (CPU and memory), and to limit the additional costs due to **memory unloading** (JEVEUX and MUMPS OOC) (see §4.1) and to **field archiving** (see §4.3). To save computation time, it is also necessary to avoid any database compaction procedure (keyword RETASSAGE of command FIN), which is counterproductive in parallel.

The **use of MPI parallelism** by MUMPS saves CPU time (on the parallelized steps) and RAM: thanks to the distribution of JEVEUX data (if option MATR_DISTRIBUEE is activated) and, above all, to that of the MUMPS objects.

  • Some empirical figures: we recommend allocating at least 20,000 degrees of freedom per MPI process; a standard thermo-mechanical calculation generally benefits, on 32 processors, from a gain of the order of 10 in elapsed time and a factor of 4 in RAM.