Technical Requirements
This section lists the technical requirements that DLB users should be aware of. Not all of them are mandatory; which ones apply depends on the application and the programming model used. Knowing all of them, however, is important for a successful DLB execution.
Programming models
DLB was originally designed to alleviate load imbalance issues in HPC applications using two programming models simultaneously: MPI for the outer level of parallelism, and any thread-based programming model for the inner level, such as OpenMP or OmpSs.
Technically, however, neither of these is strictly required. An application can manage its own threads and still use DLB, as long as it communicates with the library to share resources dynamically. MPI is not required either: DLB supports resource sharing between different processes, even if they do not belong to the same application (or they do, but DLB is not aware of it). Nevertheless, DLB relies on MPI to detect which processes are more unbalanced; without MPI, the application has to provide this information to DLB through its API.
Preload mechanism for MPI applications
DLB implements the MPI interface to capture each MPI call made by the application, and then calls the original MPI function using the PMPI profiling interface.
To enable MPI support for DLB, the application must either be linked or
preloaded with the DLB MPI library libdlb_mpi.so. Using
the preload mechanism has the advantage that the application binary is the
same as the original, and it only requires setting an environment variable
before execution.
Use the LD_PRELOAD environment variable to specify the DLB MPI library.
The library libdlb_mpi.so should intercept both C and Fortran MPI symbols.
If the MPI calls are not intercepted correctly, check the DLB configure flags
for building separate libraries.
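As a minimal sketch of the preload approach, the commands below preload the DLB MPI library for a single run. The install prefix /opt/dlb, the application name ./app, and the mpirun launcher with two processes are assumptions made for illustration; adjust them to your installation. Depending on the MPI launcher, the variable may also need to be forwarded explicitly to remote nodes.

```shell
# Hypothetical install prefix; override DLB_PREFIX to match your system.
DLB_PREFIX="${DLB_PREFIX:-/opt/dlb}"
DLB_MPI_LIB="$DLB_PREFIX/lib/libdlb_mpi.so"

# Preload the library only for the launched command, so that the rest of
# the shell session is not affected. ./app is a placeholder MPI binary.
if [ -f "$DLB_MPI_LIB" ] && command -v mpirun >/dev/null 2>&1; then
    LD_PRELOAD="$DLB_MPI_LIB" mpirun -n 2 ./app
else
    echo "libdlb_mpi.so or mpirun not found; nothing was run" >&2
fi
```

Setting the variable on the command line rather than exporting it keeps unrelated commands in the same shell from loading the library.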
Non-busy-waiting for MPI calls
Typically, MPI implementations offer at least two ways of waiting inside a blocking MPI call: busy-waiting and non-busy-waiting, and sometimes a parameterised hybrid of the two. In busy-waiting mode, the thread calling the MPI function repeatedly checks whether the MPI call has finished, while in non-busy-waiting mode the thread uses other techniques to block its execution without keeping the current CPU busy.
By default, DLB shares all CPUs whose associated thread is blocked in MPI.
This means that even if a thread is in MPI busy-waiting mode, this CPU can be
assigned to another process. To avoid sharing the CPU while it is executing an
MPI blocking call, use the --lewi-keep-one-cpu option.
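The following sketch shows one way the option could be passed at run time. It assumes that DLB reads its options from the DLB_ARGS environment variable and that --lewi is the flag that enables CPU sharing; verify both against the options reference of your DLB version. The install prefix /opt/dlb and the application ./app are hypothetical.

```shell
# Assumption: DLB options are read from DLB_ARGS, and --lewi enables the
# CPU-sharing module. --lewi-keep-one-cpu keeps the CPU of a thread that
# is blocked in MPI from being shared.
export DLB_ARGS="--lewi --lewi-keep-one-cpu"

# Hypothetical install prefix and application name.
DLB_MPI_LIB="${DLB_PREFIX:-/opt/dlb}/lib/libdlb_mpi.so"

if [ -f "$DLB_MPI_LIB" ] && command -v mpirun >/dev/null 2>&1; then
    LD_PRELOAD="$DLB_MPI_LIB" mpirun -n 2 ./app
else
    echo "libdlb_mpi.so or mpirun not found; nothing was run" >&2
fi
```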
Parallel regions in OpenMP
OpenMP (5.2 and below) parallelism is based on a fork-join model where threads are created (virtually or assigned to a team) when the running thread encounters a parallel construct.
This model limits DLB because it can only assign CPUs to a process just before a parallel construct. For this reason, an application with very few parallel constructs may get minimal benefit from DLB.
This limitation may be overcome with the upcoming OpenMP 6.0 and its free-agent threads, where tasks can be assigned to threads that may be created later and that are not part of a parallel region.