Components Overview

DLB is a collection of tools to improve the load balance of hybrid HPC applications (i.e., applications with two levels of parallelism).

It provides components such as LeWI and DROM, which can change the resource configuration at run time. It also includes the TALP profiler, which makes it easy to capture performance metrics.

To understand which applications can benefit from LeWI or DROM, consider the structure of a typical hybrid HPC application. As depicted in the figure below, such an application runs processes on several nodes, and each process spawns several threads.

_images/hpc_app.png

Typical hybrid application structure

DLB focuses on improving the load balance of the outer level of parallelism (e.g., MPI) by redistributing the computational resources at the inner level of parallelism (e.g., OpenMP). This readjustment of resources is done dynamically at run time.

This dynamism allows DLB to react to different sources of imbalance, such as the algorithm, the data, the hardware architecture, and resource availability, among others.

LeWI: Lend When Idle

The main load balancing algorithm used in DLB is called LeWI (Lend When Idle). The idea is to take computational resources that are not doing useful computation and use them to speed up other processes on the same compute node.

To achieve this, DLB lends the CPUs of a process waiting in a blocking MPI call to another process running on the same node.

_images/LeWI.png

Application without LeWI (left) and with LeWI (right)
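
The effect of lending can be illustrated with a small toy model. The sketch below is not DLB code and does not use the DLB API: the function names, the unit of work (CPU-seconds), and the assumptions of perfectly divisible work and instantaneous lending are all hypothetical simplifications chosen for illustration.

```python
def makespan_static(work, cpus):
    """Time to finish a phase when every process keeps only its own CPUs.

    work: CPU-seconds of computation per process (hypothetical unit)
    cpus: number of CPUs owned by each process
    """
    return max(w / c for w, c in zip(work, cpus))


def makespan_lend_when_idle(work, cpus):
    """Time to finish when finished processes lend their CPUs to busy ones.

    Toy assumptions: work scales perfectly with CPU count, lending is
    instantaneous, and idle CPUs are split evenly among busy processes.
    """
    remaining = list(work)
    elapsed = 0.0
    active = [i for i, w in enumerate(remaining) if w > 0]
    while active:
        # CPUs owned by processes that already finished (i.e., are idle).
        idle = sum(c for i, c in enumerate(cpus) if i not in active)
        share = idle / len(active)
        speed = {i: cpus[i] + share for i in active}
        # Advance time until the next active process finishes.
        step = min(remaining[i] / speed[i] for i in active)
        elapsed += step
        for i in active:
            remaining[i] -= speed[i] * step
        active = [i for i in active if remaining[i] > 1e-12]
    return elapsed


# Two processes on one node, 4 CPUs each, with unbalanced work.
work = [8.0, 24.0]
cpus = [4, 4]
print(makespan_static(work, cpus))
print(makespan_lend_when_idle(work, cpus))
```

With these inputs the static run takes 6.0 time units, because the lightly loaded process sits idle for most of the phase, while the lending run takes 4.0: once the first process finishes, its four CPUs help the second. A real run will gain less than this idealised model suggests, since applications do not scale perfectly and the lender reclaims its CPUs when it leaves the blocking call.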

DROM: Dynamic Resource Ownership Manager

DROM offers an API to change the computational resources assigned to a process at run time. This can be useful if the application detects that it cannot use these resources efficiently and decides to release some of them. The component can also be used by an external entity, such as a job scheduler or resource manager, to reallocate resources and mitigate coarse-grained load imbalance.

_images/drom.png

DROM moving CPU resources between applications
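
The idea of moving CPU ownership between processes can be sketched with a small bookkeeping model. This is a toy illustration only, not the DROM API: the `Node` class, the `move_cpus` method, the process names, and the rule that a process keeps at least one CPU are all hypothetical choices made for this example.

```python
class Node:
    """Toy registry of which process owns which CPUs on one node."""

    def __init__(self, ownership):
        # ownership maps a process name to an iterable of CPU ids.
        self.ownership = {pid: set(cpus) for pid, cpus in ownership.items()}

    def move_cpus(self, src, dst, count):
        """Transfer `count` CPUs from process `src` to process `dst`.

        Assumption of this sketch: the source must keep at least one CPU.
        Returns the list of CPU ids that changed owner.
        """
        owned = self.ownership[src]
        if count < 1 or count >= len(owned):
            raise ValueError("count must leave the source at least one CPU")
        moved = sorted(owned)[-count:]  # take the highest-numbered CPUs
        owned.difference_update(moved)
        self.ownership.setdefault(dst, set()).update(moved)
        return moved


# An external manager shrinks app_a and grows app_b on a 16-CPU node.
node = Node({"app_a": range(0, 8), "app_b": range(8, 16)})
moved = node.move_cpus("app_a", "app_b", 4)
print(moved)
print(sorted(node.ownership["app_a"]))
```

In this example CPUs 4 to 7 move from `app_a` to `app_b`, leaving `app_a` with CPUs 0 to 3. In a real system, the affected processes would also have to adapt their thread count and affinity to the new set of CPUs, which this model does not represent.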

TALP: Tracking Application Live Performance

TALP is another module included in DLB that measures parallel efficiency and other performance metrics. The data collected by TALP is available at run time during execution, or as a report at the end.

_images/talp.png

TALP usage example: The collected metrics can be reported through different methods.
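
To give a sense of what a parallel efficiency figure expresses, the sketch below computes it from per-process timings. It assumes a commonly used multiplicative definition (parallel efficiency as the product of load balance and communication efficiency); this is an assumption for illustration, not a statement of TALP's exact formulas, and the function name `efficiency_metrics` and the sample timings are hypothetical.

```python
def efficiency_metrics(useful_times, elapsed):
    """Compute simple efficiency metrics from per-process timings.

    useful_times: seconds each process spent in useful computation
    elapsed: wall-clock seconds of the measured region
    Assumed definitions (not taken from TALP itself):
      load balance             = average useful time / maximum useful time
      communication efficiency = maximum useful time / elapsed time
      parallel efficiency      = load balance * communication efficiency
    """
    avg_useful = sum(useful_times) / len(useful_times)
    max_useful = max(useful_times)
    load_balance = avg_useful / max_useful
    communication_efficiency = max_useful / elapsed
    return {
        "load_balance": load_balance,
        "communication_efficiency": communication_efficiency,
        "parallel_efficiency": load_balance * communication_efficiency,
    }


# Four processes measured over a 16-second region (made-up numbers).
metrics = efficiency_metrics([4.0, 8.0, 4.0, 8.0], elapsed=16.0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, a low load balance value points to uneven work between processes, which is the kind of imbalance LeWI targets, while a low communication efficiency points to time lost outside useful computation even on the busiest process.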