DLB_Barrier
The DLB Barrier is a synchronisation method, like an MPI Barrier but at the node level and without a communicator. Any thread that reaches the barrier will stop its execution and cannot proceed until all participants in the node have reached the barrier.
Its advantage over an MPI Barrier is that it uses a pthread_barrier on the
DLB shared memory to minimise the number of CPU cycles consumed while the
process is blocked. MPI implementations usually have options to make MPI
blocking calls less CPU intensive, but firstly they still consume more CPU
cycles than a pthread_barrier, and secondly it may negatively affect the
response time of all MPI calls in the application.
Use cases
Its first use case is to enable LeWI without having to intercept MPI calls. When LeWI is enabled, DLB barriers are also treated as blocking calls and DLB will try to reallocate resources blocked in the barrier as needed.
Another use case is to synchronise coupled applications that do not need to work concurrently and could potentially overlap the same computational resources. Especially in HPC, it is common to have two applications running time steps in an interleaved manner, where a time step of one application provides the input or needs the output for the other application. MPI Barrier can be used in these cases, but when applications overlap, the blocked application can cause a lot of noise by oversubscribing the CPU.
See distributed examples for two examples of the DLB Barrier, one with a regular barrier and another with named barriers. In both examples, two applications overlap on the same resources and synchronise their computation phase while the other is blocked in the barrier.
API usage
Basic usage: invoke DLB_Barrier() to perform a barrier for all the
processes in the node:
#include <dlb.h>
// Requires DLB initialisation either via:
// - API: DLB_Init(...)
// - MPI interception: LD_PRELOAD=libdlb_mpi.so
// Block current process until all the other process in the node attached
// to the DLB Barrier system reach the same barrier.
DLB_Barrier();
By default, all processes that have initialised DLB are attached to the barrier
system, i.e. all processes are participants in the barrier. Sometimes in MPI
applications, a process may behave differently from other MPI ranks and may not
execute the same code as the others. In this case, and to avoid deadlocks, a
process may detach itself from the system with DLB_BarrierDetach() so that
the other participants do not have to wait for it. To undo the change, a
process may call DLB_BarrierAttach() to become a member of the barrier
again:
if (this_rank_behaves_differently) {
DLB_BarrierDetach();
// This process will no longer participate in DLB_Barrier calls
}
...
if (revert_detach) {
DLB_BarrierAttach();
// If previously detached, this process will again participate in barriers
}
It can also be useful to have different barriers if you have different groups of processes in the same node doing the same thing. Similar to MPI communicators, named barriers will only wait for the processes that have registered the barrier with the same name:
// Register a barrier named "Another barrier". Named barriers can
// also be configured to not enable LeWI when the process is blocked.
dlb_barrier_t *named_barrier = DLB_BarrierNamedRegister(
"Another barrier", DLB_BARRIER_LEWI_OFF);
// Block current process until all the other process in the node that have
// registered a barrier named "Another barrier" reach the same barrier.
DLB_BarrierNamed(named_barrier);
DLB Barriers are blocking calls and as such sometimes provide a good scenario
for using LeWI. A named barrier can be configured to either
not use LeWI when blocked (DLB_BARRIER_LEWI_OFF), to use LeWI when possible
(DLB_BARRIER_LEWI_ON), or to decide at runtime (DLB_BARRIER_LEWI_RUNTIME).
In the latter case, DLB_ARGS can be extended with the option
--lewi-barrier-select=comma,separated,list,of,barriers to specify the names
of the barriers that will enable LeWI when blocked:
// Register a barrier named "yet_another_barrier". It will be decided at
// runtime whether this barrier enables LeWI.
dlb_barrier_t *named_barrier = DLB_BarrierNamedRegister(
"yet_another_barrier", DLB_BARRIER_LEWI_RUNTIME);
// Block on named barrier.
DLB_BarrierNamed(named_barrier);
Note
If --lewi-barrier-select is to be used, selected barriers do not accept
names with spaces. Refer to LewI option flags for more info.
We also provide a Fortran API for the DLB Barrier:
use iso_c_binding
implicit none
include 'dlbf.h'
integer :: ierr
type(c_ptr) :: dlb_named_barrier
...
! Call regular barrier function
ierr = DLB_Barrier()
...
! Register and call named barrier
dlb_named_barrier = DLB_BarrierNamedRegister(&
c_char_"named_barrier"//C_NULL_CHAR, DLB_BARRIER_LEWI_OFF)
ierr = DLB_BarrierNamed(dlb_named_barrier)
And for the Python API:
import dlb
...
# Call regular barrier function
dlb.DLB_Barrier()
...
# Register and call named barrier
handle = dlb.DLB_BarrierNamedRegister("named_barrier", dlb.DLB_BARRIER_LEWI_OFF)
dlb.DLB_BarrierNamed(handle)