TALP for monitoring the Programming Model efficiencies

TALP is a low-overhead profiling tool for collecting performance metrics from applications using MPI, OpenMP, OpenACC, CUDA, and HIP. Support for some of these programming models is still experimental.

These metrics can be reported at the end of the execution or queried at runtime through the API, which also lets users define custom monitoring regions and gain detailed insight into the performance of specific parts of the code.

An in-depth explanation of the metrics computed by TALP can be found here.

Here you can get an overview of the different runtime options available when using TALP.

A good way to get started is to just run your application and let TALP report the metrics at the end of the execution.

If you already have some executions of your application using TALP, you might want to check out TALP-Pages, which can generate plots from your JSON files.

Reporting POP metrics at the end of the execution

After installing DLB, you can use TALP to report metrics at the end of the execution. The exact setup depends on the programming models your application uses.

Note

Note that the flags shown below are the minimum requirements for TALP to report metrics at the end of the execution. You can also use --talp-output-file to generate CSV or JSON formatted files. More info in the options section below.

MPI-only executions

To gather and report the MPI-based performance metrics, you can run your application with libdlb_mpi pre-loaded and activate TALP by setting the DLB_ARGS environment variable:

DLB_PREFIX="<path-to-DLB-installation>"

export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo

You will get a report similar to this on stderr at the end of the execution:

DLB[<hostname>:<pid>]: ############### Monitoring Region POP Metrics ###############
DLB[<hostname>:<pid>]: ### Name:                                     Global
DLB[<hostname>:<pid>]: ### Elapsed Time:                             5 s
DLB[<hostname>:<pid>]: ### Average IPC:                              0.23
DLB[<hostname>:<pid>]: ### Parallel efficiency:                      0.40
DLB[<hostname>:<pid>]: ### MPI Parallel efficiency:                  0.67
DLB[<hostname>:<pid>]: ###   - MPI Communication efficiency:         0.83
DLB[<hostname>:<pid>]: ###   - MPI Load Balance:                     0.80
DLB[<hostname>:<pid>]: ###       - MPI Load Balance in:              0.80
DLB[<hostname>:<pid>]: ###       - MPI Load Balance out:             1.00
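If the report is captured from stderr (for example, redirected to a file), the values can be extracted with a few lines of Python. The sketch below assumes the exact line layout of the sample above. `parse_talp_report` is a hypothetical helper, not part of DLB, and for machine-readable results the --talp-output-file option is likely the more robust route.

```python
import re

# Matches report lines such as:
#   DLB[node1:42]: ###   - MPI Load Balance:                     0.80
# Assumption: the layout follows the sample report shown above.
LINE_RE = re.compile(
    r"^DLB\[[^\]]*\]: ###\s*(?:-\s*)?(?P<key>[^:]+):\s+(?P<value>\S.*)$"
)


def parse_talp_report(text):
    """Return a dict mapping metric names to floats (or strings)."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # banner lines and unrelated output are skipped
        key = match.group("key").strip()
        value = match.group("value").strip()
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # e.g. "Global" or "5 s"
    return metrics
```

Metric names are used as dictionary keys, so this only works for reports in which each name appears once per region.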

OpenMP-only executions

As TALP relies on the OMPT interface to inspect runtime behavior of OpenMP, the runtime implementation needs to support this. Below you can find a table of currently supported runtimes and the respective compilers.

Compiler / Runtime	Support for TALP OpenMP metrics
LLVM-compilers like `clang,clang++`	supported with version `>=8.0.0`
Intel Classic-compilers `icc,icpc`	supported with version `>=19.0.1`
Intel LLVM-compilers `icx,icpx`	supported
GNU compilers `gcc,g++,gfortran`	not supported (see note below)
Cray clang compilers `craycc,crayc++`	supported with version `>=17.0.0`

If you are using any supported compiler in the table above to build your application, you can execute your application like this to gather OpenMP performance metrics with TALP:

DLB_PREFIX="<path-to-DLB-installation>"

export DLB_ARGS="--talp"
env LD_PRELOAD="$DLB_PREFIX/lib/libdlb.so" ./foo

Note

Since LLVM’s OpenMP runtime supports the GOMP interface, you can compile and link applications with GCC and explicitly link against the LLVM OpenMP runtime using:

gcc your_app.c -L<llvm_openmp_lib_dir> -Wl,-rpath,<llvm_openmp_lib_dir> -lomp

This allows you to use the LLVM OpenMP runtime even when compiling with GCC.

Note

The compilers listed here are relevant only for their default OpenMP runtimes. This does not refer to the compiler used to build the DLB library.

Hybrid (OpenMP+MPI) executions

For hybrid applications, TALP also needs an OpenMP runtime with OMPT support. Please make sure that you use a supported compiler to build your application.

The DLB_ARGS to configure TALP for hybrid applications are the same as for OpenMP ones, but this time libdlb_mpi.so is preloaded instead:

DLB_PREFIX="<path-to-DLB-installation>"

export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo

After your program has finished, you will get a report similar to this on stderr:

DLB[<hostname>:<pid>]: ############### Monitoring Region POP Metrics ###############
DLB[<hostname>:<pid>]: ### Name:                                     Global
DLB[<hostname>:<pid>]: ### Elapsed Time:                             5 s
DLB[<hostname>:<pid>]: ### Average IPC:                              0.23
DLB[<hostname>:<pid>]: ### Parallel efficiency:                      0.40
DLB[<hostname>:<pid>]: ### MPI Parallel efficiency:                  0.67
DLB[<hostname>:<pid>]: ###   - MPI Communication efficiency:         0.83
DLB[<hostname>:<pid>]: ###   - MPI Load Balance:                     0.80
DLB[<hostname>:<pid>]: ###       - MPI Load Balance in:              0.80
DLB[<hostname>:<pid>]: ###       - MPI Load Balance out:             1.00
DLB[<hostname>:<pid>]: ### OpenMP Parallel efficiency:               0.60
DLB[<hostname>:<pid>]: ###   - OpenMP Load Balance:                  0.80
DLB[<hostname>:<pid>]: ###   - OpenMP Scheduling efficiency:         1.00
DLB[<hostname>:<pid>]: ###   - OpenMP Serialization efficiency:      0.75
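The indentation of the report suggests a hierarchy, and the sample values are consistent with each parent metric being the product of its direct children. The following sketch checks that reading against the numbers above. It is an observation about this sample, not a definition; the dictionary keys borrow the field names of dlb_pop_metrics_t, and the tolerance is an assumption that absorbs the two-decimal rounding of the report.

```python
import math

# Values copied from the sample hybrid report above.
report = {
    "parallel_efficiency": 0.40,
    "mpi_parallel_efficiency": 0.67,
    "mpi_communication_efficiency": 0.83,
    "mpi_load_balance": 0.80,
    "mpi_load_balance_in": 0.80,
    "mpi_load_balance_out": 1.00,
    "omp_parallel_efficiency": 0.60,
    "omp_load_balance": 0.80,
    "omp_scheduling_efficiency": 1.00,
    "omp_serialization_efficiency": 0.75,
}

# Parent metric -> children, following the indentation of the report.
HIERARCHY = {
    "parallel_efficiency": ["mpi_parallel_efficiency", "omp_parallel_efficiency"],
    "mpi_parallel_efficiency": ["mpi_communication_efficiency", "mpi_load_balance"],
    "mpi_load_balance": ["mpi_load_balance_in", "mpi_load_balance_out"],
    "omp_parallel_efficiency": [
        "omp_load_balance",
        "omp_scheduling_efficiency",
        "omp_serialization_efficiency",
    ],
}


def check_hierarchy(metrics, tolerance=0.01):
    """Return {parent: (reported, product, within_tolerance)}."""
    result = {}
    for parent, children in HIERARCHY.items():
        product = math.prod(metrics[child] for child in children)
        reported = metrics[parent]
        result[parent] = (reported, product, abs(reported - product) <= tolerance)
    return result


for name, (reported, product, ok) in check_hierarchy(report).items():
    print(f"{name}: reported={reported:.2f} product={product:.3f} ok={ok}")
```

Read this way, the lowest factor in each branch (here the OpenMP serialization efficiency and the MPI load balance) points to where the efficiency is being lost.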

NVIDIA GPU executions

For GPU applications running on NVIDIA devices, DLB must be previously configured with --with-cuda (see DLB configure flags) so that it can locate the appropriate CUDA and CUPTI libraries.

Usage is similar to the previous examples. Enable TALP with --talp and the CUPTI backend plugin will be automatically loaded if it was configured and built. Additionally, if the application is not an MPI program, DLB may need to be auto-initialized with a special environment variable:

# Pure GPU execution, DLB needs to be auto-initialized
export DLB_AUTO_INIT=1
export DLB_ARGS="--talp"
LD_PRELOAD="$DLB_PREFIX/lib/libdlb.so" ./foo

# Hybrid MPI+GPU execution, preload libdlb_mpi.so as usual, DLB_AUTO_INIT not needed
export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo

A hybrid MPI+GPU execution will result in an output similar to this:

DLB[<hostname>:<pid>]: ############### Monitoring Region POP Metrics ################
DLB[<hostname>:<pid>]: ### Name:                                     Global
DLB[<hostname>:<pid>]: ### Elapsed Time:                             10.16 s
DLB[<hostname>:<pid>]: ### Host
DLB[<hostname>:<pid>]: ### ----
DLB[<hostname>:<pid>]: ### Parallel efficiency:                      0.99
DLB[<hostname>:<pid>]: ###  - MPI Parallel efficiency:               1.00
DLB[<hostname>:<pid>]: ###     - Communication efficiency:           1.00
DLB[<hostname>:<pid>]: ###     - Load Balance:                       1.00
DLB[<hostname>:<pid>]: ###        - In:                              1.00
DLB[<hostname>:<pid>]: ###        - Out:                             1.00
DLB[<hostname>:<pid>]: ###  - Device Offload efficiency:             0.99
DLB[<hostname>:<pid>]: ###
DLB[<hostname>:<pid>]: ### NVIDIA Device
DLB[<hostname>:<pid>]: ### -------------
DLB[<hostname>:<pid>]: ### Parallel efficiency:                      0.50
DLB[<hostname>:<pid>]: ###  - Load Balance:                          1.00
DLB[<hostname>:<pid>]: ###  - Communication efficiency:              1.00
DLB[<hostname>:<pid>]: ###  - Orchestration efficiency:              0.50

AMD GPU executions

For GPU applications running on AMD devices, DLB must be previously configured with --with-rocm (see DLB configure flags) so that it can locate the appropriate ROCm libraries.

Once DLB is built with ROCm support, GPU profiling backends are automatically detected and enabled at runtime when available. Therefore, users only need to enable TALP with --talp and DLB will attempt to load the appropriate backend.

Internally, DLB may use one of two ROCm profiling interfaces depending on the ROCm version available on the system:

rocprofiler-sdk: uses the officially supported librocprofiler-sdk, available starting with ROCm 6.2. This is the recommended method for collecting performance metrics on AMD GPUs.
rocprofilerv2: uses ROCm’s deprecated librocprofilerv2. While still supported by DLB for compatibility with older ROCm installations, it may fail to obtain metrics in some cases.

The appropriate backend is automatically selected at runtime.

# Pure GPU execution (DLB_AUTO_INIT is not needed for ROCm >= 6.2)
export DLB_ARGS="--talp"
LD_PRELOAD="$DLB_PREFIX/lib/libdlb.so" ./foo

# Hybrid MPI+GPU execution
export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo

Hybrid MPI+GPU executions on AMD devices produce output comparable to the NVIDIA example shown above. A separate example is omitted for brevity.
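The launch recipes in the sections above differ only in which library is preloaded, whether mpirun is involved, and whether DLB_AUTO_INIT is set. A launcher script could capture that as in the sketch below. `talp_launch` and its parameters are hypothetical names introduced for illustration; only the environment variables and library names come from the examples above.

```python
import os


def talp_launch(prefix, app, uses_mpi, mpirun_options=(), auto_init=False):
    """Return (argv, env) for running `app` with TALP enabled.

    prefix         -- DLB installation path (DLB_PREFIX in the examples above)
    uses_mpi       -- True preloads libdlb_mpi.so through mpirun
    mpirun_options -- extra arguments for mpirun (hypothetical parameter)
    auto_init      -- True sets DLB_AUTO_INIT=1, as in the pure NVIDIA GPU case
    """
    env = dict(os.environ, DLB_ARGS="--talp")
    if auto_init:
        env["DLB_AUTO_INIT"] = "1"

    library = "libdlb_mpi.so" if uses_mpi else "libdlb.so"
    preload = os.path.join(prefix, "lib", library)

    if uses_mpi:
        # Mirrors: mpirun <options> env LD_PRELOAD=... ./foo
        argv = ["mpirun", *mpirun_options, "env", f"LD_PRELOAD={preload}", app]
    else:
        # Mirrors: LD_PRELOAD=... ./foo
        env["LD_PRELOAD"] = preload
        argv = [app]
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)`. Keeping LD_PRELOAD out of the mpirun environment follows the examples above, where only the application is started with the library preloaded.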

Defining custom monitoring regions

TALP utilizes monitoring regions to track and report performance metrics. A monitoring region is a designated section of code marked for tracking. Initially, TALP defines a default monitoring region, called “Global”, which spans from DLB_Init to DLB_Finalize. Additionally, users can create custom monitoring regions through the DLB API.

Note

The region between DLB_Init and DLB_Finalize can vary depending on the initialisation method used, whether it’s automatic initialisation with MPI or OpenMP, or direct initialisation through the DLB API.

A monitoring region can be registered using the DLB_MonitoringRegionRegister function. Multiple calls with the same non-null char pointer will return the same region. The region does not begin until the function DLB_MonitoringRegionStart is called, and must end with the function DLB_MonitoringRegionStop. A monitoring region may be paused and resumed multiple times. All user-defined regions should be stopped before MPI_Finalize.

Here are a few restrictions for naming monitoring regions:

The name “Global” (case-insensitive) is reserved and cannot be used for any user-defined region. If the user attempts to register a region with this name, a pointer to the global region will be returned.
The name “all” (case-insensitive) is reserved and cannot be used. Attempting to register a region with this name will result in an error.
For user-defined regions, the name is case-sensitive, can contain up to 128 characters, and may include spaces (though spaces must be avoided when using the flag --talp-region-select, as explained below).
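These naming rules can be checked before a name ever reaches DLB. The sketch below is a hypothetical pre-flight check written from the rules listed above; it does not call DLB, and DLB's own validation may differ in details such as how the 128-character limit is counted.

```python
RESERVED_NAMES = {"global", "all"}  # reserved, compared case-insensitively
MAX_NAME_LENGTH = 128


def check_region_name(name):
    """Return (errors, warnings) for a candidate user-defined region name."""
    errors = []
    warnings = []
    if name.lower() in RESERVED_NAMES:
        errors.append(f"'{name}' is a reserved region name")
    if len(name) > MAX_NAME_LENGTH:
        errors.append(f"name is longer than {MAX_NAME_LENGTH} characters")
    if " " in name:
        # Allowed, but such a region cannot be listed in --talp-region-select.
        warnings.append("name contains spaces")
    return errors, warnings
```

For example, `check_region_name("Region 1")` reports no errors but one warning, since the name is valid yet cannot be listed in --talp-region-select.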

Basic usage examples for C:

#include <dlb_talp.h>
...
dlb_monitor_t *monitor = DLB_MonitoringRegionRegister("Region 1");
...
while (...) {
    ...
    /* Resume region */
    DLB_MonitoringRegionStart(monitor);
    ...
    /* Pause region */
    DLB_MonitoringRegionStop(monitor);
}

Basic usage for Fortran:

use iso_c_binding
implicit none
include 'dlbf_talp.h'
type(c_ptr) :: dlb_handle
integer :: err
...
dlb_handle = DLB_MonitoringRegionRegister(c_char_"Region 1"//C_NULL_CHAR)
...
do ...
    ! Resume region
    err = DLB_MonitoringRegionStart(dlb_handle)
    ...
    ! Pause region
    err = DLB_MonitoringRegionStop(dlb_handle)
enddo

And for Python:

import dlb
...
monitor = dlb.DLB_MonitoringRegionRegister("Region 1")
...
for ... :
    # Resume region
    dlb.DLB_MonitoringRegionStart(monitor)
    ...
    # Pause region
    dlb.DLB_MonitoringRegionStop(monitor)

For each defined monitoring region, including the global region, TALP will print or write a summary at the end of the execution.

Note

See Example 3 of How to run with DLB for more information on compiling and linking with the DLB library.

Inspecting monitoring regions within the source code

The struct dlb_monitor_t is defined in dlb_talp.h. Its fields can be accessed at any time, although to guarantee that the values are up to date the region needs to be stopped.

For Fortran codes, the struct can be accessed as in this example:

use iso_c_binding
implicit none
include 'dlbf_talp.h'
type(c_ptr) :: dlb_handle
type(dlb_monitor_t), pointer :: dlb_monitor
integer :: err
character(16), pointer :: monitor_name
...
dlb_handle = DLB_MonitoringRegionRegister(c_char_"Region 1"//C_NULL_CHAR)
err = DLB_MonitoringRegionStart(dlb_handle)
err = DLB_MonitoringRegionStop(dlb_handle)
...
call c_f_pointer(dlb_handle, dlb_monitor)
call c_f_pointer(dlb_monitor%name_, monitor_name)
print *, monitor_name
print *, dlb_monitor%num_measurements
print *, dlb_monitor%elapsed_time

And for Python codes:

import dlb

dlb_handle = dlb.DLB_MonitoringRegionRegister("Region 1")

dlb.DLB_MonitoringRegionStart(dlb_handle)
dlb.DLB_MonitoringRegionStop(dlb_handle)

monitor = dlb_handle.contents

print(monitor.name.decode())
print(monitor.num_measurements)
print(monitor.elapsed_time)

Special values for monitoring regions

The special values DLB_GLOBAL_REGION and DLB_LAST_OPEN_REGION can be used in any TALP function to refer to these contexts without needing to pass the region handle explicitly:

#include <string>
#include <dlb_talp.h>

// Helper class to create a region for the current scope
struct Profiler {
    Profiler(const std::string& name) {
        dlb_monitor_t *monitor = DLB_MonitoringRegionRegister(name.c_str());
        DLB_MonitoringRegionStart(monitor);
    }

    ~Profiler() {
        DLB_MonitoringRegionStop(DLB_LAST_OPEN_REGION);
    }
};

void foo() {
    // Everything in this scope is recorded as "Region 1"
    {
        Profiler p("Region 1");
        ...
    }

    // Everything in this scope is recorded as "Region 2"
    {
        Profiler p("Region 2");
        ...
    }

    // Print current Global region metrics
    DLB_MonitoringRegionReport(DLB_GLOBAL_REGION);
}

In Fortran, DLB_GLOBAL_REGION is defined as type(c_ptr) and can be used similarly to how it’s used in C. Additionally, DLB_GLOBAL_REGION_INT and DLB_LAST_OPEN_REGION_INT are defined as integer(kind=c_intptr_t) and must be converted to type(c_ptr) using the F90 intrinsic procedure transfer:

! Print current Global region metrics
err = DLB_MonitoringRegionReport(DLB_GLOBAL_REGION)

! Equivalent, using integer(c_intptr_t)
err = DLB_MonitoringRegionReport(transfer(DLB_GLOBAL_REGION_INT, c_null_ptr))

! Start region and stop
err = DLB_MonitoringRegionStart(dlb_handle)
err = DLB_MonitoringRegionStop(transfer(DLB_LAST_OPEN_REGION_INT, c_null_ptr))

In Python, values DLB_GLOBAL_REGION and DLB_LAST_OPEN_REGION are used the same way as in C codes, without needing to pass the region handle explicitly:

import dlb

# Print current Global region metrics
dlb.DLB_MonitoringRegionReport(dlb.DLB_GLOBAL_REGION)

# Start region and stop
dlb.DLB_MonitoringRegionStart(dlb_handle)
dlb.DLB_MonitoringRegionStop(dlb.DLB_LAST_OPEN_REGION)
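The scoped C++ helper shown earlier has a natural Python counterpart in a context manager. The sketch below is one possible way to write it, not something shipped with DLB. To keep it runnable without DLB installed, the API is passed in as a parameter: in a real program `api` would be the `dlb` module, while `FakeDlbApi` is a hypothetical stand-in that only records calls.

```python
from contextlib import contextmanager


@contextmanager
def monitoring_region(api, name):
    """Start the region `name` on entry and stop it on exit.

    `api` is any object exposing DLB_MonitoringRegionRegister,
    DLB_MonitoringRegionStart and DLB_MonitoringRegionStop.
    """
    handle = api.DLB_MonitoringRegionRegister(name)
    api.DLB_MonitoringRegionStart(handle)
    try:
        yield handle
    finally:
        # Runs even if the body raises, so the region is never left open.
        api.DLB_MonitoringRegionStop(handle)


class FakeDlbApi:
    """Hypothetical stand-in for the dlb module; records calls only."""

    def __init__(self):
        self.calls = []
        self._regions = {}

    def DLB_MonitoringRegionRegister(self, name):
        self.calls.append(("register", name))
        # Same name returns the same handle, as described for DLB.
        return self._regions.setdefault(name, object())

    def DLB_MonitoringRegionStart(self, handle):
        self.calls.append(("start", handle))

    def DLB_MonitoringRegionStop(self, handle):
        self.calls.append(("stop", handle))
```

With DLB available, the intended usage would be `with monitoring_region(dlb, "Region 1"): ...`. Stopping the explicit handle rather than DLB_LAST_OPEN_REGION keeps the helper correct when regions are nested or overlap.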

Computing POP metrics for a region at run time

POP metrics can be obtained at any point by calling a collective DLB function that gathers data from all processes. The call accepts a dlb_monitor_t to specify a region, or DLB_GLOBAL_REGION for the implicit global region. The returned struct contains both intermediate values and final efficiency ratios, the latter shown below:

typedef struct dlb_pop_metrics_t {
    ...
    float parallel_efficiency;
    float mpi_parallel_efficiency;
    float mpi_communication_efficiency;
    float mpi_load_balance;
    float mpi_load_balance_in;
    float mpi_load_balance_out;
    float omp_parallel_efficiency;
    float omp_load_balance;
    float omp_scheduling_efficiency;
    float omp_serialization_efficiency;
} dlb_pop_metrics_t;

Example in C:

#include <stdio.h>
#include <dlb_talp.h>
...
dlb_monitor_t *monitor = DLB_MonitoringRegionRegister("Region 1");
// The monitor is then used to start and stop the region.
...
dlb_pop_metrics_t pop_metrics;
// This call performs an MPI synchronization across all processes.
int err = DLB_TALP_CollectPOPMetrics(monitor, &pop_metrics);
printf("%1.2f\n", pop_metrics.parallel_efficiency);
printf("%1.2f\n", pop_metrics.mpi_communication_efficiency);
...

Example in Fortran:

use iso_c_binding
implicit none
include 'dlbf_talp.h'
type(c_ptr) :: dlb_handle
type(dlb_pop_metrics_t) :: pop_metrics
integer :: err
...
dlb_handle = DLB_MonitoringRegionRegister(c_char_"Region 1"//C_NULL_CHAR)
! The dlb_handle is then used to start and stop the region.
...
! This call performs an MPI synchronization across all processes.
err = DLB_TALP_CollectPOPMetrics(dlb_handle, pop_metrics)
print *, pop_metrics%parallel_efficiency
print *, pop_metrics%mpi_communication_efficiency

For Python MPI programs, the dlb_mpi module must be used instead of the base dlb module. These calls also require running with the DLB MPI library preloaded.

import dlb_mpi
...
dlb_handle = dlb_mpi.DLB_MonitoringRegionRegister("Region 1")
# The dlb_handle is then used to start and stop the region.
...
# This call performs an MPI synchronization across all processes.
pop_metrics = dlb_mpi.DLB_TALP_CollectPOPMetrics(dlb_handle)

print(f"Parallel efficiency: {pop_metrics.parallel_efficiency:.2f}")
print(f"MPI communication efficiency: {pop_metrics.mpi_communication_efficiency:.2f}")

Enabling Hardware Counters

Configure DLB with --with-papi to enable support for hardware performance counters through PAPI. When PAPI is available and the system permits access to performance counters, TALP will also report the average IPC (Instructions Per Cycle).

TALP option flags

--talp=<none:default:mpi:openmp:gpu:hwc>

Enable the TALP (Tracking Application Live Performance) profiler.

The option accepts a comma-separated list of profiling components:

default Enable optional profiling for supported components
mpi MPI activity
openmp OpenMP activity
gpu GPU activity
hwc Hardware counters
none Disable TALP

If the option is specified without value (i.e., --talp), it is equivalent to --talp=default.

Expected usage is one of:

--talp=none
--talp or --talp=default
--talp=<component>[,<component>...]

Examples:

--talp: Enable profiling for all available components (auto-detected).
--talp=gpu: Enable GPU profiling.
--talp=mpi,openmp: Enable profiling of MPI and OpenMP activity.
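Job scripts that assemble DLB_ARGS programmatically may want to reject a malformed value before the job is submitted. The following sketch mirrors the three expected usages listed above. `parse_talp_value` is a hypothetical helper; it is not DLB's parser, and how DLB itself treats combinations outside those three forms is not covered here.

```python
COMPONENTS = {"mpi", "openmp", "gpu", "hwc"}


def parse_talp_value(value=None):
    """Interpret the text after '--talp=' (None means a bare '--talp').

    Returns an empty set for 'none', {'default'} for the default mode,
    or the set of explicitly requested components.
    """
    if value is None or value == "default":
        return {"default"}
    if value == "none":
        return set()
    requested = set(value.split(","))
    unknown = requested - COMPONENTS
    if unknown:
        raise ValueError(f"unknown TALP component(s): {sorted(unknown)}")
    return requested
```

Raising on unknown names is a design choice of this sketch, so that a typo such as `--talp=opnmp` fails loudly in the job script.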

--talp-summary=<none:all:pop-metrics:process>

Report TALP metrics at the end of the execution. If --talp-output-file is not specified, a short summary is printed. Otherwise, a more verbose file will be generated with all the metrics collected by TALP, depending on the list of requested summaries, separated by ":":

pop-metrics, the default option, will report a subset of the POP metrics.

process will report the measurements of each process for each registered region.

Deprecated options:

pop-raw will be removed in the next release. The output will be available via the pop-metrics summary.

node will be removed in the next release. Its data can be derived from the process report.

--talp-external-profiler=<bool>

Enable live metrics update to the shared memory. This flag is only needed if there is an external program monitoring the application.

--talp-output-file=<path>

Write extended TALP metrics to a file. If omitted, output is written to stderr.

The output format is determined by the file extension:

*.json JSON (file is overwritten)
*.csv CSV (rows are appended)
other Plain text
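Because the extension alone decides the format, a post-processing script can predict what kind of file a run will produce. A minimal sketch of the table above follows. `talp_output_format` is a hypothetical helper, and matching the extension case-sensitively is an assumption.

```python
import os


def talp_output_format(path):
    """Return 'json', 'csv' or 'text' for a --talp-output-file path."""
    extension = os.path.splitext(path)[1]
    if extension == ".json":
        return "json"  # file is overwritten on each run
    if extension == ".csv":
        return "csv"   # rows are appended to an existing file
    return "text"
```

The append behaviour of CSV is worth keeping in mind when the same path is reused across runs: results accumulate in one file, whereas a JSON file only ever holds the latest run.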

The filename may contain replacement tokens:

%h Hostname
%p Process ID (PID)
%j Job ID from the environment (e.g., Slurm, Flux)
%% Literal ‘%’
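To locate the output files afterwards, a script can expand the same tokens itself. The sketch below imitates the substitution rules listed above. `expand_output_name` is a hypothetical helper, and reading the job ID from SLURM_JOB_ID is an assumption: DLB may consult other variables or schedulers.

```python
import os
import re
import socket


def expand_output_name(template, hostname=None, pid=None, job_id=None):
    """Expand %h, %p, %j and %% in a --talp-output-file template."""
    values = {
        "h": hostname if hostname is not None else socket.gethostname(),
        "p": str(pid if pid is not None else os.getpid()),
        # Assumption: fall back to Slurm's job ID variable when not given.
        "j": job_id if job_id is not None else os.environ.get("SLURM_JOB_ID", ""),
        "%": "%",
    }
    # A single left-to-right pass, so "%%h" yields the literal text "%h".
    return re.sub(r"%([hpj%])", lambda match: values[match.group(1)], template)
```

Tokens matter most when several runs or processes would otherwise write to the same path, for example `talp_%j.json` to keep one JSON file per job.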

--talp-partial-output=<bool>

Write one profiling output file per process instead of a single merged file. Only supported when the output format is JSON.

--talp-region-select=<string>

Select TALP regions to enable. This option follows the format: --talp-region-select=[(include|exclude):]<region-list>

The modifiers include: and exclude: are optional, but only one modifier can be used at a time. If neither is specified, include: is assumed by default.

The <region-list> can be a comma-separated list of regions or a special token all to refer to all regions. The global monitoring region may be specified with the special token global. If the modifier include: is used, only the listed regions will be enabled. If exclude: is used, all regions will be enabled except for the ones specified.

Note that when using this feature, listed regions must not have spaces in their names.

e.g.: --talp-region-select=all (default), --talp-region-select=exclude:all, --talp-region-select=include:global,region3, --talp-region-select=exclude:region4.
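The selection rules can be summarised in a few lines of Python. This is an illustrative model of the documented behaviour, not DLB's implementation. Both function names are hypothetical, and treating the tokens all and global case-insensitively is an assumption carried over from the region naming rules.

```python
def parse_region_select(value="all"):
    """Split a --talp-region-select value into (modifier, region list)."""
    modifier, separator, rest = value.partition(":")
    if separator and modifier in ("include", "exclude"):
        region_list = rest
    else:
        # No recognised modifier: include: is assumed by default.
        modifier, region_list = "include", value
    regions = [region for region in region_list.split(",") if region]
    return modifier, regions


def region_enabled(name, modifier, regions):
    """Tell whether the region `name` is enabled by the selection."""

    def matches(token):
        if token.lower() == "all":
            return True
        if token.lower() == "global":
            return name.lower() == "global"
        return token == name  # user-defined names are case-sensitive

    listed = any(matches(token) for token in regions)
    return listed if modifier == "include" else not listed
```

For instance, with `exclude:region4` every region except region4 stays enabled, while `exclude:all` disables everything, including the global region.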