TALP for monitoring the Programming Model efficiencies
TALP is a low-overhead profiling tool for collecting performance metrics from applications using MPI, OpenMP, OpenACC, CUDA, and HIP. Support for some of these programming models is still experimental.
These metrics can be reported at the end of the execution or queried at runtime through the API. The API also lets users define custom monitoring regions to gain detailed insight into the performance of specific parts of the code.
An in-depth explanation of the metrics computed by TALP can be found here.
Here you can get an overview of the different runtime options available when using TALP.
A good way to get started is to just run your application and let TALP report the metrics at the end of the execution.
If you already have some executions of your application using TALP, you might want to check out TALP-Pages which can generate some plots using your JSON files.
Reporting POP metrics at the end of the execution
After installing DLB, you can use TALP to report metrics at the end of the execution. The exact setup depends on the programming models used by your application.
Note
Note that the flags shown below are the minimum requirements for TALP to report metrics at the end of the execution.
You can also use --talp-output-file to generate CSV or JSON formatted files.
More info in the options section below.
MPI-only executions
To gather and report the MPI-based performance metrics, you can run your application with libdlb_mpi pre-loaded and activate TALP by setting the DLB_ARGS:
DLB_PREFIX="<path-to-DLB-installation>"
export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo
You will get a report similar to this on stderr at the end of the execution:
DLB[<hostname>:<pid>]: ############### Monitoring Region POP Metrics ###############
DLB[<hostname>:<pid>]: ### Name: Global
DLB[<hostname>:<pid>]: ### Elapsed Time: 5 s
DLB[<hostname>:<pid>]: ### Average IPC: 0.23
DLB[<hostname>:<pid>]: ### Parallel efficiency: 0.40
DLB[<hostname>:<pid>]: ### MPI Parallel efficiency: 0.67
DLB[<hostname>:<pid>]: ### - MPI Communication efficiency: 0.83
DLB[<hostname>:<pid>]: ### - MPI Load Balance: 0.80
DLB[<hostname>:<pid>]: ### - MPI Load Balance in: 0.80
DLB[<hostname>:<pid>]: ### - MPI Load Balance out: 1.00
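The report above is plain text, so it can be post-processed with a few lines of Python. The following is a minimal sketch, not part of DLB: the function name `parse_talp_report` is hypothetical, and the parser only assumes the `DLB[<hostname>:<pid>]: ### <name>: <value>` layout shown in the sample. Reports with separate sections that reuse metric names (such as the GPU report) would need extra section tracking.

```python
import re

# Matches lines such as:
#   DLB[node1:1234]: ### - MPI Load Balance in: 0.80
# The prefix and the "name: value" layout are taken from the sample report.
_LINE = re.compile(
    r"^DLB\[[^\]]*\]:\s*###\s*(?:-\s*)?(?P<key>[^:#]+?)\s*:\s*(?P<value>.+?)\s*$"
)


def parse_talp_report(text):
    """Return {region name: {metric name: value}} from TALP's text summary."""
    regions = {}
    current = None
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # banner lines and unrelated application output
        key, value = match.group("key"), match.group("value")
        if key == "Name":
            current = regions.setdefault(value, {})
        elif current is not None:
            try:
                current[key] = float(value)
            except ValueError:
                current[key] = value  # e.g. "5 s" for Elapsed Time
    return regions


sample = """\
DLB[node1:1234]: ############### Monitoring Region POP Metrics ###############
DLB[node1:1234]: ### Name: Global
DLB[node1:1234]: ### Elapsed Time: 5 s
DLB[node1:1234]: ### Parallel efficiency: 0.40
DLB[node1:1234]: ### - MPI Load Balance in: 0.80
"""
report = parse_talp_report(sample)
print(report["Global"]["Parallel efficiency"])
```

For anything beyond quick checks, the JSON or CSV files produced with --talp-output-file are likely a more robust input than scraping stderr.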
OpenMP-only executions
As TALP relies on the OMPT interface to inspect runtime behavior of OpenMP, the runtime implementation needs to support this. Below you can find a table of currently supported runtimes and the respective compilers.
| Compiler / Runtime | Support for TALP OpenMP metrics |
|---|---|
| LLVM-compilers like | supported with version |
| Intel Classic-compilers | supported with version |
| Intel LLVM-compilers | supported |
| GNU compilers | not supported (see note below) |
| Cray clang compilers | supported with version |
If you are using any supported compiler in the table above to build your application, you can execute your application like this to gather OpenMP performance metrics with TALP:
DLB_PREFIX="<path-to-DLB-installation>"
export DLB_ARGS="--talp"
env LD_PRELOAD="$DLB_PREFIX/lib/libdlb.so" ./foo
Note
Since LLVM’s OpenMP runtime supports the GOMP interface, you can compile and link applications with GCC and explicitly link against the LLVM OpenMP runtime using:
gcc your_app.c -L<llvm_openmp_lib_dir> -Wl,-rpath,<llvm_openmp_lib_dir> -lomp
This allows you to use the LLVM OpenMP runtime even when compiling with GCC.
Note
The compilers listed here are relevant only for their default OpenMP runtimes. This does not refer to the compiler used to build the DLB library.
Hybrid (OpenMP+MPI) executions
For hybrid applications, TALP also needs an OpenMP runtime with OMPT support. Please make sure that you use a supported compiler to build your application.
The DLB_ARGS to configure TALP for hybrid applications are the same as for OpenMP ones, but this time libdlb_mpi.so is preloaded instead:
DLB_PREFIX="<path-to-DLB-installation>"
export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo
After your program has finished, you will get a report similar to this on stderr:
DLB[<hostname>:<pid>]: ############### Monitoring Region POP Metrics ###############
DLB[<hostname>:<pid>]: ### Name: Global
DLB[<hostname>:<pid>]: ### Elapsed Time: 5 s
DLB[<hostname>:<pid>]: ### Average IPC: 0.23
DLB[<hostname>:<pid>]: ### Parallel efficiency: 0.40
DLB[<hostname>:<pid>]: ### MPI Parallel efficiency: 0.67
DLB[<hostname>:<pid>]: ### - MPI Communication efficiency: 0.83
DLB[<hostname>:<pid>]: ### - MPI Load Balance: 0.80
DLB[<hostname>:<pid>]: ### - MPI Load Balance in: 0.80
DLB[<hostname>:<pid>]: ### - MPI Load Balance out: 1.00
DLB[<hostname>:<pid>]: ### OpenMP Parallel efficiency: 0.60
DLB[<hostname>:<pid>]: ### - OpenMP Load Balance: 0.80
DLB[<hostname>:<pid>]: ### - OpenMP Scheduling efficiency: 1.00
DLB[<hostname>:<pid>]: ### - OpenMP Serialization efficiency: 0.75
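The indentation of the report suggests a hierarchy of metrics. The sketch below checks one reading of it against the sample values: it assumes, in the style of the POP methodology, that each parent metric is the product of its children. That assumption and the grouping in `hierarchy` are inferred from the sample output here, not taken from the DLB sources. The tolerance reflects that the report prints two decimals.

```python
import math

# Values copied from the sample hybrid report above.
metrics = {
    "Parallel efficiency": 0.40,
    "MPI Parallel efficiency": 0.67,
    "MPI Communication efficiency": 0.83,
    "MPI Load Balance": 0.80,
    "MPI Load Balance in": 0.80,
    "MPI Load Balance out": 1.00,
    "OpenMP Parallel efficiency": 0.60,
    "OpenMP Load Balance": 0.80,
    "OpenMP Scheduling efficiency": 1.00,
    "OpenMP Serialization efficiency": 0.75,
}

# Assumed hierarchy: parent -> children whose product should match the parent.
hierarchy = {
    "Parallel efficiency": ["MPI Parallel efficiency", "OpenMP Parallel efficiency"],
    "MPI Parallel efficiency": ["MPI Communication efficiency", "MPI Load Balance"],
    "MPI Load Balance": ["MPI Load Balance in", "MPI Load Balance out"],
    "OpenMP Parallel efficiency": [
        "OpenMP Load Balance",
        "OpenMP Scheduling efficiency",
        "OpenMP Serialization efficiency",
    ],
}


def check_hierarchy(metrics, hierarchy, tolerance=0.01):
    """Return {parent: True/False} for whether the children multiply to the parent."""
    results = {}
    for parent, children in hierarchy.items():
        product = math.prod(metrics[child] for child in children)
        results[parent] = abs(product - metrics[parent]) <= tolerance
    return results


results = check_hierarchy(metrics, hierarchy)
for parent, consistent in results.items():
    print(f"{parent}: {'consistent' if consistent else 'inconsistent'}")
```

Read this way, the lowest leaf value in the sample (OpenMP Serialization efficiency) is the factor that limits the overall parallel efficiency the most.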
NVIDIA GPU executions
For GPU applications running on NVIDIA devices, DLB must previously be configured with
--with-cuda (see DLB configure flags) so that it can locate the appropriate CUDA
and CUPTI libraries.
Usage is similar to the previous examples. Enable TALP with --talp and the
CUPTI backend plugin will be automatically loaded if it was configured and built.
Additionally, if the application is not an MPI program,
DLB may need to be auto-initialized with a special environment variable:
# Pure GPU execution, DLB needs to be auto-initialized
export DLB_AUTO_INIT=1
export DLB_ARGS="--talp"
LD_PRELOAD="$DLB_PREFIX/lib/libdlb.so" ./foo
# Hybrid MPI+GPU execution, preload libdlb_mpi.so as usual, DLB_AUTO_INIT not needed
export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo
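When runs are launched from a script, the launch lines above can be assembled programmatically. The helper below is a hypothetical sketch (`talp_command` is not a DLB function). It only encodes what the snippets in this guide show: libdlb_mpi.so is preloaded for MPI runs, libdlb.so otherwise, and DLB_AUTO_INIT=1 is set only for non-MPI runs that need it.

```python
def talp_command(app, dlb_prefix, mpirun_options=None, auto_init=False,
                 dlb_args="--talp"):
    """Return (argv, extra_env) mirroring the shell snippets above.

    mpirun_options=None means a non-MPI run (libdlb.so is preloaded);
    a list, even an empty one, means an MPI run (libdlb_mpi.so is preloaded).
    """
    use_mpi = mpirun_options is not None
    library = "libdlb_mpi.so" if use_mpi else "libdlb.so"
    preload = f"LD_PRELOAD={dlb_prefix}/lib/{library}"

    # As in the snippets above, LD_PRELOAD is passed through `env`
    # right in front of the application binary.
    argv = ["env", preload, app]
    if use_mpi:
        argv = ["mpirun", *mpirun_options] + argv

    extra_env = {"DLB_ARGS": dlb_args}
    if auto_init and not use_mpi:
        # Auto-initialization is only described for non-MPI programs.
        extra_env["DLB_AUTO_INIT"] = "1"
    return argv, extra_env


argv, extra_env = talp_command("./foo", "/opt/dlb", mpirun_options=["-n", "4"])
print(" ".join(argv))
print(extra_env)
```

The result could then be launched with something like `subprocess.run(argv, env={**os.environ, **extra_env})`. The path `/opt/dlb` and the `-n 4` option are placeholders.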
A hybrid MPI+GPU execution will result in an output similar to this:
DLB[<hostname>:<pid>]: ############### Monitoring Region POP Metrics ################
DLB[<hostname>:<pid>]: ### Name: Global
DLB[<hostname>:<pid>]: ### Elapsed Time: 10.16 s
DLB[<hostname>:<pid>]: ### Host
DLB[<hostname>:<pid>]: ### ----
DLB[<hostname>:<pid>]: ### Parallel efficiency: 0.99
DLB[<hostname>:<pid>]: ### - MPI Parallel efficiency: 1.00
DLB[<hostname>:<pid>]: ### - Communication efficiency: 1.00
DLB[<hostname>:<pid>]: ### - Load Balance: 1.00
DLB[<hostname>:<pid>]: ### - In: 1.00
DLB[<hostname>:<pid>]: ### - Out: 1.00
DLB[<hostname>:<pid>]: ### - Device Offload efficiency: 0.99
DLB[<hostname>:<pid>]: ###
DLB[<hostname>:<pid>]: ### NVIDIA Device
DLB[<hostname>:<pid>]: ### -------------
DLB[<hostname>:<pid>]: ### Parallel efficiency: 0.50
DLB[<hostname>:<pid>]: ### - Load Balance: 1.00
DLB[<hostname>:<pid>]: ### - Communication efficiency: 1.00
DLB[<hostname>:<pid>]: ### - Orchestration efficiency: 0.50
AMD GPU executions
For GPU applications running on AMD devices, DLB must be previously configured with
--with-rocm (see DLB configure flags) so that it can locate the appropriate
ROCm libraries.
Once DLB is built with ROCm support, GPU profiling backends are automatically
detected and enabled at runtime when available. Therefore, users only need to
enable TALP with --talp and DLB will attempt to load the appropriate backend.
Internally, DLB may use one of two ROCm profiling interfaces depending on the ROCm version available on the system:
- rocprofiler-sdk: uses the officially supported librocprofiler-sdk, available starting with ROCm 6.2 and the recommended method for collecting performance metrics on AMD GPUs.
- rocprofilerv2: uses ROCm’s deprecated librocprofilerv2. While still supported by DLB for compatibility with older ROCm installations, it may fail to obtain metrics in some cases.
The appropriate backend is automatically selected at runtime.
# Pure GPU execution (DLB_AUTO_INIT is not needed for ROCm >= 6.2)
export DLB_ARGS="--talp"
LD_PRELOAD="$DLB_PREFIX/lib/libdlb.so" ./foo
# Hybrid MPI+GPU execution
export DLB_ARGS="--talp"
mpirun <options> env LD_PRELOAD="$DLB_PREFIX/lib/libdlb_mpi.so" ./foo
Hybrid MPI+GPU executions on AMD devices produce output comparable to the NVIDIA example shown above. A separate example is omitted for brevity.
Defining custom monitoring regions
TALP utilizes monitoring regions to track and report performance metrics. A
monitoring region is a designated section of code marked for tracking.
Initially, TALP defines a default monitoring region, called “Global”, which
spans from DLB_Init to DLB_Finalize. Additionally, users can create
custom monitoring regions through the DLB API.
Note
The region between DLB_Init and DLB_Finalize can vary depending
on the initialisation method used, whether it’s automatic initialisation
with MPI or OpenMP, or direct initialisation through the DLB API.
A monitoring region can be registered using the
DLB_MonitoringRegionRegister function. Multiple calls with the same
non-null char pointer will return the same region. The region does not begin
until the function DLB_MonitoringRegionStart is called, and must end with
the function DLB_MonitoringRegionStop.
A monitoring region may be paused and resumed multiple times.
All user-defined regions should be stopped before MPI_Finalize.
Here are a few restrictions for naming monitoring regions:
- The name “Global” (case-insensitive) is reserved and cannot be used for any user-defined region. If the user attempts to register a region with this name, a pointer to the global region will be returned.
- The name “all” (case-insensitive) is reserved and cannot be used. Attempting to register a region with this name will result in an error.
- For user-defined regions, the name is case-sensitive, can contain up to 128 characters, and may include spaces (though spaces must be avoided when using the flag
--talp-region-select, as explained below).
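The naming rules above can be checked before a name is ever passed to DLB. The following is a minimal sketch with a hypothetical helper, `check_region_name`. It restates the documented rules and does not call DLB. It counts Python characters, which may differ from a byte count for non-ASCII names.

```python
MAX_REGION_NAME_LENGTH = 128  # limit stated in the rules above


def check_region_name(name):
    """Return a list of problems with a user-defined region name (empty if none)."""
    problems = []
    lowered = name.lower()
    if lowered == "global":
        problems.append("'Global' is reserved: registering it returns the global region")
    if lowered == "all":
        problems.append("'all' is reserved: registering it results in an error")
    if len(name) > MAX_REGION_NAME_LENGTH:
        problems.append(f"longer than {MAX_REGION_NAME_LENGTH} characters")
    if " " in name:
        # Allowed by TALP, but such a region cannot be listed in --talp-region-select.
        problems.append("contains spaces: cannot be used with --talp-region-select")
    return problems


for candidate in ("solver", "Region 1", "GLOBAL", "All"):
    print(candidate, "->", check_region_name(candidate) or "ok")
```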
Basic usage examples for C:
#include <dlb_talp.h>
...
dlb_monitor_t *monitor = DLB_MonitoringRegionRegister("Region 1");
...
while (...) {
    ...
    /* Resume region */
    DLB_MonitoringRegionStart(monitor);
    ...
    /* Pause region */
    DLB_MonitoringRegionStop(monitor);
}
Basic usage for Fortran:
use iso_c_binding
implicit none
include 'dlbf_talp.h'
type(c_ptr) :: dlb_handle
integer :: err
...
dlb_handle = DLB_MonitoringRegionRegister(c_char_"Region 1"//C_NULL_CHAR)
...
do ...
    ! Resume region
    err = DLB_MonitoringRegionStart(dlb_handle)
    ...
    ! Pause region
    err = DLB_MonitoringRegionStop(dlb_handle)
enddo
And for Python:
import dlb
...
monitor = dlb.DLB_MonitoringRegionRegister("Region 1")
...
for ... :
    # Resume region
    dlb.DLB_MonitoringRegionStart(monitor)
    ...
    # Pause region
    dlb.DLB_MonitoringRegionStop(monitor)
For each defined monitoring region, including the global region, TALP will print or write a summary at the end of the execution.
Note
See Example 3 of How to run with DLB for more information on compiling and linking with the DLB library.
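In Python, the start/stop pattern above can be wrapped in a context manager so that a region is always stopped, even if the measured code raises. This is a sketch, not part of the DLB bindings: `monitoring_region` is a hypothetical helper, and `api` stands for any object exposing the three functions used in the Python example (in practice, the imported `dlb` module). A small stand-in class is used here so the example runs without DLB installed.

```python
from contextlib import contextmanager


@contextmanager
def monitoring_region(api, name):
    """Register `name`, start the region, and stop it when the block exits."""
    monitor = api.DLB_MonitoringRegionRegister(name)
    api.DLB_MonitoringRegionStart(monitor)
    try:
        yield monitor
    finally:
        api.DLB_MonitoringRegionStop(monitor)


class FakeDLB:
    """Stand-in that records calls; replace with the real `dlb` module."""

    def __init__(self):
        self.calls = []

    def DLB_MonitoringRegionRegister(self, name):
        self.calls.append(("register", name))
        return name  # the real binding returns a region handle

    def DLB_MonitoringRegionStart(self, monitor):
        self.calls.append(("start", monitor))

    def DLB_MonitoringRegionStop(self, monitor):
        self.calls.append(("stop", monitor))


fake = FakeDLB()
for _ in range(2):
    with monitoring_region(fake, "Region 1"):
        pass  # work to be measured goes here
print(fake.calls)
```

Registering inside the loop relies on the documented behaviour that repeated registrations with the same name return the same region.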
Inspecting monitoring regions within the source code
The struct dlb_monitor_t is defined in dlb_talp.h. Its fields can be
accessed at any time, although to guarantee that the values are up to date the
region needs to be stopped.
For Fortran codes, the struct can be accessed as in this example:
use iso_c_binding
implicit none
include 'dlbf_talp.h'
type(c_ptr) :: dlb_handle
type(dlb_monitor_t), pointer :: dlb_monitor
integer :: err
character(16), pointer :: monitor_name
...
dlb_handle = DLB_MonitoringRegionRegister(c_char_"Region 1"//C_NULL_CHAR)
err = DLB_MonitoringRegionStart(dlb_handle)
err = DLB_MonitoringRegionStop(dlb_handle)
...
call c_f_pointer(dlb_handle, dlb_monitor)
call c_f_pointer(dlb_monitor%name_, monitor_name)
print *, monitor_name
print *, dlb_monitor%num_measurements
print *, dlb_monitor%elapsed_time
And for Python codes:
import dlb
dlb_handle = dlb.DLB_MonitoringRegionRegister("Region 1")
dlb.DLB_MonitoringRegionStart(dlb_handle)
dlb.DLB_MonitoringRegionStop(dlb_handle)
monitor = dlb_handle.contents
print(monitor.name.decode())
print(monitor.num_measurements)
print(monitor.elapsed_time)
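Once a region is stopped, its fields can be turned into a short summary. The sketch below uses only the two numeric fields shown above and assumes that `elapsed_time` is an accumulated time in nanoseconds. Check that unit against dlb_talp.h for your DLB version. `summarize_monitor` is a hypothetical helper, and a `SimpleNamespace` stands in for the real monitor so the example runs on its own.

```python
from types import SimpleNamespace

# Assumption: elapsed_time is reported in nanoseconds.
NANOSECONDS_PER_SECOND = 1_000_000_000


def summarize_monitor(monitor):
    """Return elapsed seconds, measurement count, and mean seconds per measurement."""
    elapsed_s = monitor.elapsed_time / NANOSECONDS_PER_SECOND
    count = monitor.num_measurements
    mean_s = elapsed_s / count if count else 0.0
    return {"elapsed_s": elapsed_s, "measurements": count, "mean_s": mean_s}


# Stand-in for `dlb_handle.contents` of a region that was measured 4 times.
fake_monitor = SimpleNamespace(num_measurements=4, elapsed_time=2_000_000_000)
summary = summarize_monitor(fake_monitor)
print(summary)
```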
Special values for monitoring regions
The special values DLB_GLOBAL_REGION and DLB_LAST_OPEN_REGION can be
used in any TALP function to refer to these contexts without needing to pass
the region handle explicitly:
#include <string>
#include <dlb_talp.h>

// Helper class to create a region for the current scope
struct Profiler {
    Profiler(const std::string& name) {
        dlb_monitor_t *monitor = DLB_MonitoringRegionRegister(name.c_str());
        DLB_MonitoringRegionStart(monitor);
    }
    ~Profiler() {
        // Stops whichever region was opened last, so no handle needs to be stored
        DLB_MonitoringRegionStop(DLB_LAST_OPEN_REGION);
    }
};

void foo() {
    // Everything in this scope is recorded as "Region 1"
    {
        Profiler p("Region 1");
        ...
    }
    // Everything in this scope is recorded as "Region 2"
    {
        Profiler p("Region 2");
        ...
    }
    // Print current Global region metrics
    DLB_MonitoringRegionReport(DLB_GLOBAL_REGION);
}
In Fortran, DLB_GLOBAL_REGION is defined as type(c_ptr) and can be
used the same way as in C. Additionally, DLB_GLOBAL_REGION_INT and
DLB_LAST_OPEN_REGION_INT are defined as integer(kind=c_intptr_t) and must
be converted to type(c_ptr) using the F90 intrinsic procedure transfer:
! Print current Global region metrics
err = DLB_MonitoringRegionReport(DLB_GLOBAL_REGION)
! Equivalent, using integer(c_intptr_t)
err = DLB_MonitoringRegionReport(transfer(DLB_GLOBAL_REGION_INT, c_null_ptr))
! Start region and stop
err = DLB_MonitoringRegionStart(dlb_handle)
err = DLB_MonitoringRegionStop(transfer(DLB_LAST_OPEN_REGION_INT, c_null_ptr))
In Python, DLB_GLOBAL_REGION and DLB_LAST_OPEN_REGION are used the same way as in C, in place of an explicit region handle:
import dlb
# Print current Global region metrics
dlb.DLB_MonitoringRegionReport(dlb.DLB_GLOBAL_REGION)
# Start region and stop
dlb.DLB_MonitoringRegionStart(dlb_handle)
dlb.DLB_MonitoringRegionStop(dlb.DLB_LAST_OPEN_REGION)
Computing POP metrics for a region at run time
POP metrics can be obtained at any point by calling a collective DLB function
that gathers data from all processes. The call accepts a dlb_monitor_t to
specify a region, or DLB_GLOBAL_REGION for the implicit global region. The
returned struct contains both intermediate values and final efficiency ratios,
the latter shown below:
typedef struct dlb_pop_metrics_t {
...
float parallel_efficiency;
float mpi_parallel_efficiency;
float mpi_communication_efficiency;
float mpi_load_balance;
float mpi_load_balance_in;
float mpi_load_balance_out;
float omp_parallel_efficiency;
float omp_load_balance;
float omp_scheduling_efficiency;
float omp_serialization_efficiency;
} dlb_pop_metrics_t;
Example in C:
#include <dlb_talp.h>
...
dlb_monitor_t *monitor = DLB_MonitoringRegionRegister("Region 1");
// The monitor is then used to start and stop the region.
...
dlb_pop_metrics_t pop_metrics;
// This call performs an MPI synchronization across all processes.
int err = DLB_TALP_CollectPOPMetrics(monitor, &pop_metrics);
printf("%1.2f\n", pop_metrics.parallel_efficiency);
printf("%1.2f\n", pop_metrics.mpi_communication_efficiency);
...
Example in Fortran:
use iso_c_binding
implicit none
include 'dlbf_talp.h'
type(c_ptr) :: dlb_handle
type(dlb_pop_metrics_t) :: pop_metrics
integer :: err
...
dlb_handle = DLB_MonitoringRegionRegister(c_char_"Region 1"//C_NULL_CHAR)
! The dlb_handle is then used to start and stop the region.
...
! This call performs an MPI synchronization across all processes.
err = DLB_TALP_CollectPOPMetrics(dlb_handle, pop_metrics)
print *, pop_metrics%parallel_efficiency
print *, pop_metrics%mpi_communication_efficiency
For Python MPI programs, the dlb_mpi module must be used instead of the base dlb module. These calls also require running with the DLB MPI library preloaded.
import dlb_mpi
...
dlb_handle = dlb_mpi.DLB_MonitoringRegionRegister("Region 1")
# The dlb_handle is then used to start and stop the region.
...
# This call performs an MPI synchronization across all processes.
pop_metrics = dlb_mpi.DLB_TALP_CollectPOPMetrics(dlb_handle)
print(f"Parallel efficiency: {pop_metrics.parallel_efficiency:.2f}")
print(f"MPI communication efficiency: {pop_metrics.mpi_communication_efficiency:.2f}")
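A common next step is to find which factor limits efficiency the most. The sketch below picks the lowest leaf metric from an object with the field names of dlb_pop_metrics_t. `lowest_factor` and the choice of `LEAF_FACTORS` are illustrative assumptions. A `SimpleNamespace` filled with the values from the sample hybrid report stands in for the struct returned by DLB_TALP_CollectPOPMetrics. Fields belonging to a programming model the application does not use may need to be filtered out first. That behaviour is not covered here.

```python
from types import SimpleNamespace

# Leaf efficiencies of the struct shown earlier (parents are left out on purpose).
LEAF_FACTORS = (
    "mpi_communication_efficiency",
    "mpi_load_balance_in",
    "mpi_load_balance_out",
    "omp_load_balance",
    "omp_scheduling_efficiency",
    "omp_serialization_efficiency",
)


def lowest_factor(pop_metrics, fields=LEAF_FACTORS):
    """Return (field name, value) of the smallest efficiency among `fields`."""
    values = {field: getattr(pop_metrics, field) for field in fields}
    name = min(values, key=values.get)
    return name, values[name]


# Stand-in for the returned struct, using the sample hybrid report values.
pop_metrics = SimpleNamespace(
    mpi_communication_efficiency=0.83,
    mpi_load_balance_in=0.80,
    mpi_load_balance_out=1.00,
    omp_load_balance=0.80,
    omp_scheduling_efficiency=1.00,
    omp_serialization_efficiency=0.75,
)
name, value = lowest_factor(pop_metrics)
print(f"Lowest factor: {name} = {value:.2f}")
```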
Enabling Hardware Counters
Configure DLB with --with-papi to enable support
for hardware performance counters through PAPI. When PAPI is available and the
system permits access to performance counters, TALP will also report the
average IPC (Instructions Per Cycle).
TALP option flags
- --talp=<none:default:mpi:openmp:gpu:hwc>
Enable the TALP (Tracking Application Live Performance) profiler.
- The option accepts a comma-separated list of profiling components:
  - default: Enable optional profiling for supported components
  - mpi: MPI activity
  - openmp: OpenMP activity
  - gpu: GPU activity
  - hwc: Hardware counters
  - none: Disable TALP
If the option is specified without a value (i.e., --talp), it is equivalent to --talp=default.
- Expected usage is one of:
  - --talp=none
  - --talp or --talp=default
  - --talp=<component>[,<component>...]
- Examples:
  - --talp: Enable profiling for all available components (auto-detected).
  - --talp=gpu: Enable GPU profiling.
  - --talp=mpi,openmp: Enable profiling of MPI and OpenMP activity.
- --talp-summary=<none:all:pop-metrics:process>
Report TALP metrics at the end of the execution. If --talp-output-file is not specified, a short summary is printed. Otherwise, a more verbose file will be generated with all the metrics collected by TALP, depending on the list of requested summaries, separated by ":":
  - pop-metrics, the default option, will report a subset of the POP metrics.
  - process will report the measurements of each process for each registered region.
Deprecated options:
  - pop-raw will be removed in the next release. The output will be available via the pop-metrics summary.
  - node will be removed in the next release. Its data can be derived from the process report.
- --talp-external-profiler=<bool>
Enable live metrics update to the shared memory. This flag is only needed if there is an external program monitoring the application.
- --talp-output-file=<path>
Write extended TALP metrics to a file. If omitted, output is written to stderr.
- The output format is determined by the file extension:
  - *.json: JSON (file is overwritten)
  - *.csv: CSV (rows are appended)
  - other: Plain text
- The filename may contain replacement tokens:
  - %h: Hostname
  - %p: Process ID (PID)
  - %j: Job ID from the environment (e.g., Slurm, Flux)
  - %%: Literal ‘%’
- --talp-partial-output=<bool>
Write one profiling output file per process instead of a single merged file. Only supported when the output format is JSON.
- --talp-region-select=<string>
Select TALP regions to enable. This option follows the format:
--talp-region-select=[(include|exclude):]<region-list>
The modifiers include: and exclude: are optional, but only one modifier can be used at a time. If neither is specified, include: is assumed by default.
The <region-list> can be a comma-separated list of regions or the special token all to refer to all regions. The global monitoring region may be specified with the special token global. If the modifier include: is used, only the listed regions will be enabled. If exclude: is used, all regions will be enabled except for the ones specified.
Note that when using this feature, listed regions must not have spaces in their names.
e.g.:
  - --talp-region-select=all (default)
  - --talp-region-select=exclude:all
  - --talp-region-select=include:global,region3
  - --talp-region-select=exclude:region4
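The selection rules of --talp-region-select can be modelled in a few lines to preview which regions a given value would enable. This is an illustrative sketch of the documented format, not DLB's implementation. The function names are hypothetical, user region names are compared exactly, and treating the token global as matching the "Global" region regardless of case is an assumption.

```python
def parse_region_select(value):
    """Split '[(include|exclude):]<region-list>' into (mode, set of names)."""
    mode, separator, rest = value.partition(":")
    if separator and mode in ("include", "exclude"):
        region_list = rest
    else:
        # No modifier given: include: is assumed by default.
        mode, region_list = "include", value
    names = {name for name in region_list.split(",") if name}
    return mode, names


def region_enabled(region, value="all"):
    """Return True if `region` would be enabled by the given option value."""
    mode, names = parse_region_select(value)
    listed = (
        "all" in names
        or region in names
        or (region.lower() == "global" and "global" in names)
    )
    return listed if mode == "include" else not listed


for value in ("all", "exclude:all", "include:global,region3", "exclude:region4"):
    enabled = [r for r in ("Global", "region3", "region4") if region_enabled(r, value)]
    print(f"--talp-region-select={value}: {enabled}")
```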