Suite of performance analysis tools and toolkit for building performance analysis tools


Keywords
performance, profiling, sampling, hardware, counters, pybind11, timing, memory, gpu, cupti, cuda, papi, gperftools, craypat, ompt, likwid, roofline, TAU, vtune
License
MIT
Install
pip install timemory==3.2.1rc11

Documentation

timemory

Timing + Memory + Hardware Counter Utilities for C / C++ / CUDA / Python

timemory on GitHub (Source code)

timemory General Documentation

timemory Source Code Documentation (Doxygen)

timemory Testing Dashboard (CDash)

GitHub: git clone https://github.com/NERSC/timemory.git
PyPI: pip install timemory
Anaconda Cloud: conda install -c jrmadsen timemory

Timemory is a performance measurement and analysis framework.

Why Use timemory?

  • Timemory is arguably the most customizable performance measurement and analysis API available
  • High-performance: very low overhead when enabled and near-negligible overhead when disabled
  • Ability to arbitrarily switch and combine different measurement types anywhere in the application
  • Provides static reporting (fixed at compile-time), dynamic reporting (selected at run-time), or hybrid
    • Enable static wall-clock and cpu-clock reporting with the ability to dynamically enable hardware counters at runtime

Support for Multiple Instrumentation Marker APIs

  • NVTX for Nsight-Systems and NVprof
  • LIKWID
  • Caliper
  • TAU
  • ittnotify (Intel VTune and Advisor)

Create Your Own Performance and Analysis Tools

  • Written in C++
  • Direct access to performance analysis data in Python and C++
  • Create your own components: any one-time measurement or start/stop paradigm can be wrapped with timemory
    • Flexible and easily extensible interface: no data type restrictions in custom components

Generic Bundling of Multiple Tools

  • CPU hardware counters via PAPI
  • NVIDIA GPU hardware counters via CUPTI
  • NVIDIA GPU tracing via CUPTI
  • Generating a Roofline for performance-critical sections on the CPU and NVIDIA GPUs
  • Memory usage
  • Tool insertion around malloc, calloc, free, cudaMalloc, cudaFree
  • Wall-clock, cpu-clock, system-clock timing
  • Number of bytes read/written to file-system (and rate)
  • Number of context switches
  • Trip counts
  • CUDA kernel runtime(s)
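
For example, several of the tools listed above can be combined into a single compile-time bundle and applied to a region of interest. The sketch below is illustrative only: cpu_clock and peak_rss are assumed component names, while wall_clock, papi_array_t, and the TIMEMORY_CALIPER macros appear elsewhere in this document.

#include <timemory/timemory.hpp>

// hypothetical bundle combining timing, memory, and hardware-counter tools
using bundle_t = tim::auto_tuple<tim::component::wall_clock,
                                 tim::component::cpu_clock,
                                 tim::component::peak_rss,
                                 tim::component::papi_array_t>;

void compute()
{
    TIMEMORY_CALIPER(region, bundle_t, "compute");
    // ... region of interest ...
    TIMEMORY_CALIPER_APPLY(region, stop);
}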

Powerful GOTCHA Extensions

  • GOTCHA is an API for wrapping function calls, similar to LD_PRELOAD
    • Significantly simplify existing implementations
  • Scoped GOTCHA
  • Use the gotcha component to replace external function calls with your own instrumentation
  • Use the gotcha component to instrument external library calls

Multi-language Support

  • Variadic interface to all the utilities from C code
  • Variadic interface to all the utilities from C++ code
  • Variadic interface to all the utilities from Python code
    • Includes context-managers and decorators

Overview

Timemory is a generic C++11 template library providing a variety of performance components for reporting timing, resource usage, and hardware counters for the CPU and GPU, along with roofline generation and simplified generation of GOTCHA wrappers for instrumenting external library function calls.

Timemory also provides Python and C interfaces.

Purpose

The goal of the package is to provide an easy way to regularly report on the performance of your code. If you have ever added something like this to your code:

import time

tstart = time.time()
# do something
tstop = time.time()
print("Elapsed time: {}".format(tstop - tstart))

Timemory streamlines this work. In C++ codes, all you have to do is include the headers. It is especially handy when optimizing a particular algorithm or section of your code -- you just insert a line of code specifying what you want to measure and then run your code: initialization and output are automated.
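
As a minimal sketch (reusing only the tim::auto_tuple bundle, the wall_clock component, and the TIMEMORY_CALIPER macros shown later in this document), the snippet above might become:

#include <timemory/timemory.hpp>

using timer_t = tim::auto_tuple<tim::component::wall_clock>;

void do_something()
{
    // start measuring wall-clock time for this region
    TIMEMORY_CALIPER(region, timer_t, "do_something");
    // do something
    TIMEMORY_CALIPER_APPLY(region, stop);
    // results are aggregated and reported automatically at program exit
}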

Profiling and timemory

Timemory is not a full profiler (yet). The ultimate goal is to create a customizable profiler. Currently, timemory supports explicit instrumentation (i.e. minor modifications to source code) and explicit wrapping of dynamically-linked functions. Profilers are currently important for discovering where to place timemory markers or which dynamically-linked function calls to wrap with GOTCHA. The library provides an easy-to-use method for always-on general HPC analysis metrics (i.e. timing, memory usage, etc.) with the same or less overhead than if these metrics were recorded and stored in a custom solution; for C++ code, the instrumentation is also extensively inlined. Functionally, the overhead is non-existent: sampling profilers (e.g. gperftools, VTune) at standard sampling rates barely notice the presence of timemory unless it has been used very unwisely.

Additional tools, such as hardware counters, are provided to increase optimization productivity. Want to check whether your changes improved data locality (i.e. decreased cache misses) but don't care about any other section of the code? Use the following and set TIMEMORY_PAPI_EVENTS="PAPI_L1_TCM,PAPI_L2_TCM,PAPI_L3_TCM" in the environment:

using auto_tuple_t = tim::auto_tuple<tim::component::papi_array_t>;
TIMEMORY_CALIPER(roi, auto_tuple_t, "");
//
// do something in region of interest...
//
TIMEMORY_CALIPER_APPLY(roi, stop);

and delete it when finished. It's three extra lines of code that may reduce the time spent changing code, running the profiler, opening the output in the profiler, finding the region of interest, comparing to previous results, and repeating -- from 4 hours to 1.

In general, profilers are not run frequently enough, and performance degradation or memory bloat can go undetected for several commits until a production run crashes or underperforms. This usually leads to a scramble to determine which revision caused the issue. Here, timemory can shorten the time needed to identify a performance regression. When timemory is combined with a continuous-integration reporting system, this scramble can be largely avoided because the high-level reporting associates a region and a commit with exact performance numbers. Once timemory has been used to identify the offending commit and the general region in the offending code, a full profiler should be launched for the fine-grained diagnosis.

Create Your Own Tools/Components

There are numerous instrumentation APIs available, but very few provide the ability for users to create tools/components that fully integrate with the instrumentation API in their code. The simplicity of creating a custom component that inherits category-based formatting properties (is_timing_category) and timing unit conversion (uses_timing_units) can be demonstrated in roughly 50 lines of code with the wall_clock component:

namespace tim
{
namespace component { struct wall_clock; }

namespace trait
{
template <> struct is_timing_category<component::wall_clock> : std::true_type {};
template <> struct uses_timing_units<component::wall_clock> : std::true_type {};
}  // namespace trait

namespace component
{
//
// the system's real time (i.e. wall time) clock, expressed as the
// amount of time since the epoch.
//
struct wall_clock : public base<wall_clock, int64_t>
{
    using ratio_t    = std::nano;
    using value_type = int64_t;
    using base_type  = base<wall_clock, value_type>;

    static std::string label() { return "wall"; }
    static std::string description() { return "wall time"; }
    static value_type  record()
    {
        return tim::get_clock_real_now<int64_t, ratio_t>();
    }

    double get_display() const { return get(); }
    double get() const
    {
        auto val = (is_transient) ? accum : value;
        return static_cast<double>(val) / ratio_t::den * get_unit();
    }

    void start()
    {
        // record the starting timestamp
        set_started();
        value = record();
    }

    void stop()
    {
        // accumulate the time elapsed since start()
        auto tmp = record();
        accum += (tmp - value);
        value = std::move(tmp);
        set_stopped();
    }
};

}  // namespace component
}  // namespace tim
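
Once defined, the component can be used directly via the member functions shown above, or bundled with other components via tim::auto_tuple as in the earlier examples. A minimal direct-use sketch:

tim::component::wall_clock wc;
wc.start();
// ... work to measure ...
wc.stop();
double elapsed = wc.get();  // elapsed wall time, scaled by the configured units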

GOTCHA and timemory

C++ codes running on the Linux operating system can take advantage of the built-in GOTCHA functionality to insert timemory markers around external function calls. GOTCHA is similar to LD_PRELOAD but operates via a programmable API. This includes limited support for C++ function mangling (in general, mangled template functions are not supported -- yet).

Writing a GOTCHA hook in timemory is greatly simplified and applications using timemory can specify their own GOTCHA hooks in a few lines of code instead of being restricted to a pre-defined set of GOTCHA hooks.

Example GOTCHA

If an application wanted to insert tim::auto_timer around (unmangled) MPI_Allreduce and (mangled) ext::do_work in the following executable:

#include <mpi.h>

#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// init() is defined in the instrumentation snippet below; ext::do_work is assumed
// to be declared in an external header providing the ext library interface
void init();

int main(int argc, char** argv)
{
    init();

    MPI_Init(&argc, &argv);

    int sizebuf = 100;
    std::vector<double> sendbuf(sizebuf, 1.0);
    // ... do some stuff
    std::vector<double> recvbuf(sizebuf, 0.0);

    MPI_Allreduce(sendbuf.data(), recvbuf.data(), sizebuf, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    // ... etc.

    int64_t nitr = 10;
    std::pair<float, double> settings{ 1.25f, 2.5 };
    std::tuple<float, double> result = ext::do_work(nitr, settings);
    // ... etc.

    return 0;
}

This would be the required specification using the TIMEMORY_C_GOTCHA macro for unmangled functions and TIMEMORY_CXX_GOTCHA macro for mangled functions:

#include <timemory/timemory.hpp>

static constexpr size_t NUM_FUNCS = 2;
using gotcha_t = tim::component::gotcha<NUM_FUNCS, tim::auto_timer_t>;

void init()
{
    TIMEMORY_C_GOTCHA(gotcha_t, 0, MPI_Allreduce);
    TIMEMORY_CXX_GOTCHA(gotcha_t, 1, ext::do_work);
}

Additional Information

For more information, refer to the documentation.