perf-analysis

high-resolution performance benchmarking


Keywords
library, performance, program, Propose Tags , Perf.Analysis
License
BSD-3-Clause
Install
cabal install perf-analysis-0.0.1.1

Documentation

perf

https://img.shields.io/hackage/v/perf.svg https://github.com/tonyday567/perf/workflows/haskell-ci/badge.svg

Introduction

perf provides some ideas, code and a library for low-level performance measurement for Haskell hacking. The library:

  • provides a monad transformer, PerfT, as a light-weight wrapper for use on existing code. PerfT modifications can be included in code bases, as opposed to performance being separated code and process, with any effects able to be erased at compile time with evalPerfT.
  • focuses on using the rdtsc instruction set that comes with the x86 chip to measure accurate time performance as number of clock cycles.
  • abstracts the concept of what is being measured, so that concepts such as counters, debug checks, time and space performance can be treated equivalently.

Setup

Note that running this readme.org is very slow compared with an external process which accesses the compiled version of the library.

:r
:set -Wno-type-defaults
:set -Wno-unused-do-bind
:set -Wno-name-shadowing
:set -XOverloadedStrings
:set -XOverloadedLabels
import Perf
import Data.FormatN
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Map.Strict as Map
import Control.Monad
import Data.Bifunctor
putStrLn "ok"

Time

What is a tick?

The fundamental operation of Perf.Time is tick, which sandwiches a (strict) function application between two readings of the rdstc chip.

:t tick
tick :: (a -> b) -> a -> IO (Cycles, b)

tick returns in the IO monad, because reading a cycle counter is an IO effect. The trivial but fundamental point is that performance measurement effects the computation being measured.

Unlike benchmarking-style libraries, such as criterion, the result of the computation is passed through along with the cycle count. In this way, the `Perf` monad can be inserted into the midst of a computation in an attempt to measure performance in-situ as opposed to sitting off in a separate and perhaps unrepresentative process.

tick_

tick_ measures the number of cycles between two clock reads.

:t tick_
tick_ :: IO Cycles
fmap word <$> replicateM 10 tick_
[3063,1201,990,978,945,948,951,948,951,948]

multiple ticks

fmap (fmap (word . fst)) . replicateM 10 $ tick (const ()) ()
[12104,5040,4868,4508,4188,4172,4800,4748,5760,4216]

Here, const () () was evaluated and took 6.1k cycles for the first effect, reducing down to 2.6k after 10 effects. What it did in that time we will never know, really, at least from usage of perf. 2.2k cycles is, on my 2.5GHz machine equal to 2200 / 2.5e9 = 8.8e-7 of a second or 880 nanoseconds.

tickIO

tickIO measures the evaluation of an IO value.

:t tickIO
tickIO :: IO a -> IO (Cycles, a)
fmap (fmap (word . fst)) . replicateM 10 $ tickIO (pure ())
[10084,3588,4104,2844,2916,3140,2800,2848,2848,2804]

sum example

fmap (expt (Just 2) . fromIntegral) . fst <$> ticks 10 sum ([1..10000] :: [Double])
["1.7e6","6.1e5","6.2e5","5.4e5","4.3e5","4.1e5","4.1e5","4.3e5","4.1e5","4.1e5"]
ts <- ticks 10000 sum ([1..1000] :: [Double])
print $ average (fmap fromIntegral $ fst ts)
30411.6957

PerfT

PerfT allows for multiple measurement points and is polymorphic in what is being measured. It returns a Map of results held in State.

Compare a lower-level usage of ticks, measuring the average of summing to one thousand over one thousand trials:

first (average . fmap fromIntegral) <$> ticks 1000 sum [1..1000]
(50914.804,500500)

… with PerfT usage

second (fmap (average . fmap fromIntegral)) <$> runPerfT (times 1000) (sum |$| [1..1000])
(500500,fromList [("",44362.311)])

Comparing performance of sum versus a list fusion approach:

fmap (average . fmap fromIntegral) <$> (execPerfT (times 1000) $ do; (fap "sum" sum [1..1000]); (fap "fusion" (\x -> sum [1..x]) 1000))
fromList [("fusion",97804.085),("sum",45053.771)]

An IO example

exampleIO' :: IO ()
exampleIO' = do
  txt <- Text.readFile "src/Perf.hs"
  let n = Text.length txt
  Text.putStrLn $ "length of file is: " <> Text.pack (show n)
exampleIO = execPerfT time (do
  txt <- fam "file_read" (Text.readFile "src/Perf.hs")
  n <- fap "length" Text.length txt
  fam "print_result" (Text.putStrLn $ "length of file is: " <> Text.pack (show n)))
perf-explore --exampleIO
length of file is: 1794
length of file is: 1794

label1          label2          label3          old result      new result      change

normal          file-read       time            5.96e5          6.58e5          slightly-degraded
normal          length          time            1.10e4          7.12e3          improvement
normal          print-result    time            1.44e5          5.70e4          improvement
outer           file-read       time            1.36e5          1.39e5
outer           length          time            1.45e3          1.33e3          improvement
outer           outer-total     time            1.67e5          1.50e5          improvement
outer           print-result    time            2.70e4          6.26e3          improvement

perf-explore

perf-explore contains some exploratory routines used to develop perf

perf-explore --help
examples of perf usage

Usage: perf-explore [-n|--runs ARG]
                    [--best | --median | --average | --averagesecs]
                    [--time | --space | --spacetime | --allocation | --count]
                    [-g|--golden ARG] [--nocheck] [-r|--record]
                    [--header | --noheader] [--error ARG] [--warning ARG]
                    [--improved ARG]
                    [--sums | --lengths | --nub | --examples | --example |
                      --exampleIO | --noops | --ticks | --gauge]
                    [-l|--length ARG]
                    [--sumFuse | --sum | --lengthF | --constFuse | --mapInc |
                      --noOp]

  perf exploration

Available options:
  -n,--runs ARG            number of runs to perform
  --best                   report upper decile
  --median                 report median
  --average                report average
  --averagesecs            report average in seconds
  --time                   measure time performance
  --space                  measure space performance
  --spacetime              measure both space and time performance
  --allocation             measure bytes allocated
  --count                  measure count
  -g,--golden ARG          golden file name
  --nocheck                do not check versus the golden file
  -r,--record              record the result to the golden file
  --header                 include headers
  --noheader               dont include headers
  --error ARG              error level
  --warning ARG            warning level
  --improved ARG           improved level
  --sums                   run on sum algorithms
  --lengths                run on length algorithms
  --nub                    nub test
  --examples               run on example algorithms
  --example                run on the example algorithm
  --exampleIO              exampleIO test
  --noops                  noops test
  --ticks                  tick test
  --gauge                  gauge runs on exmaple for comparison
  -l,--length ARG          length of list
  --sumFuse                fused sum pipeline
  --sum                    sum
  --lengthF                foldr id length
  --constFuse              fused const pipeline
  --mapInc                 fmap (+1)
  --noOp                   const ()
  -h,--help                Show this help text
fmap averageI <$> execPerfT (times 10000) (sum |$| [1..1000])
fromList [("",35729.9518)]

The equivalent to the above code is:

perf-explore -n 10000 -l 1000 --sum --nocheck
label1          label2          results

:

sum             time            1.47e4

noops

This no-op experiment is useful to understand the pure time performance of the machinery around measurement. It can be (re)run with:

perf-explore --noops
label1          label2          label3          results

const           1st             time            6.05e3
const           2nd             time            2.80e1
const           3rd             time            3.60e1
const           4th             time            3.20e1
const           average         time            4.11e1
const           best            time            2.74e1
const           median          time            3.54e1
pure            1st             time            6.46e3
pure            2nd             time            7.60e1
pure            3rd             time            4.00e1
pure            4th             time            4.00e1
pure            average         time            4.28e1
pure            best            time            2.82e1
pure            median          time            3.64e1
No golden file found. To create one, run with '-r'

measurement context

Exploration of how the code surrounding measurement effects performance.

perf-explore -n 1000 -l 1000 --ticks --nocheck
stepTime tick tickForce tickForceArgs tickLazy tickWHNF times
sumAux 1.46e4 1.51e4 1.65e4 1.45e4 1.96e1 1.45e4 1.46e4
sumCata 1.45e4 1.82e4 1.45e4 1.45e4 1.99e1 1.46e4 1.48e4
sumCo 1.45e4 1.46e4 1.46e4 1.63e4 2.02e1 1.45e4 1.44e4
sumCoCase 1.44e4 1.51e4 1.61e4 1.46e4 2.78e1 2.10e4 1.49e4
sumCoGo 2.61e4 1.46e4 3.82e4 2.55e4 2.04e1 1.45e4 2.59e4
sumF 1.34e4 1.34e4 2.01e4 1.64e4 2.01e1 1.33e4 1.33e4
sumFlip 1.37e4 1.34e4 1.34e4 1.33e4 2.06e1 1.41e4 1.33e4
sumFlipLazy 1.33e4 1.34e4 1.33e4 1.34e4 2.01e1 1.37e4 1.34e4
sumFoldr 1.44e4 1.45e4 2.18e4 1.68e4 3.22e1 1.54e4 1.45e4
sumFuse 3.19e3 1.98e3 1.93e3 2.19e3 1.98e1 1.83e3 3.15e3
sumFuseFoldl’ 1.88e3 2.46e3 2.02e3 2.27e3 2.39e1 2.38e3 2.15e3
sumFuseFoldr 5.79e3 6.01e3 3.70e3 3.70e3 2.00e1 3.70e3 6.11e3
sumFusePoly 2.67e3 2.68e3 2.65e3 2.29e3 2.73e1 2.44e3 2.60e3
sumLambda 1.59e4 1.35e4 1.47e4 1.99e4 1.99e1 1.36e4 1.34e4
sumMono 1.38e4 1.69e4 1.35e4 1.36e4 2.04e1 1.37e4 1.34e4
sumPoly 1.99e4 1.42e4 1.35e4 1.73e4 1.97e1 1.35e4 1.51e4
sumSum 1.34e4 1.36e4 1.34e4 1.35e4 1.99e1 1.34e4 2.01e4
sumTail 2.03e4 1.37e4 1.34e4 1.34e4 1.99e1 1.34e4 2.01e4
sumTailLazy 1.34e4 1.66e4 1.33e4 1.41e4 1.99e1 1.35e4 1.33e4

short list

perf-explore -n 10000 -l 10 --best --ticks
stepTime tick tickForce tickForceArgs tickLazy tickWHNF times
sumAux 1.81e2 2.33e2 1.81e2 1.82e2 1.32e1 1.83e2 1.82e2
sumCata 1.81e2 2.33e2 1.81e2 1.82e2 1.33e1 2.33e2 1.81e2
sumCo 2.10e2 2.33e2 2.13e2 2.11e2 2.49e1 2.33e2 2.10e2
sumCoCase 1.81e2 2.47e2 1.82e2 1.82e2 1.83e1 2.47e2 1.81e2
sumCoGo 1.81e2 2.34e2 2.34e2 2.34e2 1.33e1 2.33e2 1.82e2
sumF 1.03e2 2.22e2 1.05e2 1.04e2 1.88e1 1.83e2 1.04e2
sumFlip 1.13e2 1.28e2 1.20e2 1.20e2 1.88e1 1.17e2 1.63e2
sumFlipLazy 1.13e2 2.07e2 1.76e2 1.20e2 2.49e1 2.04e2 1.01e2
sumFoldr 1.81e2 2.66e2 2.65e2 1.81e2 1.88e1 2.65e2 1.81e2
sumFuse 2.58e1 2.71e1 2.85e1 3.36e1 1.36e1 3.36e1 3.42e1
sumFuseFoldl’ 5.04e1 2.71e1 5.11e1 5.08e1 1.41e1 2.65e1 3.96e1
sumFuseFoldr 8.16e1 8.61e1 6.26e1 7.34e1 1.25e1 8.47e1 8.54e1
sumFusePoly 4.19e1 5.82e1 4.26e1 5.81e1 1.30e1 5.91e1 4.30e1
sumLambda 1.39e2 1.30e2 1.60e2 1.63e2 1.33e1 1.29e2 1.43e2
sumMono 8.04e1 1.43e2 8.64e1 8.41e1 1.83e1 1.15e2 8.16e1
sumPoly 1.16e2 1.29e2 1.44e2 1.21e2 2.49e1 1.30e2 1.18e2
sumSum 1.02e2 1.03e2 1.04e2 1.05e2 1.33e1 1.02e2 1.04e2
sumTail 1.40e2 1.37e2 1.31e2 1.31e2 1.36e1 1.45e2 1.42e2
sumTailLazy 1.06e2 1.78e2 1.12e2 1.12e2 1.84e1 2.25e2 1.09e2

long list

perf-explore -n 100 -l 100000 --best --ticks
stepTime tick tickForce tickForceArgs tickLazy tickWHNF times
sumAux 2.82e6 2.70e6 2.66e6 2.75e6 1.83e1 2.70e6 2.79e6
sumCata 2.71e6 2.69e6 2.63e6 2.64e6 1.28e1 2.68e6 2.75e6
sumCo 2.71e6 2.73e6 2.63e6 2.64e6 2.52e1 2.71e6 2.73e6
sumCoCase 2.69e6 2.70e6 2.70e6 2.65e6 1.85e1 2.76e6 2.67e6
sumCoGo 2.68e6 2.77e6 2.65e6 2.63e6 1.82e1 2.75e6 2.69e6
sumF 1.24e6 1.24e6 1.24e6 1.24e6 1.26e1 1.24e6 1.27e6
sumFlip 1.16e6 1.14e6 1.14e6 1.16e6 1.35e1 1.13e6 1.16e6
sumFlipLazy 1.03e6 1.03e6 1.03e6 1.02e6 1.37e1 1.03e6 1.03e6
sumFoldr 2.76e6 2.83e6 2.66e6 2.67e6 1.82e1 2.81e6 2.74e6
sumFuse 1.45e5 1.45e5 1.45e5 1.44e5 1.39e1 1.45e5 1.45e5
sumFuseFoldl’ 1.45e5 1.45e5 1.45e5 1.46e5 1.35e1 1.45e5 1.45e5
sumFuseFoldr 1.76e6 1.77e6 1.75e6 1.75e6 1.90e1 1.76e6 1.75e6
sumFusePoly 1.45e5 2.00e5 1.45e5 1.45e5 1.32e1 1.68e5 1.45e5
sumLambda 9.35e5 9.52e5 9.35e5 9.38e5 2.48e1 9.46e5 9.44e5
sumMono 1.13e6 1.12e6 1.11e6 1.11e6 1.23e1 1.11e6 1.11e6
sumPoly 1.03e6 1.04e6 1.02e6 1.03e6 1.32e1 1.03e6 1.03e6
sumSum 1.20e6 1.22e6 1.20e6 1.19e6 1.32e1 1.22e6 1.20e6
sumTail 1.22e6 1.22e6 1.22e6 1.22e6 1.28e1 1.22e6 1.22e6
sumTailLazy 1.16e6 1.19e6 1.18e6 1.16e6 1.29e1 1.18e6 1.16e6

sums

perf-explore -n 1000 -l 1000 --sums
label1          label2          old result      new result      change

sumAux          time            1.51e4          1.46e4
sumCata         time            1.55e4          1.46e4          improvement
sumCo           time            1.55e4          2.01e4          degraded
sumCoCase       time            1.57e4          1.47e4          improvement
sumCoGo         time            1.54e4          1.69e4          slightly-degraded
sumF            time            1.33e4          2.02e4          degraded
sumFlip         time            1.41e4          1.36e4
sumFlipLazy     time            1.47e4          1.93e4          degraded
sumFoldr        time            1.56e4          1.75e4          slightly-degraded
sumFuse         time            1.66e3          2.61e3          degraded
sumFuseFoldl'   time            1.71e3          2.70e3          degraded
sumFuseFoldr    time            5.83e3          6.72e3          slightly-degraded
sumFusePoly     time            2.12e3          2.64e3          degraded
sumLambda       time            1.47e4          1.39e4          improvement
sumMono         time            1.73e4          1.40e4          improvement
sumPoly         time            1.87e4          1.36e4          improvement
sumSum          time            1.33e4          1.34e4
sumTail         time            1.47e4          1.40e4          improvement
sumTailLazy     time            1.44e4          1.34e4          improvement

lengths

perf-explore -n 1000 -l 1000 --lengths
label1          label2          old result      new result      change

lengthAux       time            1.45e4          1.60e4          slightly-degraded
lengthCo        time            1.64e4          1.39e4          improvement
lengthCoCase    time            1.78e4          1.31e4          improvement
lengthF         time            1.06e4          1.10e4
lengthFMono     time            1.05e4          1.05e4
lengthFlip      time            1.11e4          1.07e4
lengthFlipLazy  time            1.13e4          1.07e4          improvement
lengthFoldr     time            1.27e4          1.33e4          slightly-degraded
lengthFoldrConsttime            1.27e4          1.36e4          slightly-degraded
lengthTail      time            1.08e4          1.10e4
lengthTailLazy  time            1.15e4          1.03e4          improvement

Gauge comparison

perf-explore -n 1000 -l 1000 --average --gauge
sumFuse
benchmarking function ... 
function                                 time                 849.4 ns

benchmarking function ... 
function                                 time                 848.6 ns

sum
benchmarking function ... 
function                                 time                 3.786 μs

benchmarking function ... 
function                                 time                 3.791 μs

lengthF
benchmarking function ... 
function                                 time                 1.863 μs

benchmarking function ... 
function                                 time                 1.872 μs

constFuse
benchmarking function ... 
function                                 time                 571.0 ns

benchmarking function ... 
function                                 time                 1.080 μs

mapInc
benchmarking function ... 
function                                 time                 9.203 ns

benchmarking function ... 
function                                 time                 15.99 μs

noop
benchmarking function ... 
function                                 time                 5.770 ns

benchmarking function ... 
function                                 time                 4.740 ns

Space

Data is collected from GHCStats

  • allocated_bytes
  • gcs
  • gcdetails_live_bytes
  • max_live_bytes
  • max_mem_in_use_bytes
perf-explore -n 10 -l 100000 --space +RTS -T -RTS
label1          label2          results

:

sum             MaxMem          4.61e6
sum             allocated       4.19e5
sum             gcLiveBytes     2.19e5
sum             gcollects       1.00e-1
sum             maxLiveBytes    0.00e0
No golden file found. To create one, run with '-r'

Perf.BigO

Perf.BigO represents functionality to determine the complexity order for a computation.

We could do a regression and minimise the error term, but we know that the largest run contains the most information; we would need to weight the simulations according to some heuristic.

Instead, we:

  • estimate the order factor for each possible Order, from N3 to N0, setting the highest n run constant factor to zero,
  • pick the order based on lowest absolute error result summed across all the runs,
import qualified Prelude as P
import Data.List (nub)
estOrder (\x -> sum $ nub [1..x]) 10 [1,10,100,1000]
BigOrder {bigOrder = N2, bigFactor = 13.375059, bigConstant = 0.0}

References

The Haskell performance checklist

ndmitchell/spaceleak: Notes on space leaks

Core

5.13. Debugging the compiler

ghc app/speed.hs -ddump-simpl -ddump-to-file -fforce-recomp -dlint -O

haskell wiki: Looking at the Core

godbolt

ghc issue 15185: Enum instance for IntX / WordX are inefficient

fixpt - All About Strictness Analysis (part 1)

Profiling

setup

8. Profiling

A typical configuration step for profiling:

cabal configure --enable-library-profiling --enable-executable-profiling -fprof-auto -fprof -write-ghc-environment-files=always

A cabal.project.local with profiling enabled:

write-ghc-environment-files: always ignore-project: False flags: +prof +prof-auto library-profiling: True executable-profiling: True

Examples from markup-parse R&D:

Executable compilation:

ghc -prof -fprof-auto -rtsopts app/speed0.hs -threaded -fforce-recomp

Executable run:

app/speed0 +RTS -s -p -hc -l -RTS

Space usage output (-s)

885,263,472 bytes allocated in the heap
       8,507,448 bytes copied during GC
         163,200 bytes maximum residency (4 sample(s))
          27,752 bytes maximum slop
               6 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       207 colls,     0 par    0.009s   0.010s     0.0001s    0.0002s
  Gen  1         4 colls,     0 par    0.001s   0.001s     0.0004s    0.0005s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.006s  (  0.006s elapsed)
  MUT     time    0.367s  (  0.360s elapsed)
  GC      time    0.010s  (  0.011s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time    0.384s  (  0.380s elapsed)

Cost center profile (-p)

Dumped to speed0.prof

COST CENTRE MODULE                SRC                                            %time %alloc

token       MarkupParse           src/MarkupParse.hs:(259,1)-(260,20)             50.2   50.4
wrappedQ'   MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(215,1)-(217,78)   20.8   23.1
ws_         MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(135,1)-(146,4)    14.3    5.5
eq          MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:243:1-30           10.6   11.1
gather      MarkupParse           src/MarkupParse.hs:(420,1)-(428,100)             2.4    3.7
runParser   FlatParse.Basic       src/FlatParse/Basic.hs:(217,1)-(225,24)          1.0    6.0

heap analysis (-hc -l)

eventlog2html speed0.eventlog

Produces speed0.eventlog.html which contains heap charts.

Cache speed

The average cycles per + operation can get down to about 0.7 cycles, and there are about 4 cache registers per cycle, so a sum pipeline uses 2.8 register instructions per +.

Cache Cycles
register 4 per cycle
L1 Cache access 3-4 cycles
L2 Cache access 11-12 cycles
L3 unified access 30 - 40
DRAM hit 195 cycles
L1 miss 40 cycles
L2 miss >600 cycles