ggplotnim

A port of ggplot2 for Nim


Keywords
library, grammar of graphics, gog, ggplot2, plotting, graphics, data-visualization, hacktoberfest, nim-lang, plot
License
MIT
Install
nimble install ggplotnim

Documentation

ggplotnim - ggplot2 in Nim

This package, as the name suggests, will become a “sort of” port of ggplot2 for Nim.

It is based on the ginger package.

Currently it is in a purely prototyping state. The code as it stands is only a proof of concept for myself, to see whether ginger is at a point where it’s technically feasible to draw ggplot2-like plots, and to see how well the syntax can be ported to Nim.

On the plus side, thanks to Nim’s macro system, even the ~ function syntax works already, so that one can create a plot like so:

let plt = ggplot(mpg, aes(displ ~ cty / hwy)) +
  geom_point() 

This would create a plot of displacement vs. the ratio of city to highway mpg. All identifiers appearing in the formula are taken to be strings, which should appear in the data frame we give to ggplot (currently it’s just using Table[string, seq[string]]).

The formula mentioned will be stored as (~ displ (/ cty hwy)) and a proc can be used to apply the mathematical functions in the correct order to a given data frame. At the moment, however, this input to aes is not implemented; for a working proof of concept, check out the tests.
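How such a stored formula can be printed and applied to a data frame can be sketched with the standard library alone. This is only an illustration under simplifying assumptions: the names Node, col, binary and evaluate are hypothetical and not part of ggplotnim, and the columns hold floats instead of strings.

```nim
import std/tables

type
  NodeKind = enum
    nodeColumn, nodeBinary
  Node = ref object          # hypothetical stand-in for the formula tree
    case kind: NodeKind
    of nodeColumn:
      name: string           # name of a column in the data frame
    of nodeBinary:
      op: char               # one of '+', '-', '*', '/', '~'
      lhs, rhs: Node

proc col(name: string): Node = Node(kind: nodeColumn, name: name)

proc binary(op: char, lhs, rhs: Node): Node =
  Node(kind: nodeBinary, op: op, lhs: lhs, rhs: rhs)

proc `$`(n: Node): string =
  ## Prints the tree in Lisp-like prefix notation.
  case n.kind
  of nodeColumn: result = n.name
  of nodeBinary: result = "(" & $n.op & " " & $n.lhs & " " & $n.rhs & ")"

proc evaluate(n: Node, df: Table[string, seq[float]], idx: int): float =
  ## Evaluates the (sub)tree for row `idx`, innermost operations first.
  case n.kind
  of nodeColumn: result = df[n.name][idx]
  of nodeBinary:
    let a = evaluate(n.lhs, df, idx)
    let b = evaluate(n.rhs, df, idx)
    case n.op
    of '+': result = a + b
    of '-': result = a - b
    of '*': result = a * b
    of '/': result = a / b
    else: raise newException(ValueError, "unsupported operator: " & $n.op)

# displ ~ cty / hwy, built by hand instead of via macros
let formula = binary('~', col("displ"), binary('/', col("cty"), col("hwy")))
let df = {"cty": @[10.0, 30.0], "hwy": @[20.0, 40.0]}.toTable

echo formula
for i in 0 ..< 2:
  # only the right-hand side of `~` is a calculation
  echo evaluate(formula.rhs, df, i)
```

In this sketch the ~ node only records which quantity depends on which expression, so only its right-hand side is evaluated.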

Dependencies

For anyone brave enough to try to run this code at the moment, a few words on dependencies.

My fork of seqmath is required: https://github.com/vindaar/seqmath

The cairo wrapper:

nimble install cairo

And a branch of chroma for HCL support (to calculate the ggplot2 colors). Once the PR is merged, the latest version of chroma will be fine: https://github.com/treeform/chroma/pull/4

With these the code should hopefully compile just fine.

Currently working features

Geoms:

  • geom_point
  • geom_line
  • geom_histogram
  • geom_freqpoly
  • geom_bar

Facets:

  • facet_wrap

Scales:

  • size (both for discrete and continuous data)
  • color (both for discrete and continuous data)

Shape as a scale is not properly implemented, simply because ginger only provides 2 different marker shapes (circle, cross) so far. Feel free to add more!

Data frame

The library implements a naive data frame, which provides the “5 verbs” of dplyr. Implemented functions:

  • filter
  • mutate, transmute
  • select, rename
  • arrange
  • summarize

and also group_by, which are all based on the FormulaNode object. Basically they all receive varargs[FormulaNode], which are evaluated in the context of the given data frame.
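To make the idea of these verbs concrete, here is a minimal sketch of filter-like and mutate-like operations on a naive column-based data frame. It is an assumption-laden illustration, not the library's implementation: DataFrame, nRows, filterRows and mutateCol are hypothetical names, columns hold floats only, and plain closures take the place of FormulaNode arguments.

```nim
import std/tables

type DataFrame = OrderedTable[string, seq[float]]  # hypothetical naive data frame

proc nRows(df: DataFrame): int =
  ## Number of rows, taken from the first column (0 if there are no columns).
  for column in df.values:
    return column.len

proc filterRows(df: DataFrame, keep: proc(row: int): bool): DataFrame =
  ## Keeps only the rows for which `keep` returns true.
  for key, column in df.pairs:
    var kept: seq[float]
    for i, v in column:
      if keep(i): kept.add v
    result[key] = kept

proc mutateCol(df: DataFrame, name: string,
               fn: proc(row: int): float): DataFrame =
  ## Returns a copy of `df` with an additional column computed row by row.
  result = df
  var column = newSeq[float](df.nRows)
  for i in 0 ..< column.len:
    column[i] = fn(i)
  result[name] = column

let df = {"cty": @[18.0, 21.0, 11.0],
          "hwy": @[29.0, 30.0, 15.0]}.toOrderedTable

# keep rows with cty > 15, then add the ratio of hwy to cty
let efficient = df.filterRows(proc(i: int): bool = df["cty"][i] > 15.0)
let withRatio = efficient.mutateCol("ratio",
  proc(i: int): float = efficient["hwy"][i] / efficient["cty"][i])

echo withRatio
```

Each verb returns a new data frame, so calls can be chained with UFCS in the same way as the plotting examples chain filter, mutate and ggplot.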

Creation of a FormulaNode can be done directly via untyped templates acting on +, -, *, /, ~. Using the mpg data set as an example:

let f = displ ~ hwy / cty

would describe the dependence of the displacement (displ) on the ratio of the highway to the city mpg. Echoing this formula prints it as a Lisp-like tree:

(~ displ (/ hwy cty))

Note that the ~ in the untyped templates always acts as the root node of the resulting tree. The LHS of it is always considered the dependent quantity. In these templates, however, the identifiers are converted to strings and must match the names in the data frame!

f{} macro to create formulas

The second way to create a FormulaNode is via the f{} macro. This provides a little more flexibility:

let f = f{ "displ" ~ "hwy" / mean("cty") }

Note that here all keys must be explicit strings. Everything that is not a string will be interpreted in the calling scope.

If the identifier is the first element of an nnkCall, e.g. as in mean("cty"), it will be stored in a FormulaNode of kind fkFunction. An fkFunction itself may contain two different kinds of functions, as evident from the implementation:

# storing a function to be applied to the data
fnName: string
arg: FormulaNode
case fnKind*: FuncKind
of funcVector:
  fnV: proc(s: PersistentVector[Value]): Value
  res: Option[Value] # the result of fn(arg), so that we can cache it
                     # instead of recalculating it for every index potentially
of funcScalar:
  fnS: proc(s: Value): Value

We store the name of the function as a string for debugging and echoing. The function must take only a single argument (this may change in the future, or we may wrap a function with multiple arguments in a template). It can either be a procedure taking a vector of Values, i.e. a proc working on a whole column as its input (e.g. mean), or a scalar function taking a single Value (e.g. abs). In the latter case the function is applied to each index of the data frame column given by arg.

Lifting templates are provided to lift procs so that they work on Values:

  • liftVector[T]Proc: lifts a proc(s: seq[T]): T to a proc(s: PersistentVector[Value]): Value
  • liftScalar[T]Proc: lifts a proc(s: T): T to a proc(s: Value): Value

where T may be float, int, string.

The PersistentVector is an implementation detail of the data frame at the moment and may be changed back to seq soon.
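What lifting means here can be sketched as follows. This is a simplified illustration, not the library's code: the Value type below is a hypothetical reduced variant, a plain seq[Value] is used instead of PersistentVector[Value], procs are used instead of templates, and toValue, asFloat, liftVectorFloat and liftScalarFloat are made-up names.

```nim
import std/math

type
  ValueKind = enum
    vkInt, vkFloat, vkString
  Value = object             # hypothetical, reduced variant type
    case kind: ValueKind
    of vkInt: intVal: int
    of vkFloat: floatVal: float
    of vkString: strVal: string

proc toValue(x: float): Value = Value(kind: vkFloat, floatVal: x)
proc toValue(x: int): Value = Value(kind: vkInt, intVal: x)

proc asFloat(v: Value): float =
  ## Unwraps a numeric Value, raising for strings.
  case v.kind
  of vkFloat: result = v.floatVal
  of vkInt: result = v.intVal.float
  of vkString: raise newException(ValueError, "not a number: " & v.strVal)

proc liftVectorFloat(fn: proc(s: seq[float]): float): proc(s: seq[Value]): Value =
  ## Wraps a proc working on a whole float column so that it accepts Values.
  result = proc(s: seq[Value]): Value =
    var raw = newSeq[float](s.len)
    for i, v in s:
      raw[i] = v.asFloat
    toValue(fn(raw))

proc liftScalarFloat(fn: proc(x: float): float): proc(v: Value): Value =
  ## Wraps a proc working on a single float so that it accepts a Value.
  result = proc(v: Value): Value = toValue(fn(v.asFloat))

proc mean(s: seq[float]): float = s.sum / s.len.float
proc absFloat(x: float): float = abs(x)

let liftedMean = liftVectorFloat(mean)
let liftedAbs = liftScalarFloat(absFloat)

let column = @[toValue(1), toValue(2.0), toValue(3.0)]
echo liftedMean(column).asFloat
echo liftedAbs(toValue(-4.0)).asFloat
```

The vector version is called once per column, which is why caching its result (as the res field above does) pays off, while the scalar version is called once per row.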

On the other hand, if an identifier is not part of an nnkCall, it is interpreted as a variable declared in the calling scope and will be converted to a Value using % and stored as an fkVariable.

Literal integer and float values are also allowed.

Examples

Using a lifted vector valued function and local variables as keys and integer values:

let val = 1000
let key = "cty"
let f = f{"cty_norm" ~ "cty" / mean(key) * val}

Using a lifted scalar valued function and a numeric literal for an arbitrary calculation:

let g = f{"cty_by_2ln_hwy" ~ "cty" / (ln("hwy") * 2)}

Examples

The following are just the first plots I reproduced. The mpg dataset being used has to be read via the readCsv proc and be converted to a data frame via toDf. The file is located at data/mpg.csv in the repository. So the header of all examples below is simply:

import ggplotnim

let mpg = toDf(readCsv("data/mpg.csv"))

where it is assumed the current working directory is the ggplotnim dir.

Scatter of displ ~ hwy

Simple scatter plot of two quantities "displ" vs. "hwy" of a dataframe.

ggplot(mpg, aes(x = "displ", y = "hwy")) +
  geom_point() + 
  ggsave("scatter.pdf")

Note: if the ggsave call is omitted, the return value will be a GgPlot object, which can either be inspected or modified or called upon with ggsave at a later time.

media/scatter.png

Scatter of displ ~ hwy, class as color scale

Same scatter plot as above, but with a grouping by a third quantity "class" encoded in the dot color. Also adds a title to the plot.

ggplot(mpg, aes(x = "displ", y = "cty", color = "class")) +
  geom_point() +
  ggtitle("ggplotnim - or I Suck At Naming Things™") +
  ggsave("scatterColor.pdf")

media/scatterColor.png

Filtering data frame before plotting

We may now also perform some operations on the data frame, before we plot it. For instance we can filter on a string (or a number) and perform calculations on columns:

mpg.filter(f{"class" == "suv"}) # comparison via `f{}` macro
  .mutate(ratioHwyToCity ~ hwy / cty # raw untyped template function definition
  ) # <- note that we have to use normal UFCS to hand to `ggplot`!
  .ggplot(aes(x = "ratioHwyToCity", y = "displ", color = "class")) + 
  geom_point() +
  ggsave("scatterFromDf.pdf")

And eeehm, I guess the legend is broken if we only have a single entry…

media/scatterFromDF.png

Mutating via local procedure

In addition we can use locally defined procedures in the f{} macro as well (see above for caveats). For instance we can normalize a column by dividing by the mean:

mpg.mutate(f{"cty_norm" ~ "cty" / mean("cty")}) # divide cty by mean
  .ggplot(aes(x = "displ", y = "cty_norm", color = "class")) +
  geom_point() +
  ggsave("classVsNormCty.pdf")

Note that calculations involving explicit numbers or constants are not supported yet. For that the implementation of FormulaNode must be changed to use Value as well.

media/classVsNormCty.png

Histogram of hwy

A simple histogram of one quantity "hwy" of a dataframe.

ggplot(mpg, aes("hwy")) +
  geom_histogram() +
  ggsave("simpleHisto.pdf")

media/simpleHisto.png
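The binning behind a histogram like this can be sketched in a few lines. This is a generic equal-width binning illustration, assuming nothing about how ggplotnim chooses its bins; binCounts and the sample values are hypothetical.

```nim
proc binCounts(data: seq[float], nBins: int): seq[int] =
  ## Counts how many values fall into each of `nBins` equal-width bins
  ## spanning the range from the smallest to the largest value.
  result = newSeq[int](nBins)
  if data.len == 0: return
  let lo = data.min
  let width = (data.max - lo) / nBins.float
  for x in data:
    # if all values are identical, everything lands in the first bin
    let raw = if width > 0.0: int((x - lo) / width) else: 0
    # the maximum would land one past the end, so put it in the last bin
    let idx = min(raw, nBins - 1)
    inc result[idx]

# made-up sample values, not taken from the mpg data set
let counts = binCounts(@[1.0, 2.0, 3.0, 4.0], 2)
echo counts
```

A frequency line (as in the next example) would use the same counts, drawn as connected points instead of bars.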

Frequency line plot

Same as the histogram above, but as a frequency line.

ggplot(mpg, aes("hwy")) +
  geom_freqpoly() +
  ggsave("freqpoly.pdf")

media/freqpoly.png

Combining several geoms, setting aesthetics of specific geoms

A combination of a histogram and a frequency line plot. Also showcases the ability to set aesthetics of specific geoms to a constant value (in this case change line width and color of the freqpoly line). Note that the order in which the geom_* functions are called is also the order in which they are drawn.

ggplot(mpg, aes("hwy")) +
  geom_histogram() +
  geom_freqpoly(color = parseHex("FD971F"),
                size = 3.0) +
  ggsave("histoPlusFreqpoly.pdf")

media/histoPlusFreqpoly.png

Facet wrap of manufacturer

Although still somewhat ugly, because the scaling is off, facet wrapping is working in principle:

ggplot(mpg, aes("displ", "hwy")) +
  geom_point(aes(color = "manufacturer")) +
  facet_wrap(~ class) +
  ggsave("facet_wrap_manufacturer.pdf")

media/facet_wrap_manufacturer.png

Simple bar plot

A simple bar plot of a variable with discrete data (typically a column of strings, bools or a small subset of ints).

ggplot(mpg, aes(x = "class")) +
  geom_bar() +
  ggsave("bar_example.pdf")

media/bar_example.png
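For discrete data the bar heights are simply the number of occurrences of each label. A minimal sketch of that counting step using only the standard library (the sample labels are made up, and how ggplotnim counts internally is not shown here):

```nim
import std/tables

# made-up sample of a discrete column
let classes = @["compact", "suv", "compact", "midsize", "suv", "compact"]

# one entry per distinct label, holding its number of occurrences
let counts = classes.toCountTable

for label, n in counts.pairs:
  echo label, ": ", n
```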

Experimental Vega-Lite backend

From the beginning one of my goals for this library was to provide not only a Cairo backend, but also to support Vega-Lite (or possibly Vega) as a backend. To share plots and data online (and possibly add support for interactive features) is much easier in such a way.

For now only a proof of concept is implemented in vega_utils.nim. That is, only geom_point with the "x", "y" and "color" scales set on the main aesthetic is supported. Generalizing this is mostly a tedious process, since the GgPlot object fields etc. have to be mapped to the appropriate Vega-Lite JSON nodes.

A simple example:

let vegaJson = ggplot(mpg, aes(x = "displ", y = "cty", color = "class")) +
  geom_point() +
  ggtitle("ggplotnim - or I Suck At Naming Things") +
  ggvega()
show(vegaJson)

creates the equivalent plot from above using Vega-Lite. Note that it still uses the Vega-Lite default theming.

It generates the following Vega-Lite JSON:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "description" : "Vega-lite plot created by ggplotnim",
  "width" : 640,
  "height" : 480,
  "title": "ggplotnim - or I Suck At Naming Things",
  "data": {"values" : [{"displ": 1.8, "cty": 18.0, "class": "compact"},
                       {"displ": 1.8, "cty": 21.0, "class": "compact"},
                       {"displ": 2.0, "cty": 20.0, "class": "compact"},
                       ... ]},
  "mark": "point",
  "encoding": {
    "x": {"field": "displ", "type": "quantitative"},
    "y": {"field": "cty", "type": "quantitative"},
    "color": {"field": "class", "type": "nominal"}
   }
}

And results in the following Vega-Lite plot:

media/vega_backend_example.png

Or if you want to look at the interactive version in your browser, see here:

Open in vega browser
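The kind of mapping from plot fields to Vega-Lite JSON nodes described above can be sketched with std/json. This is not the code in vega_utils.nim: PointPlot is a hypothetical, heavily simplified stand-in for the GgPlot object, toVegaLite is a made-up name, and the field types are hard-coded.

```nim
import std/json

type
  PointPlot = object         # hypothetical stand-in for GgPlot
    title, x, y, color: string
    data: JsonNode           # JSON array with one object per row

proc toVegaLite(p: PointPlot): JsonNode =
  ## Maps the plot fields to the corresponding Vega-Lite JSON nodes.
  result = %*{
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "title": p.title,
    "data": {"values": p.data},
    "mark": "point",
    "encoding": {
      "x": {"field": p.x, "type": "quantitative"},
      "y": {"field": p.y, "type": "quantitative"},
      "color": {"field": p.color, "type": "nominal"}
    }
  }

let plot = PointPlot(
  title: "example", x: "displ", y: "cty", color: "class",
  data: %*[{"displ": 1.8, "cty": 18.0, "class": "compact"}])

let spec = toVegaLite(plot)
echo spec.pretty
```

Supporting more geoms would mostly mean choosing a different "mark" and filling further "encoding" channels from the respective aesthetics.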

Known issues / limitations

  • customization is very limited (font size, point sizes, line widths etc.). ginger provides the functionality, but it’s not exposed in ggplotnim atm. Extend Theme object for this, add args to procs where applicable.
  • log10 plots force x and y range to be of orders of 10
  • facet wrap layout is quite ugly still

Legends

  • legend is not always centered (easy to fix)
  • plots with two legends produce overlapping legends (easy to fix)
  • plots with continuous color scale produce no legend