distances

Distances is a high-performance Nim library for calculating distances.


Keywords
math, statistics, metrics
License
Apache-2.0
Install
nimble install distances

Documentation

This library lets users calculate common distance metrics across the popular sequence-based libraries in Nim.

Supported Libraries

Currently supported sequence-based libraries include:

  1. sequtils
  2. arraymancer
  3. neo

Supported Distance Metrics

Currently supported distance metrics include:

Distance            Command (* = seq | tensor | vector)
Hamming             hamming_distance_*(x1, x2)
Euclidean           euclidean_distance_*(x1, x2)
Squared Euclidean   sqeuclidean_distance_*(x1, x2)
City Block          cityblock_distance_*(x1, x2)
Total Variation     totalvariation_distance_*(x1, x2)
Jaccard             jaccard_distance_*(x1, x2)
Cosine              cosine_distance_*(x1, x2)
KL Divergence       kldivergence_distance_*(x1, x2)
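
For readers unfamiliar with these metrics, here is a minimal sketch of three of them using only the Nim standard library. The helper names (euclideanSketch, cityblockSketch, hammingSketch) are hypothetical and the bodies are the textbook definitions, not this library's implementation. In particular, the library may scale or normalize its Hamming result differently.

```nim
import std/math

# Hypothetical helpers: textbook definitions, not the library's own code.

proc euclideanSketch(a, b: seq[float]): float =
  ## Square root of the summed squared differences.
  assert a.len == b.len
  for i in 0 ..< a.len:
    result += (a[i] - b[i]) * (a[i] - b[i])
  result = sqrt(result)

proc cityblockSketch(a, b: seq[float]): float =
  ## Sum of absolute differences (Manhattan distance).
  assert a.len == b.len
  for i in 0 ..< a.len:
    result += abs(a[i] - b[i])

proc hammingSketch(a, b: seq[int]): int =
  ## Number of positions at which the two sequences differ.
  assert a.len == b.len
  for i in 0 ..< a.len:
    if a[i] != b[i]:
      inc result

echo euclideanSketch(@[0.0, 0.0], @[3.0, 4.0])    # 5.0
echo cityblockSketch(@[0.0, 0.0], @[3.0, 4.0])    # 7.0
echo hammingSketch(@[1, 0, 1, 1], @[1, 1, 1, 0])  # 2
```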

Examples

Calculating Cosine Distance - sequtils

Note: All computations are done row-wise.

import sequtils
import distances/seq

let 
    num_rows = 100
    num_cols = 100
    input_seq_int = newSeq[int](num_cols)
    input_seq_seq_int = newSeqWith(num_rows, newSeq[int](num_cols))

# 1D distance
echo cosine_distance_seq(input_seq_int, input_seq_int)

# 2D distance (Pairwise)
echo pairwise_seq_seq(input_seq_seq_int, cosine_distance_seq[typeof(input_seq_seq_int[0][0])])
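
To see what the metric itself measures, the following sketch computes cosine distance on plain seqs with the commonly used formula 1 - (a . b) / (|a| |b|). The name cosineSketch is hypothetical, and this is an illustration under that assumed formula rather than the library's code. Note that under this formula an all-zero vector has zero norm, so the ratio is undefined; non-zero inputs are more informative when experimenting.

```nim
import std/math

proc cosineSketch(a, b: seq[float]): float =
  ## 1 - (a . b) / (|a| * |b|); hypothetical helper, not the library's code.
  assert a.len == b.len
  var dot, normA, normB = 0.0
  for i in 0 ..< a.len:
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  result = 1.0 - dot / (sqrt(normA) * sqrt(normB))

echo cosineSketch(@[1.0, 0.0], @[0.0, 1.0])  # orthogonal vectors: 1.0
echo cosineSketch(@[1.0, 2.0], @[2.0, 4.0])  # parallel vectors: close to 0.0
```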

Calculating Cosine Distance - arraymancer

Note: Only 2D tensors are supported. All computations are done column-wise.

import arraymancer
import distances/tensor

let 
    num_rows = 100
    num_cols = 100
    input_tensor_1d_int = zeros[int](1, num_cols)
    input_tensor_2d_int = zeros[int](num_rows, num_cols)

# 1D distance
echo cosine_distance_tensor(input_tensor_1d_int, input_tensor_1d_int)

# 2D distance (Pairwise)
echo pairwise_tensor(input_tensor_2d_int, cosine_distance_tensor[typeof(input_tensor_2d_int[0, 0])])

Calculating Cosine Distance - neo

Note: All computations are done column-wise. Warning: Neo matrices seem to run 1-2 orders of magnitude slower than sequtils and arraymancer. Please contact me or submit a PR if you know why.

import neo
import distances/vector

let 
    num_rows = 100
    num_cols = 100
    input_vector_int = makeVector(num_cols, proc(i: int): int = 0)
    input_matrix_int = makeMatrix(num_rows, num_cols, proc(i, j: int): int = 0)

# 1D distance
echo cosine_distance_vector(input_vector_int, input_vector_int)

# 2D distance (Pairwise)
echo pairwise_matrix(input_matrix_int, cosine_distance_vector[typeof(input_matrix_int[0, 0])])

Normalization

All distance metrics accept an optional normalize parameter (defaults to false). When set to true, it normalizes distance outputs (between -1 and 1). Note that although every distance metric accepts this parameter, it has no effect on the Jaccard, cosine, and KL divergence distances.

E.g.

discard cosine_distance_vector(input_vector_int, input_vector_int, normalize=true)
discard pairwise_matrix(input_matrix_int, cosine_distance_vector[typeof(input_matrix_int[0, 0])], normalize=true)

Symmetry

The pairwise_* procs always compute the lower left triangle of the 2D sequence to save time. To get a full matrix, use the symmetrize_*(X, how: string = "l=>u") proc.

E.g.

discard symmetrize_seq_seq(X, "l=>u")	# Copy lower left triangle to upper right triangle
discard symmetrize_seq_seq(X, "u=>l")	# Copy upper right triangle to lower left triangle
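
The mirroring step can be sketched in plain Nim as follows. The proc symmetrizeSketch is a hypothetical stand-in written for illustration; it assumes a square seq[seq[float]] and reuses the "l=>u" / "u=>l" direction strings described above, but it is not the library's symmetrize_seq_seq.

```nim
proc symmetrizeSketch(x: seq[seq[float]], how = "l=>u"): seq[seq[float]] =
  ## Returns a copy of the square matrix `x` with one triangle mirrored
  ## onto the other. Hypothetical helper, not the library's code.
  result = x
  for i in 0 ..< x.len:
    for j in 0 ..< i:  # (i, j) lies strictly below the diagonal
      case how
      of "l=>u": result[j][i] = x[i][j]  # lower left -> upper right
      of "u=>l": result[i][j] = x[j][i]  # upper right -> lower left
      else: raise newException(ValueError, "unknown direction: " & how)

# Only the lower left triangle is filled in, as a pairwise result would be.
let lower = @[@[0.0, 0.0, 0.0],
              @[1.0, 0.0, 0.0],
              @[2.0, 3.0, 0.0]]

echo symmetrizeSketch(lower)
# @[@[0.0, 1.0, 2.0], @[1.0, 0.0, 3.0], @[2.0, 3.0, 0.0]]
```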

Performance

For optimal performance, compile with the following recommended flags:

nim --cc:gcc --passC:"-fopenmp -ffast-math" --passL:"-fopenmp -ffast-math" --d:release c -r myScript.nim

  • openmp -> Multithreading for the pairwise_* procs. The number of threads is set by the OMP_NUM_THREADS environment variable.
  • ffast-math -> ~2x float multiplication speedups
  • d:release -> ~100x pairwise speedup
  • d:danger -> ~120x pairwise speedup
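
If typing the compiler and linker flags on every build is inconvenient, one option is to keep them in a config.nims file next to the script. This is a sketch that assumes the file sits in the same directory as myScript.nim; it is not something the library ships. The release (or danger) define is left on the command line here.

```nim
# config.nims (NimScript), assumed to live beside myScript.nim
switch("cc", "gcc")                        # compile with gcc
switch("passC", "-fopenmp -ffast-math")    # flags passed to the C compiler
switch("passL", "-fopenmp -ffast-math")    # flags passed to the linker

# Then build and run with, for example:
#   nim -d:release c -r myScript.nim
```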

TODO

  • Neo matrices seem to run 1-2 orders of magnitude slower than sequtils and arraymancer. The reason is unknown to me.
  • Add more distance metrics
  • Add support for distance metrics with more than 2 arguments

Performance, feature, and documentation PRs are always welcome.

Contact

I can be reached at aymanalbaz98@gmail.com