Distances
Distances is a high-performance Nim library for calculating distances.
This library lets users calculate common distance metrics across the popular sequence-based libraries in Nim.
Supported Libraries
Currently supported sequence-based libraries include:
- sequtils (seq)
- arraymancer (tensor)
- neo (vector)
Supported Distance Metrics
Currently supported distance metrics include:
Distance | Command (* = seq, tensor, or vector) |
---|---|
Hamming | hamming_distance_*(x1, x2) |
Euclidean | euclidean_distance_*(x1, x2) |
Squared Euclidean | sqeuclidean_distance_*(x1, x2) |
City Block | cityblock_distance_*(x1, x2) |
Total Variation | totalvariation_distance_*(x1, x2) |
Jaccard | jaccard_distance_*(x1, x2) |
Cosine | cosine_distance_*(x1, x2) |
KL Divergence | kldivergence_distance_*(x1, x2) |
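For orientation, the usual textbook definitions of several of these metrics for two length-n inputs x and y are sketched below. These are assumptions about what each proc computes: the source does not state the library's exact conventions (for example the logarithm base used for KL divergence), so read them as a reference point rather than as the library's specification.

```latex
\begin{aligned}
d_{\text{euclidean}}(x, y)   &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\text{sqeuclidean}}(x, y) &= \sum_{i=1}^{n} (x_i - y_i)^2 \\
d_{\text{cityblock}}(x, y)   &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert \\
d_{\text{cosine}}(x, y)      &= 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \\
d_{\text{KL}}(x \,\|\, y)    &= \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i}
\end{aligned}
```

Under these definitions the cosine distance is undefined when either input is all zeros, and KL divergence is not symmetric in its two arguments.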
Examples
Calculating Cosine Distance - sequtils
Note: All computations are done row-wise.
import sequtils
import distances/seq
let
  num_rows = 100
  num_cols = 100
  input_seq_int = newSeq[int](num_cols)
  input_seq_seq_int = newSeqWith(num_rows, newSeq[int](num_cols))
# 1D distance
echo cosine_distance_seq(input_seq_int, input_seq_int)
# 2D distance (Pairwise)
echo pairwise_seq_seq(input_seq_seq_int, cosine_distance_seq[typeof(input_seq_seq_int[0][0])])
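To make the computation concrete, here is a minimal standalone sketch of a cosine distance over two plain seqs. The proc name cosineDistanceSketch is hypothetical and is not part of this library; the sketch assumes the textbook definition (one minus the cosine similarity), float inputs, and non-zero vectors, and says nothing about how cosine_distance_seq is actually implemented.

```nim
import std/math

# Hypothetical helper for illustration only; not the library's implementation.
proc cosineDistanceSketch(a, b: seq[float]): float =
  ## 1 - (a . b) / (|a| * |b|); assumes equal lengths and non-zero vectors.
  doAssert a.len == b.len, "inputs must have the same length"
  var dot, normA, normB = 0.0
  for i in 0 ..< a.len:
    dot += a[i] * b[i]       # running dot product
    normA += a[i] * a[i]     # squared norm of a
    normB += b[i] * b[i]     # squared norm of b
  result = 1.0 - dot / (sqrt(normA) * sqrt(normB))

# Orthogonal vectors have cosine similarity 0, so the distance is 1.
echo cosineDistanceSketch(@[1.0, 0.0], @[0.0, 1.0])
```

With all-zero inputs, as produced by newSeq in the example above, this textbook formula divides zero by zero, so non-zero data gives a more meaningful result.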
Calculating Cosine Distance - arraymancer
Note: Only 2D Tensors are supported. All computations are done column-wise.
import arraymancer
import distances/tensor
let
  num_rows = 100
  num_cols = 100
  input_tensor_1d_int = zeros[int](1, num_cols)
  input_tensor_2d_int = zeros[int](num_rows, num_cols)
# 1D distance
echo cosine_distance_tensor(input_tensor_1d_int, input_tensor_1d_int)
# 2D distance (Pairwise)
echo pairwise_tensor(input_tensor_2d_int, cosine_distance_tensor[typeof(input_tensor_2d_int[0, 0])])
Calculating Cosine Distance - neo
Note: All computations are done column-wise.
Warning: Neo matrices seem to run 1-2 orders of magnitude slower than sequtils and arraymancer. Please contact me or submit a PR if you know why.
import neo
import distances/vector
let
  num_rows = 100
  num_cols = 100
  input_vector_int = makeVector(num_cols, proc(i: int): int = 0)
  input_matrix_int = makeMatrix(num_rows, num_cols, proc(i, j: int): int = 0)
# 1D distance
echo cosine_distance_vector(input_vector_int, input_vector_int)
# 2D distance (Pairwise)
echo pairwise_matrix(input_matrix_int, cosine_distance_vector[typeof(input_matrix_int[0, 0])])
Normalization
All distance metrics support the optional normalize parameter (defaults to false). This normalizes distance outputs (between -1 and 1). Note that while all distance metrics accept this parameter, it has no effect for Jaccard, cosine, and KL divergence distances.
E.g.
discard cosine_distance_vector(input_vector_int, input_vector_int, normalize=true)
discard pairwise_matrix(input_matrix_int, cosine_distance_vector[typeof(input_matrix_int[0, 0])], normalize=true)
Symmetry
The pairwise_* procs always compute the lower left triangle of the 2D sequence to save time. To get a full matrix, use the symmetrize_*(X, how: string = "l=>u") proc.
E.g.
discard symmetrize_seq_seq(X, "l=>u") # Copy lower left triangle to upper right triangle
discard symmetrize_seq_seq(X, "u=>l") # Copy upper right triangle to lower left triangle
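The mirroring step can be sketched in a few lines of plain Nim. The proc symmetrizeSketch below is a hypothetical stand-in written for illustration, not the library's symmetrize_seq_seq; it assumes a square seq[seq[float]] and reuses the "l=>u" / "u=>l" strings from the examples above.

```nim
# Hypothetical helper for illustration only; not the library's symmetrize_seq_seq.
proc symmetrizeSketch(x: seq[seq[float]], how = "l=>u"): seq[seq[float]] =
  ## Returns a copy of the square matrix `x` with one triangle mirrored onto the other.
  result = x
  for i in 0 ..< x.len:
    for j in 0 ..< i:              # (i, j) is always below the diagonal
      if how == "l=>u":
        result[j][i] = x[i][j]     # copy lower-left entry to upper right
      else:
        result[i][j] = x[j][i]     # copy upper-right entry to lower left

# Only the lower-left triangle is filled in, as a pairwise proc would leave it.
let lower = @[@[0.0, 0.0], @[2.0, 0.0]]
echo symmetrizeSketch(lower)
```

Mirroring is only meaningful for metrics that are symmetric in their arguments, which is why computing a single triangle can save roughly half the work.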
Performance
To get optimal performance, here are the recommended compiler flags:
nim --cc:gcc --passC:"-fopenmp -ffast-math" --passL:"-fopenmp -ffast-math" --d:release c -r myScript.nim
- openmp -> Multithreading for the pairwise_* procs. The number of threads is set by the OMP_NUM_THREADS environment variable.
- ffast-math -> ~2x float multiplication speedups
- d:release -> ~100x pairwise speedup
- d:danger -> ~120x pairwise speedup
TODO
- Neo matrices seem to run 1-2 orders of magnitude slower than sequtils and arraymancer. The reason is unknown to me.
- Add more distance metrics
- Add support for distance metrics with more than 2 arguments
Performance, feature, and documentation PRs are always welcome.
Contact
I can be reached at aymanalbaz98@gmail.com