PyDistances: A Statistical Distances Python Package

PyDistances is a package for computing statistical distances between observations, including Euclidean, Minkowski, Canberra, Pearson, Mahalanobis, Robust Mahalanobis, Gower, Generalized Gower and Related Metric Scaling (RelMS). In total, 41 statistical distances can be computed.

Installation

pip install PyDistances

Example of use

import itertools

import numpy as np
import pandas as pd

import PyDistances

from PyDistances import (
    Euclidean_Dist, Euclidean_Dist_Matrix,
    Minkowski_Dist, Minkowski_Dist_Matrix,
    Canberra_Dist, Canberra_Dist_Matrix,
    Pearson_Dist, Pearson_Dist_Matrix,
    Mahalanobis_Dist, Mahalanobis_Dist_Matrix,
    a_b_c_d_Matrix,
    Sokal_Similarity, Sokal_Dist, Sokal_Dist_Matrix,
    Jaccard_Similarity, Jaccard_Dist, Jaccard_Dist_Matrix,
    alpha,
    Matching_Similarity, Matching_Dist, Matching_Dist_Matrix,
    Gower_Similarity_Matrix, Gower_Dist_Matrix,
    Robust_Mahalanobis_Dist, Robust_Mahalanobis_Dist_Matrix,
    GeneralizedGowerDistance,
)

Getting data

We load the data set used throughout this tutorial. It is available at the following link: https://github.com/FabioScielzoOrtiz/Distances_Package/blob/master/Tests/House_Price.csv

Data = pd.read_csv('House_Price.csv')

Data = Data.loc[0:150, ['latitude', 'longitude', 'price', 'size_in_m_2', 'balcony_recode', 'private_garden_recode', 'private_gym_recode', 'quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]

Data_quant = Data.loc[:,['latitude', 'longitude', 'price', 'size_in_m_2']]
Data_binary = Data.loc[:,['balcony_recode', 'private_garden_recode', 'private_gym_recode']]
Data_multiclass = Data.loc[:,['quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]

Data.head()  # p1=4 quantitative, p2=3 binary and p3=3 multiclass variables

latitude	longitude	price	size_in_m_2	balcony	quality	no_of_bathrooms	no_of_bedrooms
25.1132	55.1389	2.7e+06	100.242	1	2	2	1
25.1068	55.1512	2.85e+06	146.973	1	2	2	2
25.0633	55.1377	1.15e+06	181.254	1	2	5	3
25.2273	55.3418	2.85e+06	187.664	1	1	3	2
25.1143	55.1398	1.7292e+06	47.1018	0	2	1	0

Computing Euclidean distance

We compute the Euclidean distance between the observation at index 0 and itself.

Euclidean_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:])

0.0

We compute the Euclidean distance between the observation at index 0 and the one at index 2.

Euclidean_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:])

 1550000.002117049
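The value above is almost exactly the price difference between the two observations, because price is several orders of magnitude larger than the other variables. The following is a minimal sketch in plain Python, not the package's implementation: the helper name euclidean is hypothetical, and the inputs are the rounded values printed by Data.head(), so the last digits may differ from the package output.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Rounded values of observations 0 and 2 as printed by Data.head():
# (latitude, longitude, price, size_in_m_2)
obs_0 = (25.1132, 55.1389, 2.7e6, 100.242)
obs_2 = (25.0633, 55.1377, 1.15e6, 181.254)

dist = euclidean(obs_0, obs_2)
price_gap = abs(obs_0[2] - obs_2[2])

# The distance exceeds the price gap only marginally.
print(dist, price_gap)
```

Distances that rescale the variables, such as Pearson or Mahalanobis, do not let a single large-scale variable dominate in this way.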

We compute the Euclidean distance matrix for the data set Data_quant.

Euclidean_Dist_Matrix(Data_quant)

array([[       0.        ,   150000.00727904,  1550000.00211705, ...,
         1500000.00009635,  2700000.01899102, 12100000.00553371],
       [  150000.00727904,        0.        ,  1700000.00034565, ...,
         1650000.00026782,  2550000.0146678 , 11950000.00426352],
       [ 1550000.00211705,  1700000.00034565,        0.        , ...,
           50000.040973  ,  4250000.00673279, 13650000.00297389],
       ...,
       [ 1500000.00009635,  1650000.00026782,    50000.040973  , ...,
               0.        ,  4200000.01094663, 13600000.00447653],
       [ 2700000.01899102,  2550000.0146678 ,  4250000.00673279, ...,
         4200000.01094663,        0.        ,  9400000.00011113],
       [12100000.00553371, 11950000.00426352, 13650000.00297389, ...,
        13600000.00447653,  9400000.00011113,        0.        ]])

Now we repeat the same procedure with the other distances available in PyDistances.

Computing Minkowski distance

Minkowski_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], q=1)

0.0

Minkowski_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], q=1)

 1550081.062526
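With q=1 the Minkowski distance is the sum of absolute coordinate differences (Manhattan distance), and with q=2 it coincides with the Euclidean distance. A minimal sketch in plain Python, assuming the usual definition and using a hypothetical helper name minkowski together with the rounded values printed by Data.head():

```python
def minkowski(x, y, q):
    """(sum_k |x_k - y_k|^q)^(1/q); q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Rounded values of observations 0 and 2 as printed by Data.head():
# (latitude, longitude, price, size_in_m_2)
obs_0 = (25.1132, 55.1389, 2.7e6, 100.242)
obs_2 = (25.0633, 55.1377, 1.15e6, 181.254)

d_q1 = minkowski(obs_0, obs_2, q=1)  # close to the value shown above
d_q2 = minkowski(obs_0, obs_2, q=2)  # close to the Euclidean distance
print(d_q1, d_q2)
```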

Minkowski_Dist_Matrix(Data_quant, q=1)

array([[       0.      ,   150046.748877,  1550081.062526, ...,
         1500017.050769,  2700320.266531, 12100365.997115],
       [  150046.748877,        0.      ,  1700034.338187, ...,
         1650029.78435 ,  2550273.554024, 11950319.272776],
       [ 1550081.062526,  1700034.338187,        0.      , ...,
           50064.027555,  4250239.302851, 13650284.955165],
       ...,
       [ 1500017.050769,  1650029.78435 ,    50064.027555, ...,
               0.      ,  4200303.29563 , 13600348.947944],
       [ 2700320.266531,  2550273.554024,  4250239.302851, ...,
         4200303.29563 ,        0.      ,  9400045.764238],
       [12100365.997115, 11950319.272776, 13650284.955165, ...,
        13600348.947944,  9400045.764238,        0.      ]])

Computing Canberra distance

Canberra_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:])

0.0

Canberra_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:])

 0.6913917083019879
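The Canberra distance divides each absolute difference by the sum of the absolute values of the two coordinates, so every variable contributes at most 1 regardless of its scale. A minimal sketch in plain Python, assuming the usual definition; the helper name canberra is hypothetical, and skipping terms where both coordinates are zero is an assumed convention, not necessarily what the package does:

```python
def canberra(x, y):
    """Sum over variables of |x_k - y_k| / (|x_k| + |y_k|)."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # assumed convention: a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total

# Rounded values of observations 0 and 2 as printed by Data.head():
# (latitude, longitude, price, size_in_m_2)
obs_0 = (25.1132, 55.1389, 2.7e6, 100.242)
obs_2 = (25.0633, 55.1377, 1.15e6, 181.254)

d = canberra(obs_0, obs_2)
print(d)  # close to the value shown above, up to rounding of the inputs
```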

Canberra_Dist_Matrix(Data_quant)

array([[0.        , 0.21629237, 0.69139171, ..., 0.463675  , 0.9485963 ,
        1.33838751],
       [0.21629237, 0.        , 0.53043317, ..., 0.52079671, 0.79157752,
        1.19854721],
       [0.69139171, 0.53043317, 0.        , ..., 0.23597883, 1.04765637,
        1.29619958],
       ...,
       [0.463675  , 0.52079671, 0.23597883, ..., 0.        , 1.20126891,
        1.44813664],
       [0.9485963 , 0.79157752, 1.04765637, ..., 1.20126891, 0.        ,
        0.51782969],
       [1.33838751, 1.19854721, 1.29619958, ..., 1.44813664, 0.51782969,
        0.        ]])

Computing Pearson distance

Pearson_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], variance=Data_quant.var())

0.0

Pearson_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], variance=Data_quant.var())

 1.5393297661160206
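The Pearson distance is commonly defined as a Euclidean distance in which each squared difference is divided by the variance of its variable, which makes it invariant to the units of each variable. The sketch below assumes that definition; the helper name pearson and the toy data are hypothetical and are not taken from House_Price.csv.

```python
import math
import statistics

def pearson(x, y, variances):
    """sqrt(sum_k (x_k - y_k)^2 / s_k^2), with s_k^2 the variance of variable k."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Hypothetical toy data: rows are observations, columns are two variables.
rows = [(100.0, 1.0), (300.0, 2.0), (200.0, 3.0)]
variances = [statistics.variance(col) for col in zip(*rows)]  # sample variances
d = pearson(rows[0], rows[1], variances)

# Changing the units of the first variable leaves the distance unchanged.
rescaled = [(1000.0 * u, v) for u, v in rows]
variances_rescaled = [statistics.variance(col) for col in zip(*rescaled)]
d_rescaled = pearson(rescaled[0], rescaled[1], variances_rescaled)
print(d, d_rescaled)
```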

Pearson_Dist_Matrix(Data_quant)

array([[0.        , 0.63961801, 1.53932977, ..., 1.03084131, 4.32943281,
        7.47171915],
       [0.63961801, 0.        , 1.20505141, ..., 1.09780711, 3.76643257,
        7.04893716],
       [1.53932977, 1.20505141, 0.        , ..., 0.84617436, 3.79891055,
        7.4670243 ],
       ...,
       [1.03084131, 1.09780711, 0.84617436, ..., 0.        , 4.44143053,
        7.87905955],
       [4.32943281, 3.76643257, 3.79891055, ..., 4.44143053, 0.        ,
        4.57460318],
       [7.47171915, 7.04893716, 7.4670243 , ..., 7.87905955, 4.57460318,
        0.        ]])

Computing Mahalanobis distance

Mahalanobis_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], S_inv=np.linalg.inv( np.cov(Data_quant , rowvar=False) ))

0.0

Mahalanobis_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], S_inv=np.linalg.inv( np.cov(Data_quant , rowvar=False) ))

  2.7671855371187757
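The Mahalanobis distance weights the difference vector by the inverse covariance matrix, which is why S_inv is passed in the calls above. A minimal two-variable sketch in plain Python, assuming the usual definition sqrt((x - y)' S^-1 (x - y)); the helper name mahalanobis_2d and the numbers are hypothetical. It shows that the distance reduces to the Euclidean distance for an identity covariance and to a variance-scaled distance for a diagonal one.

```python
import math

def mahalanobis_2d(x, y, cov):
    """sqrt((x - y)^T S^{-1} (x - y)) for a 2x2 covariance matrix S."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)

x, y = (3.0, 4.0), (0.0, 0.0)
identity = ((1.0, 0.0), (0.0, 1.0))
diagonal = ((9.0, 0.0), (0.0, 16.0))

d_identity = mahalanobis_2d(x, y, identity)  # same as the Euclidean distance
d_diagonal = mahalanobis_2d(x, y, diagonal)  # each squared difference divided by its variance
print(d_identity, d_diagonal)
```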

Mahalanobis_Dist_Matrix(Data_quant)

array([[0.        , 0.92801614, 2.76718554, ..., 1.52541554, 5.21105193,
        6.45997793],
       [0.92801614, 0.        , 1.96135599, ..., 0.98693199, 4.43479282,
        6.2920865 ],
       [2.76718554, 1.96135599, 0.        , ..., 1.3592188 , 3.4307313 ,
        7.27986558],
       ...,
       [1.52541554, 0.98693199, 1.3592188 , ..., 0.        , 4.41360406,
        7.01503103],
       [5.21105193, 4.43479282, 3.4307313 , ..., 4.41360406, 0.        ,
        7.4691448 ],
       [6.45997793, 6.2920865 , 7.27986558, ..., 7.01503103, 7.4691448 ,
        0.        ]])

Computing Sokal similarity

a,b,c,d,p = a_b_c_d_Matrix(Data_binary)

Sokal_Similarity(i=0, r=2, a=a, d=d, p=p)

1.0

Sokal_Dist(i=0, r=2, a=a, d=d, p=p)

0.0

Sokal_Dist_Matrix(Data_binary)

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.81649658, 0.81649658, 0.81649658, ..., 0.81649658, 0.81649658,
        0.        ]])
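The non-zero entries 0.81649658 in the matrix above equal sqrt(2/3). This is consistent with a Sokal-Michener similarity s = (a + d) / p, where a counts 1-1 matches, d counts 0-0 matches and p is the number of binary variables, turned into a distance with sqrt(2 (1 - s)). The sketch below assumes that similarity and that transformation; the helper names and the two binary vectors are hypothetical.

```python
import math

def sokal_similarity(x, y):
    """Sokal-Michener (simple matching) coefficient: (a + d) / p."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # positive matches
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # negative matches
    return (a + d) / len(x)

def similarity_to_distance(s):
    """Assumed transformation: d = sqrt(2 * (1 - s))."""
    return math.sqrt(2.0 * (1.0 - s))

# Hypothetical observations that differ in one of three binary variables.
x = (1, 0, 0)
y = (1, 1, 0)

s = sokal_similarity(x, y)
dist = similarity_to_distance(s)
print(s, dist)
```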

Computing Jaccard similarity

Jaccard_Similarity(i=0, r=2, a=a, d=d, p=p)

1.0

Jaccard_Dist(i=0, r=2, a=a, d=d, p=p)

0.0

Jaccard_Dist_Matrix(Data_binary)

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])
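The Jaccard similarity ignores 0-0 matches: s = a / (a + b + c), which equals a / (p - d) and therefore can be computed from a, d and p alone. The sketch below assumes that definition together with the transformation sqrt(2 (1 - s)), which is consistent with the entries equal to 1 in the matrix above. The helper names and binary vectors are hypothetical, and returning 1 when both vectors are all zeros is an assumed convention.

```python
import math

def jaccard_similarity(x, y):
    """a / (a + b + c): 0-0 matches are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # positive matches
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # negative matches
    p = len(x)
    if p == d:
        return 1.0  # assumed convention when both vectors are all zeros
    return a / (p - d)

def similarity_to_distance(s):
    """Assumed transformation: d = sqrt(2 * (1 - s))."""
    return math.sqrt(2.0 * (1.0 - s))

# Hypothetical observations that differ in one of three binary variables.
x = (1, 0, 0)
y = (1, 1, 0)

s = jaccard_similarity(x, y)
dist = similarity_to_distance(s)
print(s, dist)
```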

Computing Matching similarity

Matching_Similarity(x_i=Data_multiclass.iloc[0,:], x_r=Data_multiclass.iloc[2,:], Data=Data_multiclass)

0.3333333333333333

Matching_Dist(x_i=Data_multiclass.iloc[0,:], x_r=Data_multiclass.iloc[2,:], Data=Data_multiclass)

   1.1547005383792517
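Both values above can be reproduced by hand. Observations 0 and 2 agree on one of the three multiclass variables (quality), so the similarity is 1/3, and sqrt(2 (1 - 1/3)) gives the distance. A minimal sketch in plain Python, assuming this matching coefficient and transformation; the helper names are hypothetical and the values are read from the Data.head() output.

```python
import math

def matching_similarity(x, y):
    """Fraction of categorical variables on which x and y take the same value."""
    return sum(1 for u, v in zip(x, y) if u == v) / len(x)

def similarity_to_distance(s):
    """Assumed transformation: d = sqrt(2 * (1 - s))."""
    return math.sqrt(2.0 * (1.0 - s))

# (quality_recode, no_of_bathrooms, no_of_bedrooms) for observations 0 and 2.
obs_0 = (2, 2, 1)
obs_2 = (2, 5, 3)

s = matching_similarity(obs_0, obs_2)
dist = similarity_to_distance(s)
print(s, dist)
```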

Matching_Dist_Matrix(Data_multiclass)

array([[0.        , 0.81649658, 1.15470054, ..., 0.81649658, 1.15470054,
        1.41421356],
       [0.81649658, 0.        , 1.15470054, ..., 0.        , 1.15470054,
        1.41421356],
       [1.15470054, 1.15470054, 0.        , ..., 1.15470054, 0.81649658,
        1.15470054],
       ...,
       [0.81649658, 0.        , 1.15470054, ..., 0.        , 1.15470054,
        1.41421356],
       [1.15470054, 1.15470054, 0.81649658, ..., 1.15470054, 0.        ,
        1.15470054],
       [1.41421356, 1.41421356, 1.15470054, ..., 1.41421356, 1.15470054,
        0.        ]])

Computing Gower distance

From a theoretical perspective, we follow Gower (1971).

Gower_Similarity_Matrix(Data, p1=4, p2=3, p3=3)

array([[1.        , 0.85175283, 0.68485131, ..., 0.83008431, 0.62482353,
        0.34709882],
       [0.85175283, 1.        , 0.69489168, ..., 0.94863663, 0.63064768,
        0.35833279],
       [0.68485131, 0.69489168, 1.        , ..., 0.72293677, 0.73120218,
        0.48172501],
       ...,
       [0.83008431, 0.94863663, 0.72293677, ..., 1.        , 0.59776459,
        0.36311382],
       [0.62482353, 0.63064768, 0.73120218, ..., 0.59776459, 1.        ,
        0.55654437],
       [0.34709882, 0.35833279, 0.48172501, ..., 0.36311382, 0.55654437,
        1.        ]])

Gower_Dist_Matrix(Data, p1=4, p2=3, p3=3)

array([[0.        , 0.38502879, 0.56138105, ..., 0.41220831, 0.61251651,
        0.808023  ],
       [0.38502879, 0.        , 0.55236611, ..., 0.22663488, 0.60774363,
        0.80104133],
       [0.56138105, 0.55236611, 0.        , ..., 0.52636796, 0.51845716,
        0.71991318],
       ...,
       [0.41220831, 0.22663488, 0.52636796, ..., 0.        , 0.63422032,
        0.79805149],
       [0.61251651, 0.60774363, 0.51845716, ..., 0.63422032, 0.        ,
        0.66592464],
       [0.808023  , 0.80104133, 0.71991318, ..., 0.79805149, 0.66592464,
        0.        ]])
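The two matrices above appear to be related entry by entry through d = sqrt(1 - s). This relation is inferred from the printed values rather than from the package source, so treat it as an assumption. A short check in plain Python, using similarities copied from the first row of the similarity matrix:

```python
import math

# Entries copied from the first row of the Gower similarity matrix above.
similarities = [1.0, 0.85175283, 0.68485131]

# Assumed relation between Gower similarity and Gower distance.
distances = [math.sqrt(1.0 - s) for s in similarities]
print(distances)  # compare with the first row of the Gower distance matrix
```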

Computing Robust Mahalanobis distance

From a theoretical perspective, we follow Gnanadesikan (1997) and Devlin et al. (1975).

Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='MAD', epsilon=0.05, n_iters=20)

 2.1448247626892223

Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='trimmed', alpha=0.1, epsilon=0.05, n_iters=20)

 2.7434709885399884

Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='winsorized', alpha=0.1, epsilon=0.05, n_iters=20)

 2.8446274140577943

Robust_Mahalanobis_Dist_Matrix(Data=Data_quant, Method='trimmed', alpha=0.1, epsilon=0.05, n_iters=20)

array([[ 0.        ,  0.89250845,  2.74347099, ...,  1.48503889,
         5.95276234,  8.49453068],
       [ 0.89250845,  0.        ,  1.99959936, ...,  0.96839524,
         5.33355737,  8.32070442],
       [ 2.74347099,  1.99959936,  0.        , ...,  1.36336733,
         4.12306341,  9.38094479],
       ...,
       [ 1.48503889,  0.96839524,  1.36336733, ...,  0.        ,
         5.1322854 ,  9.00337923],
       [ 5.95276234,  5.33355737,  4.12306341, ...,  5.1322854 ,
         0.        , 11.06785954],
       [ 8.49453068,  8.32070442,  9.38094479, ...,  9.00337923,
        11.06785954,  0.        ]])

Computing Generalized Gower distance and Related Metric Scaling

To end this tutorial, we compute both the Generalized Gower distance matrix and the Related Metric Scaling matrix for the mixed data set Data. We do so for every combination of the available quantitative, binary and multiclass distances, and then store all the resulting matrices in a Python dictionary.

From a theoretical perspective we have followed Cuadras and Fortiana (1998), Albarrán et al. (2015) and Grané et al. (2021).
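Before computing anything, it can help to enumerate the combinations and see how many matrices will be produced. The sketch below only builds the names with itertools.product and does not call PyDistances; the variable names plain, robust and robust_methods are introduced here for illustration.

```python
import itertools

d1_list = ['Euclidean', 'Minkowski', 'Canberra', 'Pearson', 'Mahalanobis']
d2_list = ['Sokal', 'Jaccard']
d3_list = ['Matching']
robust_methods = ['trimmed', 'winsorized', 'MAD']

# Combinations without and with the robust Mahalanobis distance.
plain = list(itertools.product(d1_list, d2_list, d3_list))
robust = list(itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, robust_methods))

# One name per matrix: Generalized Gower (GG) first, then RelMS.
names = ['_'.join(('GG',) + c) for c in plain + robust]
names += ['_'.join(('RelMS',) + c) for c in plain + robust]

print(len(plain), len(robust), len(names))
print(names[0], names[10])
```

Each of the two families (GG and RelMS) therefore contains 10 + 6 = 16 matrices, for 32 dictionary entries in total.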

D_GG_list_maha_robust = []
D_RelMS_list_maha_robust = []
D_GG_list_not_maha_robust = []
D_RelMS_list_not_maha_robust = []

d1_list = ['Euclidean', 'Minkowski', 'Canberra', 'Pearson', 'Mahalanobis']
d2_list = ['Sokal', 'Jaccard']
d3_list = ['Matching']

# Generalized Gower distances with non-robust quantitative distances
for d in itertools.product(d1_list, d2_list, d3_list):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], q=1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
    D_GG_list_not_maha_robust.append(D)

# Generalized Gower distances with the robust Mahalanobis distance
for d in itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, ['trimmed', 'winsorized', 'MAD']):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], epsilon=0.05, Method=d[3], alpha=0.1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
    D_GG_list_maha_robust.append(D)

# Related Metric Scaling with non-robust quantitative distances
for d in itertools.product(d1_list, d2_list, d3_list):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], q=1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True, tol=0.009, d=2)
    D_RelMS_list_not_maha_robust.append(D)

# Related Metric Scaling with the robust Mahalanobis distance
for d in itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, ['trimmed', 'winsorized', 'MAD']):
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], epsilon=0.05, Method=d[3], alpha=0.1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True, tol=0.009, d=2)
    D_RelMS_list_maha_robust.append(D)

D_GG_list = D_GG_list_not_maha_robust + D_GG_list_maha_robust
D_RelMS_list = D_RelMS_list_not_maha_robust + D_RelMS_list_maha_robust

search_space = D_GG_list + D_RelMS_list

# Names follow the same order as search_space: GG (non-robust, robust), then RelMS (non-robust, robust)
names_not_maha_robust = ['_'.join(x) for x in itertools.product(d1_list, d2_list, d3_list)]
names_maha_robust = ['_'.join(x) for x in itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, ['trimmed', 'winsorized', 'MAD'])]
distance_names = ['GG_' + x for x in names_not_maha_robust + names_maha_robust] + ['RelMS_' + x for x in names_not_maha_robust + names_maha_robust]

dic_distance_matrix = dict(zip(distance_names, search_space))

dic_distance_matrix

{'GG_Euclidean_Sokal_Matching': array([[0.        , 1.01161446, 1.60800698, ..., 1.23798333, 1.92432848,
         6.35838514],
        [1.01161446, 0.        , 1.64229596, ..., 0.7889253 , 1.87696727,
         6.29319748],
        [1.60800698, 1.64229596, 0.        , ..., 1.42723912, 2.26882579,
         6.96673669],
        ...,
        [1.23798333, 0.7889253 , 1.42723912, ..., 0.        , 2.4635748 ,
         7.01727531],
        [1.92432848, 1.87696727, 2.26882579, ..., 2.4635748 , 0.        ,
         5.11270638],
        [6.35838514, 6.29319748, 6.96673669, ..., 7.01727531, 5.11270638,
         0.        ]]),
 'GG_Euclidean_Jaccard_Matching': array([[0.        , 1.01161446, 1.60800698, ..., 1.23798333, 1.92432848,
         6.21923207],
        [1.01161446, 0.        , 1.64229596, ..., 0.7889253 , 1.87696727,
         6.15257024],
        [1.60800698, 1.64229596, 0.        , ..., 1.42723912, 2.26882579,
         6.83997121],
        ...,
        [1.23798333, 0.7889253 , 1.42723912, ..., 0.        , 2.4635748 ,
         6.89143953],
        [1.92432848, 1.87696727, 2.26882579, ..., 2.4635748 , 0.        ,
         4.93857798],
        [6.21923207, 6.15257024, 6.83997121, ..., 6.89143953, 4.93857798,
         0.        ]]),
 'GG_Minkowski_Sokal_Matching': array([[0.        , 1.01161589, 1.60801451, ..., 1.23797549, 1.92440501,
         6.35838512],
        [1.01161589, 0.        , 1.64229192, ..., 0.78891568, 1.87702827,
         6.29317915],
        [1.60801451, 1.64229192, 0.        , ..., 1.42723962, 2.2688732 ,
         6.96667937],
        ...,
        [1.23797549, 0.78891568, 1.42723962, ..., 0.        , 2.46364348,
         7.01724763],
        [1.92440501, 1.87702827, 2.2688732 , ..., 2.46364348, 0.        ,
         5.11260609],
        [6.35838512, 6.29317915, 6.96667937, ..., 7.01724763, 5.11260609,
         0.        ]]),
 'GG_Minkowski_Jaccard_Matching': array([[0.        , 1.01161589, 1.60801451, ..., 1.23797549, 1.92440501,
         6.21923205],
        [1.01161589, 0.        , 1.64229192, ..., 0.78891568, 1.87702827,
         6.15255149],
        [1.60801451, 1.64229192, 0.        , ..., 1.42723962, 2.2688732 ,
         6.83991282],
        ...,
        [1.23797549, 0.78891568, 1.42723962, ..., 0.        , 2.46364348,
         6.89141134],
        [1.92440501, 1.87702827, 2.2688732 , ..., 2.46364348, 0.        ,
         4.93847416],
        [6.21923205, 6.15255149, 6.83991282, ..., 6.89141134, 4.93847416,
         0.        ]]),
 'GG_Canberra_Sokal_Matching': array([[0.        , 1.1089173 , 2.04873576, ..., 1.41070641, 2.47064802,
         3.88007815],
        [1.1089173 , 0.        , 1.81887649, ..., 1.10728448, 2.20656591,
         3.66760203],
        [2.04873576, 1.81887649, 0.        , ..., 1.51266848, 2.44536222,
         3.67890583],
        ...,
        [1.41070641, 1.10728448, 1.51266848, ..., 0.        , 2.92569072,
         4.05431191],
        [2.47064802, 2.20656591, 2.44536222, ..., 2.92569072, 0.        ,
         2.67423498],
        [3.88007815, 3.66760203, 3.67890583, ..., 4.05431191, 2.67423498,
         0.        ]]),
 'GG_Canberra_Jaccard_Matching': array([[0.        , 1.1089173 , 2.04873576, ..., 1.41070641, 2.47064802,
         3.64757349],
        [1.1089173 , 0.        , 1.81887649, ..., 1.10728448, 2.20656591,
         3.42068569],
        [2.04873576, 1.81887649, 0.        , ..., 1.51266848, 2.44536222,
         3.43280265],
        ...,
        [1.41070641, 1.10728448, 1.51266848, ..., 0.        , 2.92569072,
         3.83239234],
        [2.47064802, 2.20656591, 2.44536222, ..., 2.92569072, 0.        ,
         2.32407372],
        [3.64757349, 3.42068569, 3.43280265, ..., 3.83239234, 2.32407372,
         0.        ]]),
 'GG_Pearson_Sokal_Matching': array([[0.        , 1.0588577 , 1.62258227, ..., 1.13386485, 2.59878376,
         4.5833716 ],
        [1.0588577 , 0.        , 1.54980561, ..., 0.55073019, 2.36782324,
         4.41160916],
        [1.62258227, 1.54980561, 0.        , ..., 1.48883715, 2.15643298,
         4.46893998],
        ...,
        [1.13386485, 0.55073019, 1.48883715, ..., 0.        , 2.64592015,
         4.75194328],
        [2.59878376, 2.36782324, 2.15643298, ..., 2.64592015, 0.        ,
         3.34753806],
        [4.5833716 , 4.41160916, 4.46893998, ..., 4.75194328, 3.34753806,
         0.        ]]),
 'GG_Pearson_Jaccard_Matching': array([[0.        , 1.0588577 , 1.62258227, ..., 1.13386485, 2.59878376,
         4.38828909],
        [1.0588577 , 0.        , 1.54980561, ..., 0.55073019, 2.36782324,
         4.20857237],
        [1.62258227, 1.54980561, 0.        , ..., 1.48883715, 2.15643298,
         4.26863098],
        ...,
        [1.13386485, 0.55073019, 1.48883715, ..., 0.        , 2.64592015,
         4.56407174],
        [2.59878376, 2.36782324, 2.15643298, ..., 2.64592015, 0.        ,
         3.07502796],
        [4.38828909, 4.20857237, 4.26863098, ..., 4.56407174, 3.07502796,
         0.        ]]),
 'GG_Mahalanobis_Sokal_Matching': array([[0.        , 1.11128701, 1.9908619 , ..., 1.26642065, 2.97833241,
         4.17851469],
        [1.11128701, 0.        , 1.73337267, ..., 0.49510815, 2.64311668,
         4.11353573],
        [1.9908619 , 1.73337267, 0.        , ..., 1.5815777 , 1.99507289,
         4.39053781],
        ...,
        [1.26642065, 0.49510815, 1.5815777 , ..., 0.        , 2.63417571,
         4.3979867 ],
        [2.97833241, 2.64311668, 1.99507289, ..., 2.63417571, 0.        ,
         4.4698317 ],
        [4.17851469, 4.11353573, 4.39053781, ..., 4.3979867 , 4.4698317 ,
         0.        ]]),
 'GG_Mahalanobis_Jaccard_Matching': array([[0.        , 1.11128701, 1.9908619 , ..., 1.26642065, 2.97833241,
         3.96355535],
        [1.11128701, 0.        , 1.73337267, ..., 0.49510815, 2.64311668,
         3.89499193],
        [1.9908619 , 1.73337267, 0.        , ..., 1.5815777 , 1.99507289,
         4.18647921],
        ...,
        [1.26642065, 0.49510815, 1.5815777 , ..., 0.        , 2.63417571,
         4.19429052],
        [2.97833241, 2.64311668, 1.99507289, ..., 2.63417571, 0.        ,
         4.26956454],
        [3.96355535, 3.89499193, 4.18647921, ..., 4.19429052, 4.26956454,
         0.        ]]),
 'GG_Robust_Mahalanobis_Sokal_Matching_trimmed': array([[0.        , 1.0738818 , 1.81990287, ..., 1.17982158, 2.83584093,
         4.38026385],
        [1.0738818 , 0.        , 1.64744788, ..., 0.39866732, 2.61869851,
         4.3233478 ],
        [1.81990287, 1.64744788, 0.        , ..., 1.53344794, 1.97466567,
         4.56660697],
        ...,
        [1.17982158, 0.39866732, 1.53344794, ..., 0.        , 2.54962302,
         4.5492545 ],
        [2.83584093, 2.61869851, 1.97466567, ..., 2.54962302, 0.        ,
         5.16721825],
        [4.38026385, 4.3233478 , 4.56660697, ..., 4.5492545 , 5.16721825,
         0.        ]]),
 'GG_Robust_Mahalanobis_Sokal_Matching_winsorized': array([[0.        , 1.10035027, 1.96521318, ..., 1.24876507, 3.02193061,
         4.2158267 ],
        [1.10035027, 0.        , 1.72244788, ..., 0.45786845, 2.71169847,
         4.170886  ],
        [1.96521318, 1.72244788, 0.        , ..., 1.57396145, 2.01907767,
         4.45138733],
        ...,
        [1.24876507, 0.45786845, 1.57396145, ..., 0.        , 2.6589383 ,
         4.42575055],
        [3.02193061, 2.71169847, 2.01907767, ..., 2.6589383 , 0.        ,
         4.74960743],
        [4.2158267 , 4.170886  , 4.45138733, ..., 4.42575055, 4.74960743,
         0.        ]]),
 'GG_Robust_Mahalanobis_Sokal_Matching_MAD': array([[0.        , 1.09006233, 1.80375514, ..., 1.18201607, 2.67497233,
         4.55678538],
        [1.09006233, 0.        , 1.62058379, ..., 0.44488228, 2.40606721,
         4.40232615],
        [1.80375514, 1.62058379, 0.        , ..., 1.53278692, 1.93813141,
         4.46679441],
        ...,
        [1.18201607, 0.44488228, 1.53278692, ..., 0.        , 2.48916367,
         4.64371521],
        [2.67497233, 2.40606721, 1.93813141, ..., 2.48916367, 0.        ,
         4.16671594],
        [4.55678538, 4.40232615, 4.46679441, ..., 4.64371521, 4.16671594,
         0.        ]]),
 'GG_Robust_Mahalanobis_Jaccard_Matching_trimmed': array([[0.        , 1.0738818 , 1.81990287, ..., 1.17982158, 2.83584093,
         4.17570322],
        [1.0738818 , 0.        , 1.64744788, ..., 0.39866732, 2.61869851,
         4.11595944],
        [1.81990287, 1.64744788, 0.        , ..., 1.53344794, 1.97466567,
         4.37077626],
        ...,
        [1.17982158, 0.39866732, 1.53344794, ..., 0.        , 2.54962302,
         4.35264315],
        [2.83584093, 2.61869851, 1.97466567, ..., 2.54962302, 0.        ,
         4.99499053],
        [4.17570322, 4.11595944, 4.37077626, ..., 4.35264315, 4.99499053,
         0.        ]]),
 'GG_Robust_Mahalanobis_Jaccard_Matching_winsorized': array([[0.        , 1.10035027, 1.96521318, ..., 1.24876507, 3.02193061,
         4.00287155],
        [1.10035027, 0.        , 1.72244788, ..., 0.45786845, 2.71169847,
         3.95551209],
        [1.96521318, 1.72244788, 0.        , ..., 1.57396145, 2.01907767,
         4.25025118],
        ...,
        [1.24876507, 0.45786845, 1.57396145, ..., 0.        , 2.6589383 ,
         4.22339365],
        [3.02193061, 2.71169847, 2.01907767, ..., 2.6589383 , 0.        ,
         4.5616397 ],
        [4.00287155, 3.95551209, 4.25025118, ..., 4.22339365, 4.5616397 ,
         0.        ]]),
 'GG_Robust_Mahalanobis_Jaccard_Matching_MAD': array([[0.        , 1.09006233, 1.80375514, ..., 1.18201607, 2.67497233,
         4.36051361],
        [1.09006233, 0.        , 1.62058379, ..., 0.44488228, 2.40606721,
         4.19884049],
        [1.80375514, 1.62058379, 0.        , ..., 1.53278692, 1.93813141,
         4.26638468],
        ...,
        [1.18201607, 0.44488228, 1.53278692, ..., 0.        , 2.48916367,
         4.45127812],
        [2.67497233, 2.40606721, 1.93813141, ..., 2.48916367, 0.        ,
         3.95111474],
        [4.36051361, 4.19884049, 4.26638468, ..., 4.45127812, 3.95111474,
         0.        ]]),
 'RelMS_Euclidean_Sokal_Matching': array([[0.        , 1.01092438, 1.68587263, ..., 1.2435966 , 1.75479379,
         5.76354972],
        [1.01092436, 0.        , 1.72123768, ..., 0.78892531, 1.71977376,
         5.69924943],
        [1.68587264, 1.7212377 , 0.        , ..., 1.42997022, 2.20660915,
         6.5504967 ],
        ...,
        [1.24359658, 0.78892532, 1.42997021, ..., 0.        , 2.26671431,
         6.42377887],
        [1.7547938 , 1.71977375, 2.20660914, ..., 2.26671431, 0.        ,
         4.781135  ],
        [5.76354972, 5.69924943, 6.55049671, ..., 6.42377887, 4.78113499,
         0.        ]]),
 'RelMS_Euclidean_Jaccard_Matching': array([[0.        , 1.01092435, 1.68587263, ..., 1.24359659, 1.75479381,
         5.73873464],
        [1.01092437, 0.        , 1.72123769, ..., 0.78892532, 1.71977378,
         5.67208311],
        [1.68587264, 1.72123769, 0.        , ..., 1.42997021, 2.20660914,
         6.53309456],
        ...,
        [1.24359658, 0.78892529, 1.42997021, ..., 0.        , 2.26671431,
         6.41402297],
        [1.7547938 , 1.71977375, 2.20660914, ..., 2.2667143 , 0.        ,
         4.6957284 ],
        [5.73873463, 5.67208312, 6.53309457, ..., 6.41402297, 4.69572838,
         0.        ]]),
 'RelMS_Minkowski_Sokal_Matching': array([[0.        , 1.0104344 , 1.68473307, ..., 1.24302039, 1.75451827,
         5.7636572 ],
        [1.01043437, 0.        , 1.72039524, ..., 0.78891568, 1.71978231,
         5.69946617],
        [1.68473308, 1.72039525, 0.        , ..., 1.42922921, 2.20651554,
         6.55109162],
        ...,
        [1.24302037, 0.7889157 , 1.4292292 , ..., 0.        , 2.2667207 ,
         6.42402052],
        [1.75451827, 1.71978229, 2.20651553, ..., 2.2667207 , 0.        ,
         4.78235997],
        [5.7636572 , 5.69946616, 6.55109161, ..., 6.42402052, 4.78235997,
         0.        ]]),
 'RelMS_Minkowski_Jaccard_Matching': array([[0.        , 1.01043437, 1.68473307, ..., 1.24302038, 1.75451828,
         5.73875343],
        [1.01043439, 0.        , 1.72039525, ..., 0.78891569, 1.71978232,
         5.67221733],
        [1.68473307, 1.72039524, 0.        , ..., 1.4292292 , 2.20651553,
         6.5336026 ],
        ...,
        [1.24302038, 0.78891568, 1.4292292 , ..., 0.        , 2.2667207 ,
         6.41417732],
        [1.75451828, 1.7197823 , 2.20651553, ..., 2.2667207 , 0.        ,
         4.6969009 ],
        [5.73875342, 5.67221732, 6.5336026 , ..., 6.41417732, 4.6969009 ,
         0.        ]]),
 'RelMS_Canberra_Sokal_Matching': array([[0.        , 3.29475825, 3.63767326, ..., 3.42002989, 3.78234978,
         4.28387746],
        [3.29475817, 0.        , 3.54627477, ..., 3.36365755, 3.64707779,
         4.11290306],
        [3.63767327, 3.5462748 , 0.        , ..., 3.36371231, 3.88636668,
         4.26421609],
        ...,
        [3.42002989, 3.36365756, 3.36371231, ..., 0.        , 4.08835735,
         4.43146723],
        [3.78234979, 3.64707779, 3.88636667, ..., 4.08835736, 0.        ,
         3.55682862],
        [4.28387745, 4.11290305, 4.26421607, ..., 4.43146723, 3.55682862,
         0.        ]]),
 'RelMS_Canberra_Jaccard_Matching': array([[0.        , 3.29475816, 3.63767325, ..., 3.42002988, 3.7823498 ,
         4.18398249],
        [3.29475818, 0.        , 3.54627479, ..., 3.36365756, 3.64707782,
         4.00084943],
        [3.63767326, 3.54627478, 0.        , ..., 3.36371229, 3.88636666,
         4.15092751],
        ...,
        [3.42002988, 3.36365755, 3.36371228, ..., 0.        , 4.08835736,
         4.3378168 ],
        [3.78234979, 3.64707778, 3.88636666, ..., 4.08835735, 0.        ,
         3.36218137],
        [4.18398248, 4.00084941, 4.15092752, ..., 4.3378168 , 3.36218137,
         0.        ]]),
 'RelMS_Pearson_Sokal_Matching': array([[0.        , 1.04250916, 1.57029271, ..., 1.11835441, 2.35030151,
         3.99961285],
        [1.04250913, 0.        , 1.55642417, ..., 0.55073019, 2.17276224,
         3.83629275],
        [1.5702927 , 1.55642418, 0.        , ..., 1.44481248, 2.11094744,
         4.05200057],
        ...,
        [1.11835439, 0.55073021, 1.44481248, ..., 0.        , 2.43447697,
         4.16544183],
        [2.35030151, 2.17276223, 2.11094745, ..., 2.43447697, 0.        ,
         3.00502738],
        [3.99961283, 3.83629274, 4.05200056, ..., 4.16544183, 3.00502738,
         0.        ]]),
 'RelMS_Pearson_Jaccard_Matching': array([[0.        , 1.04250913, 1.57029271, ..., 1.11835441, 2.35030152,
         3.89789603],
        [1.04250915, 0.        , 1.55642418, ..., 0.55073023, 2.17276226,
         3.72479069],
        [1.5702927 , 1.55642415, 0.        , ..., 1.44481247, 2.11094744,
         3.94329467],
        ...,
        [1.11835439, 0.55073016, 1.44481248, ..., 0.        , 2.43447698,
         4.07654071],
        [2.35030152, 2.17276223, 2.11094745, ..., 2.43447697, 0.        ,
         2.77842982],
        [3.89789601, 3.72479067, 3.94329467, ..., 4.0765407 , 2.77842982,
         0.        ]]),
 'RelMS_Mahalanobis_Sokal_Matching': array([[0.        , 1.0872495 , 1.91566724, ..., 1.23718333, 2.78694322,
         3.59368169],
        [1.08724948, 0.        , 1.72190382, ..., 0.49510814, 2.51013925,
         3.52430362],
        [1.91566725, 1.72190383, 0.        , ..., 1.53860587, 1.97114821,
         3.91897956],
        ...,
        [1.23718333, 0.49510818, 1.53860586, ..., 0.        , 2.47401146,
         3.7944967 ],
        [2.78694323, 2.51013924, 1.97114821, ..., 2.47401146, 0.        ,
         4.10401609],
        [3.59368167, 3.52430361, 3.91897955, ..., 3.7944967 , 4.10401609,
         0.        ]]),
 'RelMS_Mahalanobis_Jaccard_Matching': array([[0.        , 1.08724947, 1.91566724, ..., 1.23718333, 2.78694323,
         3.46907215],
        [1.0872495 , 0.        , 1.72190383, ..., 0.49510817, 2.51013926,
         3.39550188],
        [1.91566724, 1.72190381, 0.        , ..., 1.53860586, 1.97114821,
         3.80535063],
        ...,
        [1.23718333, 0.49510812, 1.53860586, ..., 0.        , 2.47401147,
         3.68911387],
        [2.78694323, 2.51013924, 1.97114821, ..., 2.47401147, 0.        ,
         3.96214705],
        [3.46907213, 3.39550187, 3.80535063, ..., 3.68911387, 3.96214705,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Sokal_Matching_trimmed': array([[0.        , 1.05396495, 1.74951184, ..., 1.15390312, 2.67058462,
         3.82780883],
        [1.05396493, 0.        , 1.63479812, ..., 0.39866731, 2.51224528,
         3.76362714],
        [1.74951185, 1.63479814, 0.        , ..., 1.49657109, 1.961588  ,
         4.09825745],
        ...,
        [1.15390311, 0.39866735, 1.49657109, ..., 0.        , 2.41854434,
         3.97375586],
        [2.67058463, 2.51224527, 1.961588  , ..., 2.41854434, 0.        ,
         4.81269468],
        [3.82780882, 3.76362713, 4.09825744, ..., 3.97375586, 4.81269468,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Sokal_Matching_winsorized': array([[0.        , 1.07688717, 1.88851059, ..., 1.21940102, 2.83800382,
         3.64003684],
        [1.07688713, 0.        , 1.70819251, ..., 0.45786842, 2.58662722,
         3.59029333],
        [1.8885106 , 1.70819253, 0.        , ..., 1.53220354, 1.99808026,
         3.97860895],
        ...,
        [1.21940101, 0.45786849, 1.53220353, ..., 0.        , 2.50787408,
         3.829693  ],
        [2.83800382, 2.58662721, 1.99808026, ..., 2.50787408, 0.        ,
         4.38739858],
        [3.64003683, 3.59029333, 3.97860894, ..., 3.829693  , 4.38739858,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Sokal_Matching_MAD': array([[0.        , 1.06915308, 1.73228661, ..., 1.15789936, 2.45834684,
         3.97049139],
        [1.06915305, 0.        , 1.61195487, ..., 0.44488227, 2.24973009,
         3.81621214],
        [1.73228661, 1.61195488, 0.        , ..., 1.4894837 , 1.90536576,
         4.00431571],
        ...,
        [1.15789934, 0.44488231, 1.4894837 , ..., 0.        , 2.30824179,
         4.04102682],
        [2.45834685, 2.24973009, 1.90536577, ..., 2.30824178, 0.        ,
         3.79967402],
        [3.97049139, 3.81621213, 4.0043157 , ..., 4.04102682, 3.79967402,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Jaccard_Matching_trimmed': array([[0.        , 1.05396492, 1.74951184, ..., 1.15390312, 2.67058463,
         3.7103996 ],
        [1.05396495, 0.        , 1.63479813, ..., 0.39866734, 2.51224529,
         3.64245313],
        [1.74951185, 1.63479812, 0.        , ..., 1.49657109, 1.961588  ,
         3.98729219],
        ...,
        [1.15390311, 0.39866728, 1.49657109, ..., 0.        , 2.41854435,
         3.87035377],
        [2.67058464, 2.51224527, 1.961588  , ..., 2.41854434, 0.        ,
         4.69932707],
        [3.71039959, 3.64245311, 3.9872922 , ..., 3.87035377, 4.69932707,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Jaccard_Matching_winsorized': array([[0.        , 1.07688714, 1.88851059, ..., 1.21940102, 2.83800383,
         3.51619033],
        [1.07688715, 0.        , 1.70819252, ..., 0.45786846, 2.58662723,
         3.46347473],
        [1.88851059, 1.70819251, 0.        , ..., 1.53220354, 1.99808026,
         3.86606614],
        ...,
        [1.219401  , 0.45786843, 1.53220353, ..., 0.        , 2.50787409,
         3.72394257],
        [2.83800382, 2.58662721, 1.99808026, ..., 2.50787408, 0.        ,
         4.25828147],
        [3.51619032, 3.46347472, 3.86606614, ..., 3.72394256, 4.25828147,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Jaccard_Matching_MAD': array([[0.        , 1.06915304, 1.73228661, ..., 1.15789935, 2.45834686,
         3.86694579],
        [1.06915307, 0.        , 1.61195488, ..., 0.4448823 , 2.24973011,
         3.7045599 ],
        [1.7322866 , 1.61195486, 0.        , ..., 1.48948369, 1.90536575,
         3.89571711],
        ...,
        [1.15789934, 0.44488225, 1.48948369, ..., 0.        , 2.30824179,
         3.9478467 ],
        [2.45834686, 2.24973009, 1.90536576, ..., 2.30824179, 0.        ,
         3.64285626],
        [3.86694578, 3.70455988, 3.8957171 , ..., 3.9478467 , 3.64285626,
         0.        ]])}

Computational Cost Testing

Here we use the entire House_Price.csv dataset (1905 rows) to measure the running time of the new distances included in PyDistances: the Generalized Gower distance, computed both without and with Related Metric Scaling.

import pandas as pd

Data = pd.read_csv('House_Price.csv')
Data = Data.loc[:, ['latitude', 'longitude', 'price', 'size_in_m_2', 'balcony_recode', 'private_garden_recode', 'private_gym_recode', 'quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]

Data.shape

(1905, 10)

Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='trimmed', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)

# Time: 1.11 minutes.

Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='winsorized', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)

# Time: 1.15 minutes.

Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='MAD', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)

# Time: 1.12 minutes.

Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='trimmed', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)

# Time: 1.58 minutes.

Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='winsorized', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)

# Time: 1.53 minutes.

Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='MAD', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)

# Time: 1.55 minutes.
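The timings in the comments above are reported without the measuring code. A minimal sketch of one way to collect such timings is shown below. The helper `time_call` and the stand-in workload `toy_distance_matrix` are hypothetical names introduced only for illustration; they are not part of PyDistances.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def toy_distance_matrix(n):
    """Stand-in workload (assumption): pairwise Euclidean distances of n 2-D points."""
    points = [(i * 0.5, i * 0.25) for i in range(n)]
    return [
        [((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 for b in points]
        for a in points
    ]


# Time the stand-in workload and report it in the same style as the tutorial.
matrix, seconds = time_call(toy_distance_matrix, 200)
print(f"# Time: {seconds / 60:.2f} minutes ({seconds:.2f} seconds).")
```

In the tutorial setting, the same helper could wrap the real call, for example `time_call(Generalized_Gower_Distance_init.compute, Related_Metric_Scaling=True)`. Measured times depend on the machine, so they will not match the figures above exactly.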

For comparison, this is the time needed to compute the (simple) Gower distance on the same data.

Gower_Dist_Matrix(Data, p1=4, p2=3, p3=3)

# Time: 38 seconds.
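To put the reported figures on a common scale, the short sketch below converts the Generalized Gower timings to seconds and divides them by the 38 seconds reported for the simple Gower distance. It uses only the numbers listed in this section; the variable names are illustrative.

```python
# Timings reported above, in minutes, keyed by (Method, Related_Metric_Scaling).
generalized_gower_minutes = {
    ("trimmed", False): 1.11,
    ("winsorized", False): 1.15,
    ("MAD", False): 1.12,
    ("trimmed", True): 1.58,
    ("winsorized", True): 1.53,
    ("MAD", True): 1.55,
}

# Time reported above for Gower_Dist_Matrix, in seconds.
gower_seconds = 38.0

# Ratio of each Generalized Gower timing to the simple Gower timing.
ratios = {
    key: (minutes * 60.0) / gower_seconds
    for key, minutes in generalized_gower_minutes.items()
}

for (method, relms), ratio in ratios.items():
    print(f"Method={method}, Related_Metric_Scaling={relms}: {ratio:.2f}x simple Gower")
```

On these figures, the Generalized Gower distance takes roughly 1.8 times as long as the simple Gower distance without Related Metric Scaling, and roughly 2.4 to 2.5 times as long with it.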

Bibliography

Albarrán, I., P. Alonso, and A. Grané. “Profile Identification via Weighted Related Metric Scaling: An Application to Dependent Spanish Children.” Journal of the Royal Statistical Society. Series A, Statistics in Society 178, no. 3 (2015): 593–618. https://doi.org/10.1111/rssa.12084.

Cuadras, C. M., and J. Fortiana. “Chapter 25 - Visualizing Categorical Data with Related Metric Scaling.” In Visualization of Categorical Data, 365–76. Academic Press, 1998. https://doi.org/10.1016/B978-012299045-8/50028-0.

Devlin, S. J., R. Gnanadesikan, and J. R. Kettenring. “Robust Estimation and Outlier Detection with Correlation Coefficients.” Biometrika 62, no. 3 (1975): 531–45. https://doi.org/10.1093/biomet/62.3.531.

Grané, A., G. Manzi, and S. Salini. “Smart Visualization of Mixed Data.” Stats 4, no. 2 (2021): 472–85. https://doi.org/10.3390/stats4020029.

Gower, J. C. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics 27, no. 4 (1971): 857–71. https://doi.org/10.2307/2528823.

Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations. 2nd ed. New York: John Wiley and Sons, 1997.

PyDistances
Release 0.0.16
