dredge

User-friendly thresholded subspace-constrained mean shift for geospatial data


Keywords
density, ridges, spatial, analysis, principal, curves, route, optimization, hot, spot
License
MIT
Install
pip install dredge==1.0.0

Documentation

DREDGE, short for Density Ridge Estimation Describing Geospatial Evidence (arguably an unnecessarily forced acronym), is a tool for finding density ridges in latitude-longitude coordinates, based on the subspace-constrained mean shift (SCMS) algorithm introduced by Ozertem and Erdogmus (2011). The tool approximates principal curves for a given set of coordinates and features several improvements over the initial algorithm, along with alterations that facilitate its application to geospatial data. Thresholding, as described in cosmological research by Chen et al. (2015) and Chen et al. (2015), avoids dominant density ridges in sparsely populated areas of the dataset. In addition, the haversine formula is used as the distance metric to calculate great-circle distances, which accounts for the Earth's curvature and makes the tool applicable not only to city-scale data but also to datasets spanning multiple countries.
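To illustrate the distance metric mentioned above, here is a minimal sketch of the haversine formula in plain Python. The function name `haversine_km` and the Earth radius constant are assumptions made for this example; they are not taken from DREDGE's source, which may use different units or a different implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not a DREDGE constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Example: approximate distance between central Chicago and central New York.
print(haversine_km(41.88, -87.63, 40.71, -74.01))
```

Unlike a Euclidean distance on raw degrees, this stays meaningful at any latitude, which is what allows the same metric to be used for both city-scale and multi-country datasets.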

In essence, DREDGE provides points along density-based lines that optimize the distance between those lines and the input coordinates. Larger bandwidths lead to a shorter total line length and a larger average distance from the data to the nearest line. Since DREDGE was initially developed for crime incident data, the default bandwidth calculation follows a best-practice approach that is well accepted within quantitative criminology, using the mean distance to a given number of nearest neighbors (Williamson et al., 1999). Because practitioners in that field are often interested in the highest-density regions of a dataset, the tool can also restrict the ridge points to a specified top-percentage level of a kernel density estimate.
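The nearest-neighbor bandwidth heuristic described above can be sketched as follows. This is an illustration under stated assumptions, not DREDGE's own code: the helper names `pairwise_haversine_km` and `mean_knn_bandwidth` are hypothetical, distances are expressed in kilometres, and the full pairwise distance matrix is computed for simplicity.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def pairwise_haversine_km(coordinates):
    """Return the (n, n) matrix of great-circle distances for (lat, lon) rows in degrees."""
    lat = np.radians(coordinates[:, 0])
    lon = np.radians(coordinates[:, 1])
    d_lat = lat[:, None] - lat[None, :]
    d_lon = lon[:, None] - lon[None, :]
    a = (np.sin(d_lat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(d_lon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def mean_knn_bandwidth(coordinates, neighbors=10):
    """Mean distance from each point to its `neighbors` nearest neighbours."""
    coordinates = np.asarray(coordinates, dtype=float)
    if len(coordinates) < 2:
        raise ValueError("at least two points are needed")
    distances = pairwise_haversine_km(coordinates)
    k = min(neighbors, len(coordinates) - 1)
    # After sorting each row, column 0 is the point's zero distance to itself.
    nearest = np.sort(distances, axis=1)[:, 1:k + 1]
    return float(nearest.mean())


# Hypothetical sample of points in the Chicago area.
sample = np.array([[41.88, -87.63], [41.85, -87.65], [41.90, -87.62], [41.79, -87.60]])
print(mean_knn_bandwidth(sample, neighbors=2))
```

The pairwise matrix needs memory quadratic in the number of points, so for large datasets a spatial index would be the more practical choice.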

Installation

DREDGE can be installed via PyPI, with a single command in the terminal:

pip install dredge

Alternatively, the file dredge.py can be downloaded from the folder dredge in this repository and used locally by placing it in the working directory of a given project. Installation via the terminal is, however, highly recommended, as the installation process checks the package requirements and automatically updates or installs any missing dependencies, sparing the user the effort of troubleshooting and installing them manually.

Quickstart guide

DREDGE requires only a two-column NumPy array as its primary input (coordinates), with one data point per row and latitude and longitude values in the columns. Four optional parameters can also be set: the number of nearest neighbors (neighbors) used to automatically calculate an optimal bandwidth, the bandwidth (bandwidth) itself, which can be forced to a given value, and the threshold used to check for convergence between iterations (convergence). The fourth parameter (percentage) unlocks additional functionality, as practitioners are often interested only in high-density areas. For a user-provided percentage value p, the tool's internal kernel density estimate is used to retain only ridge points above the (100 - p)th percentile of the provided dataset's density landscape. This allows, for example, route matching to be focused on these areas.
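The percentile-based filtering described above can be sketched as below. This is a simplified illustration rather than DREDGE's internal code: the function names are hypothetical, the kernel is an unnormalised Gaussian, and distances are Euclidean on the raw coordinates instead of haversine.

```python
import numpy as np


def gaussian_kde_values(points, data, bandwidth):
    """Unnormalised Gaussian kernel density of `data`, evaluated at each row of `points`."""
    diff = points[:, None, :] - data[None, :, :]
    squared_distance = np.sum(diff ** 2, axis=2)
    return np.exp(-0.5 * squared_distance / bandwidth ** 2).mean(axis=1)


def keep_top_percentage(ridge_points, data, bandwidth, percentage):
    """Keep ridge points whose density exceeds the (100 - percentage)th percentile."""
    ridge_points = np.asarray(ridge_points, dtype=float)
    data = np.asarray(data, dtype=float)
    # The cutoff is taken from the density landscape of the dataset itself.
    data_density = gaussian_kde_values(data, data, bandwidth)
    cutoff = np.percentile(data_density, 100 - percentage)
    ridge_density = gaussian_kde_values(ridge_points, data, bandwidth)
    return ridge_points[ridge_density > cutoff]


# Hypothetical data: a tight cluster plus a few scattered points.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [9.0, 1.0]])
ridges = np.array([[0.05, 0.05], [7.0, 3.0]])
print(keep_top_percentage(ridges, data, bandwidth=0.5, percentage=50))
```

With percentage = 5, the cutoff is the 95th percentile of the data's density values, so only ridge points in the densest regions remain.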



| Variable | Explanation | Default |
|---|---|---|
| coordinates | The spatial data as latitude-longitude coordinates | |
| neighbors (optional) | The number of nearest neighbors to get a bandwidth | 10 |
| bandwidth (optional) | The bandwidth used for kernel density estimates | None |
| convergence (optional) | The threshold used for inter-iteration convergence | 0.01 |
| percentage (optional) | The aimed-for percentage of highest-density ridges | None |
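As a sketch of how the coordinates input could be prepared, the snippet below stacks separate latitude and longitude sequences into the two-column layout described above and runs a few sanity checks. The helper `to_coordinate_array` is hypothetical and not part of DREDGE; the checks are assumptions about what is useful to validate before passing data on.

```python
import numpy as np


def to_coordinate_array(latitudes, longitudes):
    """Stack latitude and longitude sequences into an (n, 2) float array."""
    coordinates = np.column_stack([latitudes, longitudes]).astype(float)
    if coordinates.ndim != 2 or coordinates.shape[1] != 2:
        raise ValueError("expected one (latitude, longitude) pair per row")
    if np.isnan(coordinates).any():
        raise ValueError("coordinates contain missing values")
    # An out-of-range value often indicates swapped latitude and longitude columns.
    if (np.abs(coordinates[:, 0]) > 90).any() or (np.abs(coordinates[:, 1]) > 180).any():
        raise ValueError("latitude must lie in [-90, 90] and longitude in [-180, 180]")
    return coordinates


# Hypothetical points in the Chicago area: latitude first, longitude second.
your_coordinates = to_coordinate_array([41.88, 41.85, 41.90], [-87.63, -87.65, -87.62])
print(your_coordinates.shape)
```

The range check only catches swapped columns when a value falls outside the valid latitude range, so it is a safeguard rather than a guarantee.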



After the installation via PyPI, or using the dredge.py file locally, the usage looks like this:

from dredge import filaments

filaments(coordinates=your_coordinates,
          percentage=5)

As an example, for UCR-defined Part I crimes from 2018 taken from the Chicago Data Portal, calling the filaments function without the percentage parameter results in the ridges shown in red in the left-hand figure below, with 5,000 sampled crime incidents from that period depicted in cyan. Setting percentage to 5, as in the call above, retains only ridge points in regions above the 95th percentile of a kernel density estimate over the coordinates and results in the right-hand figure.

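[Figure: two panels showing DREDGE ridges in red over sampled Chicago crime incidents in cyan; the right-hand panel is restricted to regions above the 95th percentile of the kernel density estimate.]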