Tile generation for big data


Keywords
gehlenborglab, hi-c, higlass
License
MIT
Install
pip install clodius==0.20.1

Documentation

Clodius is a tool for breaking up large data sets into smaller tiles that can subsequently be displayed using an appropriate viewer.

You can provide either a JSON or a tsv (tab-separated value) file as input. If a tsv file is provided, the column names must be passed in using the --column-names (-c) argument. Also required is a parameter which specifies the columns that contain the position of each element in the data set. The input data can be n-dimensional, although given the assumed goal of visualization, we expect most use cases to be limited to 1- or 2-dimensional data sets.

Input format

The input can be either a JSON file (a list of objects) or a tsv (tab-separated value) file. TSV files need to be passed in with column headers (using the --column-names or -c option). Eventually you should be able to pass in a file whose contents list the column names.

Output format

Files

Clodius can output a directory tree filled with tile files. The root of the directory is specified using the --output-dir (or -o) parameter. It contains the tile_info.json file, which holds all of the metadata about the tiling:

{
  "min_importance": 1.0, 
  "min_pos": [
    1.0, 
    1.0
  ], 
  "max_value": 1.0, 
  "min_value": 1.0, 
  "max_zoom": 1, 
  "max_width": 3.0, 
  "max_importance": 1.0, 
  "max_pos": [
    4.0, 
    4.0
  ]
}

Tiles are identified using an ID in the format z.x.y (e.g. 13.165.1765). The first number (z) corresponds to the zoom level, the second (x) to the tile number along the first dimension, and the third (y) to the tile number along the second dimension. If the input data is 1D, the ID has one fewer number (z.x).
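Assuming the usual scheme in which the zoom-0 tile spans the whole domain and each successive zoom level halves the tile width, the tile ID containing a given point can be computed from the tile_info.json metadata. The following is an illustrative sketch, not part of clodius itself:

```python
# Illustrative sketch: compute the z.x.y tile ID containing a point,
# assuming each zoom level halves the tile width and the zoom-0 tile
# spans the whole domain. Not part of the clodius API.

def tile_id(pos, min_pos, max_width, zoom):
    """pos and min_pos are lists with one entry per dimension."""
    tile_width = max_width / 2 ** zoom
    # Half-open intervals: a point exactly on a boundary falls in the next tile.
    coords = [int((p - m) // tile_width) for p, m in zip(pos, min_pos)]
    return '.'.join(str(c) for c in [zoom] + coords)

print(tile_id([2.5, 2.5], [1.0, 1.0], 3.0, 1))  # → '1.1.1'
```

With the metadata shown above (min_pos [1.0, 1.0], max_width 3.0), the point (2.5, 2.5) lands in tile 1.1.1 at zoom level 1.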

Tile boundaries

Tile boundaries are half-open. This can be illustrated using the following code block:

    import json
    import clodius.fpark as cfp
    import clodius.tiles as cti

    sc = cfp.FakeSparkContext
    entries = [{'x1': 1, 'value': 5}, {'x1': 10, 'value': 6}]
    entries = cfp.FakeSparkContext.parallelize(entries)

    tileset = cti.make_tiles_by_importance(sc, entries, ['x1'], max_zoom=0, 
                                           importance_field='value', adapt_zoom=False)

    tiles = tileset['tiles'].collect()
    print(json.dumps(tiles, indent=2))

The output will be:

[
  [
    (0,1),
    {
      "end_pos": [
        10.0
      ],
      "x1": 10,
      "pos": [
        10.0
      ],
      "value": 6
    }
  ],
  [
    (0,0),
    {
      "end_pos": [
        1.0
      ],
      "x1": 1,
      "pos": [
        1.0
      ],
      "value": 5
    }
  ]
]

Notice that even though max_zoom is equal to 0, two tiles are generated. This is because the width of the lowest-resolution tile (zoom_level=0) is equal to the width of the domain of the data. Because tile inclusion intervals are half-open, however, the second point (which lies exactly on the upper boundary) actually falls into a second tile, (0, 1).

Saving Tiles

Compression

The output tiles can be gzipped by using the --gzip option. This will compress each tile file and append .gz to its filename. Using this option is generally recommended whenever tiles contain repetitive data and storage is at a premium.
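Reading a gzipped tile back just requires decompressing it first. A minimal sketch using Python's standard library (the filename here is hypothetical, chosen for illustration):

```python
import gzip
import json

tile = {'dense': [3.0, 0, 2.0, 1.0]}

# Write a tile the way --gzip would: gzip-compressed JSON with a .gz suffix.
# 'tile.json.gz' is a hypothetical filename for illustration.
with gzip.open('tile.json.gz', 'wt') as f:
    json.dump(tile, f)

# Reading it back only requires opening through gzip instead of open().
with gzip.open('tile.json.gz', 'rt') as f:
    assert json.load(f) == tile
```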

Matrix data

Gridded (matrix) data can be output in two forms: sparse and dense. The sparse format is useful for representing matrices where few cells have non-zero values. In this format, we specify the location of each cell that has a value, along with its value.

Sparse Format

Used when there are not enough entries to make it worthwhile to return dense matrix format.

{
  "sparse": [
    {
      "value": 1.0,
      "pos": [
        2.5,
        2.5
      ]
    }
  ]
}

The sparse field contains the actual data points that need to be displayed for this tile. At lower zoom levels, densely packed data needs to be abstracted so that only a handful of data points are shown. This can be done by binning and aggregating (e.g. by summing), or by simply picking the most "important" points in the area that this tile occupies.
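The two aggregation strategies mentioned above can be sketched for 1D points as follows (this is illustrative code, not the clodius API; the point and bin layout are made up for the example):

```python
# Illustrative sketch of the two aggregation strategies for a 1D tile:
# (1) bin and sum, (2) keep only the most "important" points.
from collections import defaultdict

points = [{'pos': 0.5, 'value': 1.0},
          {'pos': 0.6, 'value': 2.0},
          {'pos': 3.5, 'value': 4.0}]
tile_start, tile_width, num_bins = 0.0, 4.0, 2
bin_width = tile_width / num_bins

# Strategy 1: bin the points and sum the values in each bin.
bins = defaultdict(float)
for p in points:
    bins[int((p['pos'] - tile_start) / bin_width)] += p['value']
# bins → {0: 3.0, 1: 4.0}

# Strategy 2: keep only the k most "important" points (here, by value).
top = sorted(points, key=lambda p: p['value'], reverse=True)[:2]
# top keeps the points with values 4.0 and 2.0
```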

Dense Format

The output of a single tile is just an array of values under the key dense.

{
    "dense": [3.0, 0, 2.0, 1.0]
}

The encoding from bin positions to positions in the flat array is performed as follows:

# bin_pos is a tuple of bin coordinates, one per dimension; the first
# dimension is the fastest-varying one in the flat array
for (bin_pos, bin_val) in tile_entries_iterator.items():
    index = sum([bp * bins_per_dimension ** i for i, bp in enumerate(bin_pos)])
    initial_values[index] = bin_val

The translation from index to bin_pos is left as an exercise to the reader.
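For the impatient reader, the inverse mapping can be recovered with repeated divmod, peeling off one bin coordinate per dimension (a sketch, assuming bins_per_dimension and the dimensionality are known):

```python
def index_to_bin_pos(index, bins_per_dimension, num_dimensions):
    # Inverts: index = sum(bp * bins_per_dimension ** i
    #                      for i, bp in enumerate(bin_pos))
    bin_pos = []
    for _ in range(num_dimensions):
        # Each divmod strips off the coordinate of the next dimension,
        # starting with the fastest-varying (first) one.
        index, bp = divmod(index, bins_per_dimension)
        bin_pos.append(bp)
    return bin_pos

print(index_to_bin_pos(3, 2, 2))  # → [1, 1]
```

For the dense example above (bins_per_dimension=2, 2D), index 2 maps back to bin_pos [0, 1], which the forward formula encodes as 0 * 2**0 + 1 * 2**1 = 2.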

Elasticsearch

Tiles can also be saved directly to an Elasticsearch database using the --elasticsearch-nodes and --elasticsearch-path options. Example:

--elasticsearch-nodes localhost:9200 --elasticsearch-path test/tiles

When using Elasticsearch to serve tiles, make sure that cross-origin requests are allowed. This is done by adding the following two lines to your config/elasticsearch.yml file:

http.cors.enabled: true
http.cors.allow-origin: '*'

Usage Example

The directory tree and example files shown above can be generated using the following command:

OUTPUT_DIR=output; rm -rf $OUTPUT_DIR/*; python scripts/make_tiles.py -o $OUTPUT_DIR \
-v count -p pos1,pos2 -c pos1,pos2,count -i count -b 2 --max-zoom 1 \
test/data/smallFullMatrix.tsv

Or, for a more realistic data set:

OUTPUT_DIR=output; rm -rf $OUTPUT_DIR; python scripts/make_tiles.py -o $OUTPUT_DIR -v count -p pos1,pos2 -c pos1,pos2,count -i count -r 5000 -b 128 --max-zoom 14 data/128k.raw; du --max-depth=0 -h output; find output/ | wc -l;

The parameters:

-o: the output directory
-c: the column names in the input file
-p: the columns containing the position of each element
-v: the column containing each element's value
-i: the column used to rank elements by importance
-b: the number of bins along each dimension of a tile
--max-zoom: the maximum zoom level to generate

Tests

There are a limited number of tests provided.

nosetests -s test/ 

Examples

Profiling

To see where this script spends the majority of the time, profile it using:

OUTPUT_DIR=output; rm -rf $OUTPUT_DIR/*; ./scripts/pyprof.sh scripts/make_tiles.py -o $OUTPUT_DIR -v count -p pos1,pos2 -c pos1,pos2,count -i count -b 512 --max-zoom 8 data/32k.raw

Note that to run this, you need to have gprof2dot installed.

Back of the envelope

Human genome: ~4 billion = 2 ^ 32 base pairs

One tile (at highest possible resolution): 256 = 2 ^ 8 base pairs

We would need to allow a zoom level of 24 to display the genome at a resolution of 1 bp / pixel.

To display it at 1K (~2 ^ 10) base pairs per pixel, we would need 14 zoom levels.
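This arithmetic can be checked directly (assuming 256-pixel tiles and a halving of the base pairs per pixel at each zoom level):

```python
import math

genome_bp = 2 ** 32  # ~4 billion base pairs
tile_px = 2 ** 8     # 256 pixels per tile

def zoom_for_resolution(bp_per_pixel):
    # At zoom z, one tile covers genome_bp / 2**z base pairs; solve for the
    # z at which a tile covers tile_px * bp_per_pixel base pairs.
    return int(math.log2(genome_bp / (tile_px * bp_per_pixel)))

print(zoom_for_resolution(1))       # → 24  (1 bp / pixel)
print(zoom_for_resolution(2 ** 10)) # → 14  (~1K bp / pixel)
```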