karps

Experimental Haskell bindings to Spark Datasets and DataFrames


Keywords
apache, library, test, Spark.Core, Spark.Core.Column, Spark.Core.ColumnFunctions, Spark.Core.Context, Spark.Core.Dataset, Spark.Core.Functions, Spark.Core.Internal.Arithmetics, Spark.Core.Internal.ArithmeticsImpl, Spark.Core.Internal.Caching, Spark.Core.Internal.CanRename, Spark.Core.Internal.Client, Spark.Core.Internal.ColumnStandard, Spark.Core.Internal.ComputeDag, Spark.Core.Internal.ContextIOInternal, Spark.Core.Internal.ContextInteractive, Spark.Core.Internal.ContextInternal, Spark.Core.Internal.ContextStructures, Spark.Core.Internal.DAGFunctions, Spark.Core.Internal.DAGStructures, Spark.Core.Internal.DatasetFunctions, Spark.Core.Internal.DatasetStructures, Spark.Core.Internal.Groups, Spark.Core.Internal.Joins, Spark.Core.Internal.LocalDataFunctions, Spark.Core.Internal.ObservableStandard, Spark.Core.Internal.OpFunctions, Spark.Core.Internal.OpStructures, Spark.Core.Internal.Paths, Spark.Core.Internal.PathsUntyped, Spark.Core.Internal.Projections, Spark.Core.Internal.Pruning, Spark.Core.Internal.RowGenericsFrom, Spark.Core.Internal.TypesFunctions, Spark.Core.Internal.TypesGenerics, Spark.Core.Internal.TypesStructures, Spark.Core.Internal.TypesStructuresRepr, Spark.Core.Internal.Utilities, Spark.Core.Row, Spark.Core.StructuresInternal, Spark.Core.Try, Spark.Core.Types, Spark.IO.Inputs
License
Apache-2.0
Install
cabal install karps-0.2.0.0

Documentation

Karps-Haskell - Haskell bindings for Spark Datasets and DataFrames

This project is an exploration vehicle for developing safe, robust and reliable data pipelines over Apache Spark. It consists of multiple sub-projects:

  • a specification to describe data pipelines in a language-agnostic manner, and a communication protocol to submit these pipelines to Spark (a rough sketch of a pipeline node follows this list). The specification currently lives in this repository and uses Protocol Buffers 3, which is also compatible with JSON.
  • a serving library, called karps-server, that implements this specification on top of Spark. It is written in Scala and is loaded as a standard Spark package.
  • a client written in Haskell that sends pipelines to Spark for execution. In addition, this client serves as an experimental platform for whole-program optimization and verification, as well as compiler-enforced type checking.
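
As a rough mental model only, here is a hypothetical Haskell sketch of the kind of information such a pipeline description carries. This is not the actual Protocol Buffers schema, and every field name below is invented for illustration; a pipeline is essentially a graph of nodes, each describing one operation and its parents:

-- Hypothetical sketch only; the real schema is the Protocol Buffers 3
-- definition kept in this repository.
data PipelineNode = PipelineNode
  { nodeName    :: String   -- hypothetical: unique path of the node in the pipeline graph
  , nodeOp      :: String   -- hypothetical: name of the operation to run on Spark
  , nodeParents :: [String] -- hypothetical: paths of the parent nodes
  , nodeOutType :: String   -- hypothetical: data type of the node's output
  } deriving (Show, Eq)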

There is also a separate set of utilities to visualize such pipelines using Jupyter notebooks and IHaskell.

This is a preview; the API may (and will) change in the future.

The name is a play on a tasty fish of the family Cyprinidae, and an anagram of Spark. The programming model is strongly influenced by the TensorFlow project and follows a similar design.
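
For instance (a minimal sketch that only reuses the functions appearing in the interactive example below), describing a computation merely assembles a graph of nodes; Spark is only involved when that graph is explicitly executed:

import Spark.Core.Dataset
import Spark.Core.Context
import Spark.Core.Functions

-- Building nodes is pure and lazy: no Spark job runs here.
let ds = dataset ([1, 2, 3, 4] :: [Int])
let n  = count ds
-- Execution is a separate, explicit step against a running session:
--   createSparkSessionDef defaultConf
--   result <- exec1Def n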

Karps can also take advantage of the Haskell kernel for Jupyter, which provides a better user experience and comes with beautiful introspection tools courtesy of the TensorBoard server. Using TensorBoard, you can visualize, drill down into, and introspect the graph of computations:

[image: TensorBoard visualization of the graph of computations]

Examples

Some notebooks that showcase the current capabilities are in the notebooks directory. Some prerendered versions are also available. Chrome seems to provide the best experience when interacting with the visualizations.

Installation (for users)

These instructions assume that Apache Spark and the Haskell stack build tool are installed on your computer (both are used in the commands below).

Launching Spark locally

Assuming the SPARK_HOME environment variable is set to the location of your current installation of Spark, run:

$SPARK_HOME/bin/spark-shell --packages krapsh:karps-server:0.2.0-s_2.11 \
   --name karps-server --class org.karps.Boot --master "local[1]" -v

You should see a flurry of log messages that ends with something like: WARN SparkContext: Use an existing SparkContext, some configuration may not take effect. The server is now running.

Connecting the Karps-Haskell client

All the integration tests should be able to connect to the server and execute some Spark commands:

stack build
stack test

You are now all set to run your first interactive program:

stack ghci
import Spark.Core.Dataset
import Spark.Core.Context
import Spark.Core.Functions
let ds = dataset ([1, 2, 3, 4] :: [Int])
let c = count ds

createSparkSessionDef defaultConf
mycount <- exec1Def c
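
If you prefer a standalone program to an interactive session, the same pipeline can be written as a small module. This is a minimal sketch assuming the GHCi commands above correspond to ordinary IO actions and that exec1Def returns the counted value directly:

module Main where

import Spark.Core.Dataset
import Spark.Core.Context
import Spark.Core.Functions

main :: IO ()
main = do
  -- Connect to the karps-server started with spark-shell above.
  createSparkSessionDef defaultConf
  -- Build the computation graph: a local dataset and the count of its elements.
  let ds = dataset ([1, 2, 3, 4] :: [Int])
      c  = count ds
  -- Submit the graph and fetch the single observable it defines.
  mycount <- exec1Def c
  -- Assumption: the result comes back as a plain Int with a Show instance.
  print mycount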

Installation (GUI, for users)

IHaskell can be challenging to install, so a Docker installation script is provided. You will need to install Docker on your computer to run Karps with IHaskell.

In the project directory, run:

docker build -t ihaskell-karps .
docker run -it --volume $(pwd)/notebooks:/karps/notebooks \
  --publish 8888:8888  ihaskell-karps

The notebooks directory contains some example notebooks that you can run.

Note that it still requires a Spark server running somewhere else: the Docker container only runs the Haskell part.

MacOS users

When running Docker on OS X, you may need to tell Docker how to communicate from inside a container with the local machine (if you run Spark outside Docker). Here is a command to launch Docker with the appropriate options:

docker run -it --volume $(pwd)/notebooks:/karps/notebooks \
  --publish 8888:8888 --add-host="localhost:10.0.2.2" ihaskell-karps

Standalone Linux installation

The author cannot support the vagaries of operating systems, especially where IHaskell is involved, but here is a setup that has shown some success:

On Ubuntu 16.04, install all the requirements of IHaskell (libgmp3-dev, ghc, ipython, cabal-install, etc.).

In the kraps-haskell directory, run the following commands:

export STACK_YAML=$PWD/stack-ihaskell.yaml
stack setup 7.10.2
# This step may be required, depending on your version of stack.
# You will see it if you encounter some binary link issues.
stack exec -- ghc-pkg unregister cryptonite --force

stack update
stack install ipython-kernel-0.8.3.0
stack install ihaskell-0.8.3.0
stack install ihaskell-blaze-0.3.0.0
stack install ihaskell-basic-0.3.0.0
stack install

ihaskell install --stack
stack exec --allow-different-user -- jupyter notebook --NotebookApp.port=8888 '--NotebookApp.ip=*' --NotebookApp.notebook_dir=$PWD

Status

This project has so far focused on solving the most challenging issues, at the expense of breadth and functionality. That being said, the basic building blocks of Spark are here:

  • dataframes, datasets and observables (the results of collect)
  • basic data types: ints, strings, arrays, structures (both nullable and strict)
  • basic arithmetic operators on columns of data
  • converting between the typed and untyped operations
  • grouping, joining

You can take a look at the notebooks in the notebooks directory to see what is possible currently.

What is missing? A lot of things. In particular, users will most probably miss:

  • an input interface. The only way to use the bindings is currently to pass a list of data.
  • filters
  • long types, floats, doubles
  • broadcasting observables (scalar * col). This one is interesting and is probably the next piece.
  • setting the number of partitions of the data

Contributions

Contributions are most welcome. This is the author's first Haskell project, so all suggestions regarding style, idiomatic code, etc. will be gladly accepted. Also, if someone wants to set up a style checker, it would be really helpful.

Theory

The API and design goals are slightly more general than Spark's. A more thorough explanation can be found in the INTRO.md file.