Data Preparation Library for Spark
A Scala / Java / Python library for cleaning, transforming, and otherwise preparing large datasets on Apache Spark.
It is currently maintained by a team of developers from ThoughtWorks.
Post questions and comments to the Google group, or email them directly to firstname.lastname@example.org
Our aim is to provide a set of algorithms for cleaning and transforming very large datasets,
inspired by predecessors such as OpenRefine, pandas, and scikit-learn.
- Official source code repo: https://github.com/data-commons/prep-buddy
- Scala docs (development version): http://data-commons.github.io/prep-buddy/scaladocs
- Download releases: Latest Release
- Issue tracker: GitHub
- Mailing list: email@example.com
To use this library, add the following Maven dependency to your project:
<dependency>
    <groupId>com.thoughtworks.datacommons</groupId>
    <artifactId>prep-buddy</artifactId>
    <version>0.5.1</version>
</dependency>
For other build tools, check the Maven Repository.
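For sbt, the same coordinates as the Maven dependency above would translate to something like the following (a sketch; verify the exact artifact name and version on the Maven Repository):

```
// build.sbt — assumes the group/artifact/version shown in the Maven snippet
libraryDependencies += "com.thoughtworks.datacommons" % "prep-buddy" % "0.5.1"
```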
If you don't have pip, install it first. Then run:
pip install prep-buddy
To use PySpark on the command line, download the JAR:
pyspark --jars [PATH-TO-JAR]
spark-submit --driver-class-path [PATH-TO-JAR] [YOUR-PYTHON-FILE]
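If you start PySpark programmatically rather than via the shell commands above, Spark 1.x reads the `PYSPARK_SUBMIT_ARGS` environment variable at startup, so the downloaded JAR can be attached that way. A minimal sketch (the jar path below is a placeholder, not a real location):

```python
import os

# Sketch: make the downloaded prep-buddy JAR visible to PySpark without
# passing --jars on every invocation. PYSPARK_SUBMIT_ARGS is consumed by
# pyspark at startup; the trailing "pyspark-shell" token is required.
jar_path = "/path/to/prep-buddy-0.5.1.jar"  # placeholder path (assumption)
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars {} pyspark-shell".format(jar_path)

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Any PySpark context created afterwards in the same process will then pick up the JAR.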
This library is currently built for Spark 1.6.x, but is also compatible with 1.4.x.
The library depends on a few other libraries.
- Apache Commons Math for general math and statistics functionality.
- Apache Spark for all the distributed computation capabilities.
- OpenCSV for parsing CSV files.
- Latest stable release: 0.5.1 (beta).
- To contribute, create a pull request.