io.tarantool:spark-tarantool-connector_2.12

Spark Connector for Tarantool and Tarantool Cartridge


License
BSD-1-Clause

spark-tarantool-connector

Apache Spark connector for Tarantool and Tarantool Cartridge

Building

Build the project using sbt (just run the command sbt test).

Linking

You can link against this library in your Maven project using the following coordinates:

<dependency>
  <groupId>io.tarantool</groupId>
  <artifactId>spark-tarantool-connector</artifactId>
  <version>0.7.0</version>
</dependency>

or for sbt:

libraryDependencies += "io.tarantool" %% "spark-tarantool-connector" % "0.7.0"

Version Compatibility

Connector | Scala   | Apache Spark | Tarantool Server | Tarantool Cartridge Driver
0.x.x     | 2.11.12 | 2.4          | 1.10.9+, 2.4+    | 0.10.1+
0.x.x     | 2.12.16 | 3.2          | 1.10.9+, 2.4+    | 0.10.1+
0.x.x     | 2.13.10 | 3.2          | 1.10.9+, 2.4+    | 0.10.1+

Getting Started

Configuration properties

property-key | description | default value
tarantool.hosts | comma separated list of Tarantool hosts | 127.0.0.1:3301
tarantool.username | basic authentication user | guest
tarantool.password | basic authentication password |
tarantool.connectTimeout | server connect timeout, in milliseconds | 1000
tarantool.readTimeout | socket read timeout, in milliseconds | 1000
tarantool.requestTimeout | request completion timeout, in milliseconds | 2000
tarantool.connections | number of connections established with each host | 1
tarantool.cursorBatchSize | default limit for prefetching tuples in RDD iterator | 1000
tarantool.retries.errorType | configures automatic retry of requests to the Tarantool cluster. Possible values: "network", "none" | none
tarantool.retries.maxAttempts | maximum number of retry attempts for each request. Mandatory if errorType is set to "network" |
tarantool.retries.delay | delay between subsequent retries of each request, in milliseconds. Mandatory if errorType is set to "network" |
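
For example, these connection properties can be set once on the Spark session and are then used by all reads and writes. A minimal sketch; the hosts, credentials, timeout and retry values below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("tarantool.hosts", "127.0.0.1:3301,127.0.0.1:3302")
      .config("tarantool.username", "admin")
      .config("tarantool.password", "password")
      .config("tarantool.connectTimeout", "10000")
      .config("tarantool.requestTimeout", "5000")
      .config("tarantool.retries.errorType", "network")
      .config("tarantool.retries.maxAttempts", "3")
      .config("tarantool.retries.delay", "500")
      .getOrCreate()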

Dataset API request options

property-key | description | default value
tarantool.space | Tarantool space name. Mandatory option |
tarantool.batchSize | limit of records to be read or written at once | 1000
tarantool.stopOnError | stop writing immediately after a batch fails with an exception or not all tuples are written | true
tarantool.rollbackOnError | roll back all changes written in scope of the last batch to a replicaset where an exception occurred | true
tarantool.transformFieldNames | possible values: none (default), snake_case, lower_case, upper_case. Necessary if the field names in datasets built from Spark SQL queries do not correspond to the field names in Tarantool | none
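
These options are passed per read or write call via option(). A minimal sketch, assuming a SparkSession named spark configured as above and a space named test_space; the option values are placeholders:

    import org.apache.spark.sql.SaveMode

    // Read with a per-query batch size
    val df = spark.read
      .format("org.apache.spark.sql.tarantool")
      .option("tarantool.space", "test_space")
      .option("tarantool.batchSize", "512")
      .load()

    // Write with per-write error handling and field name transformation
    df.write
      .format("org.apache.spark.sql.tarantool")
      .mode(SaveMode.Append)
      .option("tarantool.space", "test_space")
      .option("tarantool.stopOnError", "false")
      .option("tarantool.transformFieldNames", "snake_case")
      .save()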

Prerequisites

The Spark connector requires a deployed Tarantool Cartridge application with the tarantool/crud module installed. See the version compatibility table in the previous section.

You may also use an official Tarantool Docker image, but it requires configuring the Cartridge cluster, so it is recommended to take the example configuration from the connector tests.

Example

Using Scala:

    // 1. Set up the Spark session
    val spark = SparkSession.builder()
       .config("tarantool.hosts", "127.0.0.1:3301")
       .config("tarantool.username", "admin")
       .config("tarantool.password", "password")
       .getOrCreate()
    
    val sc = spark.sparkContext
    
    // 2. Load the whole space
    val allTuples: Array[TarantoolTuple] = sc.tarantoolSpace("test_space").collect()

    // 3. Filter using conditions
    // This mapper will be used implicitly for tuple conversion
    val mapper = DefaultMessagePackMapperFactory.getInstance().defaultComplexTypesMapper()
    
    val startTuple = new DefaultTarantoolTupleFactory(mapper).create(List(1).asJava)
    val cond: Conditions = Conditions
        .indexGreaterThan("id", List(1).asJava)
        .withLimit(2)
        .startAfter(startTuple)
    val tuples: Array[TarantoolTuple] = sc.tarantoolSpace("test_space", cond).collect()

    // 4. Load the whole space into a DataFrame
    val df = spark.read
      .format("org.apache.spark.sql.tarantool")
      .option("tarantool.space", "test_space")
      .load()
    
    // Space schema from Tarantool will be used for mapping the tuple fields
    val tupleIDs: Array[Int] = df.select("id").rdd.map(row => row.get(0)).collect()

    // 5. Write a Dataset to a Tarantool space

    // Convert objects to Rows
    val rows = Seq(
      Book(1, null, "Don Quixote", "Miguel de Cervantes", 1605),
      Book(2, null, "The Great Gatsby", "F. Scott Fitzgerald", 1925),
      Book(3, null, "War and Peace", "Leo Tolstoy", 1869)
    ).map(obj => Row(obj.id, obj.bucketId, obj.bookName, obj.author, obj.year))

    // Extract an object schema using built-in Encoders
    val orderSchema = Encoders.product[Book].schema

    // Populate the Dataset
    val ds = spark.createDataFrame(rows, orderSchema)

    // Write to the space. Different modes are supported
    ds.write
      .format("org.apache.spark.sql.tarantool")
      .mode(SaveMode.Overwrite)
      .option("tarantool.space", "test_space")
      .save()

or Java:

    // 1. Set up the Spark context
    SparkConf conf = new SparkConf()
        .set("tarantool.hosts", "127.0.0.1:3301")
        .set("tarantool.username", "admin")
        .set("tarantool.password", "password");

    JavaSparkContext jsc = new JavaSparkContext(conf);

    // 2. Load all tuples from a space using custom tuple to POJO conversion
    List<Book> tuples = TarantoolSpark.contextFunctions(jsc)
        .tarantoolSpace("test_space", Conditions.any(), t -> {
            Book book = new Book();
            book.id = t.getInteger("id");
            book.name = t.getString("name");
            book.author = t.getString("author");
            book.year = t.getInteger("year");
            return book;
        }, Book.class).collect();
    
    // 3. Load all tuples from a space into a Dataset
    SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
    Dataset<Row> ds = spark.read()
        .format("org.apache.spark.sql.tarantool")
        .option("tarantool.space", "test_space")
        .load();

    ds.select("id").rdd().toJavaRDD().map(row -> row.get(0)).collect();
    
    // 4. Write a Dataset to a Tarantool space
        
    // Create the schema first
    StructField[] structFields = new StructField[5];
    structFields[0] = new StructField("id", DataTypes.IntegerType, false, Metadata.empty());
    structFields[1] = new StructField("bucket_id", DataTypes.IntegerType, false, Metadata.empty());
    structFields[2] = new StructField("book_name", DataTypes.StringType, false, Metadata.empty());
    structFields[3] = new StructField("author", DataTypes.StringType, false, Metadata.empty());
    structFields[4] = new StructField("year", DataTypes.IntegerType, true, Metadata.empty());

    StructType schema = new StructType(structFields);

    // Populate the Dataset
    List<Row> data = new ArrayList<>(3);
    data.add(RowFactory.create(1, null, "Don Quixote", "Miguel de Cervantes", 1605));
    data.add(RowFactory.create(2, null, "The Great Gatsby", "F. Scott Fitzgerald", 1925));
    data.add(RowFactory.create(3, null, "War and Peace", "Leo Tolstoy", 1869));

    Dataset<Row> books = spark.createDataFrame(data, schema);

    // Write to the space. Different modes are supported
    books.write()
        .format("org.apache.spark.sql.tarantool")
        .mode(SaveMode.Overwrite)
        .option("tarantool.space", "test_space")
        .save();

Supported DataSet write modes

The following table describes what happens when a DataSet is written in each mode. In all modes it is assumed that all the spaces used in the operation already exist; otherwise an error is produced.

Mode | How it works
Append | If a record with the given primary key exists, it is replaced; otherwise it is inserted.
Overwrite | The space is truncated before writing the DataSet, and then the records are inserted.
ErrorIfExists | If the space is not empty, an error is produced; otherwise the records are inserted.
Ignore | If the space is not empty, no records are inserted and no errors are produced.
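
The write mode is selected with SaveMode, as in the examples above. For instance, to fail fast when the target space already contains data (a sketch reusing the ds Dataset and test_space from the Scala example):

    ds.write
      .format("org.apache.spark.sql.tarantool")
      .mode(SaveMode.ErrorIfExists)
      .option("tarantool.space", "test_space")
      .save()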

Batch writing modes

Batch operations are supported for more efficient writing of data into the Tarantool cluster. They are enabled by default, but the error handling differs depending on the values of the rollbackOnError and stopOnError options.

The rollbackOnError option is simply propagated to the tarantool/crud library methods and currently only allows rolling back the last batch of changes on a single replicaset when an exception has occurred for a tuple from that replicaset. The data successfully written to other replicasets in scope of the failed batch, as well as the data written in previous batches, will remain in place.

The stopOnError option is also propagated to the tarantool/crud library. If it is set to false, writing of batches will continue even when errors occur; the list of errors is returned after an attempt has been made to write all data to the cluster. This variant may be useful for the Append write mode only. If stopOnError is true (the default), batch writing will stop at the next batch after a batch fails with an exception or not all of its tuples were written.
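
For example, to keep writing after per-batch errors and collect them at the end (as noted above, this is useful mainly with the Append mode; a sketch reusing the ds Dataset and test_space from the Scala example):

    ds.write
      .format("org.apache.spark.sql.tarantool")
      .mode(SaveMode.Append)
      .option("tarantool.space", "test_space")
      .option("tarantool.stopOnError", "false")
      .option("tarantool.rollbackOnError", "false")
      .save()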

Learn more