ibmos2spark
The ibmos2spark library facilitates data read/write connections between Apache Spark clusters and the various IBM Object Storage services.
Object Storage Documentation
- Cloud Object Storage
- Cloud Object Storage (IaaS)
- Object Storage OpenStack Swift (IaaS)
- Object Storage OpenStack Swift for Bluemix
Requirements
- Apache Spark with the stocator library
The easiest way to install the stocator library with Apache Spark is to pass the Maven coordinates at launch. Other installation options are described in the stocator documentation.
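As a rough sketch of passing Maven coordinates at launch, the helper below assembles a command line that uses Spark's --packages option. The helper name launch_command is hypothetical, the group and artifact IDs (com.ibm.stocator:stocator) are an assumption to verify against the stocator documentation, and the version is deliberately left for the caller to supply.

```python
# Assumed Maven group and artifact IDs for stocator; verify them against
# the stocator documentation before relying on them.
STOCATOR_GROUP = "com.ibm.stocator"
STOCATOR_ARTIFACT = "stocator"


def launch_command(version, program="pyspark"):
    """Build a Spark launch command that pulls stocator from Maven.

    `version` must be supplied by the caller; no default is assumed.
    `program` can be any Spark launcher that accepts --packages.
    """
    if not version:
        raise ValueError("a stocator version is required")
    coordinates = ":".join([STOCATOR_GROUP, STOCATOR_ARTIFACT, version])
    # --packages asks Spark to resolve the Maven coordinates at launch.
    return [program, "--packages", coordinates]
```

The returned list can be joined with spaces for a shell, or passed directly to subprocess.run.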
Apache Spark at IBM
The stocator and ibmos2spark libraries are pre-installed and available on Apache Spark at IBM.
Languages
The library is implemented for use in Python, R and Scala/Java.
Details
This library does only two things.
- Uses the SparkContext.hadoopConfiguration object to set the appropriate keys to define a connection to an object storage service.
- Provides the caller with a URL to objects in their object store, which is typically passed to a SparkContext method to retrieve data.
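The two steps above can be sketched without a running Spark cluster. This is a minimal illustration, not the library's implementation: FakeHadoopConfiguration, configure, and object_url are hypothetical names, and the key prefix and the swift2d URL scheme are assumptions modeled on stocator's Swift conventions. The real library may set more keys or different ones.

```python
class FakeHadoopConfiguration(dict):
    """Stand-in for the Hadoop configuration object, for illustration only."""

    def set(self, key, value):
        self[key] = value


def configure(hconf, credentials, name):
    """Step 1: write connection settings under a per-connection key prefix."""
    prefix = "fs.swift2d.service." + name  # assumed key layout
    hconf.set(prefix + ".auth.url", credentials["auth_url"])
    hconf.set(prefix + ".tenant", credentials["project_id"])
    hconf.set(prefix + ".username", credentials["user_id"])
    hconf.set(prefix + ".password", credentials["password"])
    hconf.set(prefix + ".region", credentials["region"])


def object_url(name, container_name, object_name):
    """Step 2: build the URL that a Spark read or write call would receive."""
    # Assumed URL shape: scheme://container.connection_name/object
    return "swift2d://{}.{}/{}".format(container_name, name, object_name)


# Example with placeholder credentials.
hconf = FakeHadoopConfiguration()
placeholder = {
    "auth_url": "https://identity.open.softlayer.com",
    "project_id": "",
    "user_id": "",
    "password": "",
    "region": "",
}
configure(hconf, placeholder, "my_bluemix_objectstore")
url = object_url("my_bluemix_objectstore", "some_name", "file_name")
```

Keeping the connection name inside both the configuration keys and the URL is what lets several object stores be configured side by side in one Spark session.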
Example Usage
The following code demonstrates how to use this library in Python to connect to the Cloud Object Storage service, described in the far left pane of the image above.
import ibmos2spark
credentials = {
    'auth_url': 'https://identity.open.softlayer.com',  # your URL might be different
    'project_id': '',
    'region': '',
    'user_id': '',
    'username': '',
    'password': '',
}
configuration_name = 'my_bluemix_objectstore'  # you can give any name you like
bmos = ibmos2spark.bluemix(sc, credentials, configuration_name)  # sc is the SparkContext instance
container_name = 'some_name'
object_name = 'file_name'
data_url = bmos.url(container_name, object_name)
data = sc.textFile(data_url)