alluxio

Alluxio Python library 1.0.0 provides API to interact with Alluxio servers.


Keywords
alluxio, big-data, python, storage
License
Apache-2.0
Install
pip install alluxio==1.0.0

Documentation

Alluxio Python Library

This repo contains the Alluxio Python API to interact with Alluxio servers, bridging the gap between computation frameworks and undelying storage systems. This module provides a convenient interface for performing file system operations such as reading, writing, and listing files in an Alluxio cluster.

Features

  • Directory listing and file status fetching
  • Put data to Alluxio system cache and read from Alluxio system cache (include range read)
  • Alluxio system Load operations with progress tracking
  • Support dynamic Alluxio worker membership services (ETCD periodically refreshing and manually specified worker hosts)

Limitations

Alluxio Python library supports reading from Alluxio cached data. The data needs to either

  • Loaded into Alluxio servers via load operations
  • Put into Alluxio servers via write_page operation.

If you need to read from storage systems directly with Alluxio on demand caching capabilities, please use alluxiofs instead.

Installation

Install from source

cd alluxio-python-library
python setup.py sdist bdist_wheel
pip install dist/alluxio_python_library-0.1-py3-none-any.whl

Usage

Initialization

Import and initialize the AlluxioFileSystem class:

# Minimum setup for Alluxio with ETCD membership service
alluxio = AlluxioFileSystem(etcd_hosts="localhost")

# Minimum setup for Alluxio with user-defined worker list
alluxio = AlluxioFileSystem(worker_hosts="worker_host1,worker_host2")

# Minimum setup for Alluxio with self-defined page size
alluxio = AlluxioFileSystem(
            etcd_hosts="localhost",
            options={"alluxio.worker.page.store.page.size": "20MB"}
            )
# Minimum setup for Alluxio with ETCD membership service with username/password
options = {
    "alluxio.etcd.username": "my_user",
    "alluxio.etcd.password": "my_password",
    "alluxio.worker.page.store.page.size": "20MB"  # Any other options should be included here
}
alluxio = AlluxioFileSystem(
    etcd_hosts="localhost",
    options=options
)

Load Operations

Dataset metadata and data in the Alluxio under storage need to be loaded into Alluxio system cache to read by end-users. Run the load operations before executing the read commands.

# Start a load operation
load_success = alluxio_fs.load('s3://mybucket/mypath/file')
print('Load successful:', load_success)

# Check load progress
progress = alluxio_fs.load_progress('s3://mybucket/mypath/file')
print('Load progress:', progress)

# Stop a load operation
stop_success = alluxio_fs.stop_load('s3://mybucket/mypath/file')
print('Stop successful:', stop_success)

(Advanced) Page Write

Alluxio system cache can be used as a key value cache system. Data can be written to Alluxio system cache via write_page command after which the data can be read from Alluxio system cache (Alternative to load operations).

success = alluxio_fs.write_page('s3://mybucket/mypath/file', page_index, page_bytes)
print('Write successful:', success)

Directory Listing

List the contents of a directory:

"""
contents = alluxio_fs.listdir('s3://mybucket/mypath/dir')
print(contents)

Get File Status

Retrieve the status of a file or directory:

status = alluxio_fs.get_file_status('s3://mybucket/mypath/file')
print(status)

File Reading

Read the entire content of a file:

"""
Reads a file.

Args:
    file_path (str): The full ufs file path to read data from

Returns:
    file content (str): The full file content
"""
content = alluxio_fs.read('s3://mybucket/mypath/file')
print(content)

Read a specific range of a file:

content = alluxio_fs.read_range('s3://mybucket/mypath/file', offset, length)
print(content)

Development

See Contributions for guidelines around making new contributions and reviewing them.