Read GCS and local paths with the same interface, a clone of tensorflow.io.gfile


License
Unlicense

Install
pip install blobfile==0.10.1

Documentation

blobfile

This is a standalone clone of TensorFlow's gfile, supporting both local paths and gs:// (Google Cloud Storage) paths.

The main function is BlobFile, a replacement for GFile. There are also a few additional functions, basename, dirname, and join, which mostly do the same thing as their os.path namesakes, except that they also support gs:// paths.
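
A minimal sketch of those path helpers (the outputs shown in the comments assume they follow their os.path namesakes, with "/" as the separator):

import blobfile as bf

bf.join("gs://my-bucket-name", "cats")   # "gs://my-bucket-name/cats"
bf.dirname("gs://my-bucket-name/cats")   # "gs://my-bucket-name"
bf.basename("gs://my-bucket-name/cats")  # "cats"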

Installation:

pip install blobfile

Usage:

import blobfile as bf

with bf.BlobFile("gs://my-bucket-name/cats", "wb") as w:
    w.write(b"meow!")
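
Reading it back works the same way:

with bf.BlobFile("gs://my-bucket-name/cats", "rb") as r:
    print(r.read())  # b"meow!"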

Here are the functions:

  • BlobFile - like open() but works with gs:// paths too; data is streamed to/from the remote file.
    • Reading is done without downloading the entire remote file.
    • Writing is done to the remote file directly, but only in chunks of a few MB in size. flush() will not cause an early write.
    • Appending is not implemented.
    • You can specify a buffer_size on creation to buffer more data and potentially make reading more efficient.
  • LocalBlobFile - like BlobFile() but operations take place on a local copy of the file.
    • When reading, the file is downloaded during the constructor.
    • When writing, the file is uploaded on close() or during destruction.
    • When appending, the file is downloaded during construction and uploaded on close().
    • You can pass a cache_dir parameter to cache files for reading; you are responsible for cleaning up the cache directory. See the sketch after this list.
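
For example, a sketch of LocalBlobFile with a cache directory (the bucket name and cache path are placeholders, and cache_dir is assumed to be a keyword argument):

import blobfile as bf

# the remote file is downloaded in the constructor; with cache_dir set,
# later reads of the same file can reuse the cached local copy
with bf.LocalBlobFile("gs://my-bucket-name/cats", "rb", cache_dir="/tmp/bf-cache") as f:
    data = f.read()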

Some functions are inspired by existing os.path and shutil functions; a combined example follows the list:

  • copy - copy a file from one path to another; does a remote copy between two remote paths on the same blob storage service
  • exists - returns True if the file or directory exists
  • glob - return files matching a pattern; on GCS this supports only a single * operator, and it can be slow if the * appears early in the pattern, since GCS can only do prefix matches and all additional filtering must happen locally
  • isdir - returns True if the path is a directory
  • listdir - list contents of a directory as a generator
  • makedirs - ensure that a directory and all parent directories exist
  • remove - remove a file
  • rmdir - remove an empty directory
  • rmtree - remove a directory tree
  • stat - get the size and modification time of a file
  • walk - walk a directory tree with a generator that yields (dirpath, dirnames, filenames) tuples
  • basename - get the final component of a path
  • dirname - get the path except for the final component
  • join - join 2 or more paths together, inserting directory separators between each component
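
A short sketch combining a few of these (the paths are hypothetical):

import blobfile as bf

# remote copy between two paths on the same blob storage service
bf.copy("gs://my-bucket-name/cats", "gs://my-bucket-name/cats-copy")

# on GCS, glob supports a single "*"; an early "*" means more local filtering
for path in bf.glob("gs://my-bucket-name/cats*"):
    print(path)

# walk yields (dirpath, dirnames, filenames) tuples, like os.walk
for dirpath, dirnames, filenames in bf.walk("gs://my-bucket-name/"):
    print(dirpath, dirnames, filenames)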

There are a few bonus functions; a short sketch follows the list:

  • get_url - returns a url for a path along with the expiration for that url (or None)
  • md5 - get the md5 hash for a path; for GCS this is fast, but for other backends it may be slow
  • set_log_callback - set a log callback function log(msg: str) to use instead of printing to stdout
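
A sketch of these (assuming get_url returns a (url, expiration) pair, as described above):

import blobfile as bf

# a url for the path plus its expiration (None if it does not expire)
url, expiration = bf.get_url("gs://my-bucket-name/cats")

# fast on GCS, where the md5 is available without reading the file contents
print(bf.md5("gs://my-bucket-name/cats"))

# route log messages to your own handler instead of stdout
bf.set_log_callback(lambda msg: print("[blobfile]", msg))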