This is a standalone clone of TensorFlow's `gfile`, supporting both local paths and `gs://` (Google Cloud Storage) paths.
The main function is `BlobFile`, a replacement for `GFile`. There are also a few additional functions, such as `basename`, `dirname`, and `join`, which mostly do the same thing as their `os.path` namesakes, only they also support `gs://` paths.
Installation:

```sh
pip install blobfile
```
Usage:

```py
import blobfile as bf

with bf.BlobFile("gs://my-bucket-name/cats", "wb") as w:
    w.write(b"meow!")
```
Here are the functions:
- `BlobFile` - like `open()` but works with `gs://` paths too; data is streamed to/from the remote file.
  - Reading is done without downloading the entire remote file.
  - Writing is done to the remote file directly, but only in chunks of a few MB in size. `flush()` will not cause an early write.
  - Appending is not implemented.
  - You can specify a `buffer_size` on creation to buffer more data and potentially make reading more efficient.
- `LocalBlobFile` - like `BlobFile()` but operations take place on a local file.
  - When reading, the file is downloaded during the constructor.
  - When writing, the file is uploaded on `close()` or during destruction.
  - When appending, the file is downloaded during construction and uploaded on `close()` or during destruction.
  - You can pass a `cache_dir` parameter to cache files for reading. You are responsible for cleaning up the cache directory.
Some are inspired by existing `os.path` and `shutil` functions:
- `copy` - copy a file from one path to another; will do a remote copy between two remote paths on the same blob storage service
- `exists` - returns `True` if the file or directory exists
- `glob` - return files matching a pattern; on GCS this only supports a single `*` operator. In addition, it can be slow if the `*` appears early in the pattern, since GCS can only do prefix matches; all additional filtering must happen locally
- `isdir` - returns `True` if the path is a directory
- `listdir` - list the contents of a directory as a generator
- `makedirs` - ensure that a directory and all parent directories exist
- `remove` - remove a file
- `rmdir` - remove an empty directory
- `rmtree` - remove a directory tree
- `stat` - get the size and modification time of a file
- `walk` - walk a directory tree with a generator that yields `(dirpath, dirnames, filenames)` tuples
- `basename` - get the final component of a path
- `dirname` - get the path except for the final component
- `join` - join 2 or more paths together, inserting directory separators between each component
There are a few bonus functions:
- `get_url` - returns a URL for a path along with the expiration for that URL (or `None`)
- `md5` - get the MD5 hash for a path; for GCS this is fast, but for other backends this may be slow
- `set_log_callback` - set a log callback function `log(msg: string)` to use instead of printing to stdout