Parquet tools for CJWorkbench.
Workbench modules may optionally depend on the latest version of this Python
package for its cjwparquet.api.* functions.
Installation
This is meant to be used within a Docker container. It depends on executables
/usr/bin/parquet-to-arrow and /usr/bin/parquet-to-text-stream.
Your Dockerfile might look something like this:
FROM workbenchdata/parquet-tools:v2.1.0 AS parquet-tools
FROM python:3.8.5-buster AS main
COPY --from=parquet-tools /usr/bin/parquet-to-arrow /usr/bin/parquet-to-arrow
COPY --from=parquet-tools /usr/bin/parquet-to-text-stream /usr/bin/parquet-to-text-stream
# And now that these binaries are accessible, you can install cjwparquet...
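Since the package depends on those two executables, a container can check for them before cjwparquet is used. The following is a minimal sketch and not part of cjwparquet: the helper name `missing_executables` is hypothetical, and the only paths it knows about are the two named above.

```python
import os
from typing import Iterable, List

# The two executables this README says cjwparquet depends on.
REQUIRED_EXECUTABLES = (
    "/usr/bin/parquet-to-arrow",
    "/usr/bin/parquet-to-text-stream",
)


def missing_executables(paths: Iterable[str] = REQUIRED_EXECUTABLES) -> List[str]:
    """Return the paths that are not regular, executable files.

    Hypothetical helper for illustration; cjwparquet does not provide it.
    """
    return [
        path
        for path in paths
        if not (os.path.isfile(path) and os.access(path, os.X_OK))
    ]
```

Calling `missing_executables()` inside the container returns an empty list when both binaries were copied correctly, which makes a broken Dockerfile easier to spot than a failure deep inside a read call.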
Usage
from pathlib import Path
import cjwparquet
import pyarrow
# Write a Parquet file
cjwparquet.write(Path("test.parquet"), pyarrow.table({"A": ["foo", "bar"]}))
# Test whether a file looks like a Parquet file
if cjwparquet.file_has_parquet_magic_number(Path("test.parquet")):
    # Read a Parquet file
    with cjwparquet.open_as_mmapped_arrow(Path("test.parquet")) as table:
        assert table.to_pydict() == {"A": ["foo", "bar"]}
# Convert to text
text = cjwparquet.read_slice_as_text(
    Path("test.parquet"),
    format="csv",
    only_columns=range(0, 20),
    only_rows=range(0, 200),
)
assert text == "A\nfoo\nbar"
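For readers wondering what a "magic number" test involves: the Parquet format places the 4-byte marker PAR1 at both the start and the end of an unencrypted file. The sketch below illustrates that idea only. It is not cjwparquet's implementation, the function name `looks_like_parquet` is hypothetical, and the real `file_has_parquet_magic_number` may check differently.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def looks_like_parquet(path: Path) -> bool:
    """Illustrative check only; not cjwparquet's actual implementation."""
    # Smallest plausible file: leading magic, 4-byte footer length, trailing magic.
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 == os.SEEK_END: position 4 bytes before the end
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A check like this only says the file is framed like Parquet; it does not validate the footer or the data pages.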
Developing
- Run tests: docker build .
- Write a failing unit test in tests/
- Make it pass by editing code in cjwparquet/
- black cjwparquet tests && isort cjwparquet tests
- Submit a pull request
Be very, very, very careful to preserve a consistent API. Workbench will upgrade this dependency without module authors' explicit consent. Add new features; fix bugs. Never change functionality.
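One way to guard against accidental API changes is a unit test that pins the parameters a public function accepts. The sketch below is an assumption-laden illustration: `read_slice_as_text_stub` is a stand-in that borrows the keyword names from the Usage example (the name of the first parameter, `path`, is a guess), and `assert_accepts_parameters` is a hypothetical helper, not something in this repository.

```python
import inspect
from typing import Callable, Iterable


def read_slice_as_text_stub(path, format, only_columns, only_rows):
    """Stand-in for a public function; NOT the real cjwparquet signature."""
    return ""


def assert_accepts_parameters(func: Callable, expected_names: Iterable[str]) -> None:
    """Fail if func no longer accepts one of the named parameters (hypothetical helper)."""
    params = inspect.signature(func).parameters
    missing = [name for name in expected_names if name not in params]
    assert not missing, f"{func.__name__} lost parameters: {missing}"


# New parameters may be added later; existing ones must never disappear.
assert_accepts_parameters(
    read_slice_as_text_stub,
    ["path", "format", "only_columns", "only_rows"],
)
```

A test in this style allows additions and fails on removals or renames, which matches the rule of adding features without changing existing behavior.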
Publishing
- Write a new version= to setup.py. Adhere to semver. (As changes must be backwards-compatible, the version will always start with 1 and look like 1.x.y.)
- Prepend notes to CHANGELOG.md about the new version
- git commit
- git tag v1.x.y
- git push --tags && git push
- Wait for Travis to push our changes to PyPI
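The versioning rule above (always 1.x.y, tagged as v1.x.y) can be checked mechanically before tagging. This is a minimal sketch under that rule; the names `is_valid_release_version` and `tag_for` are hypothetical and not part of this repository or its release tooling.

```python
import re

# Exactly "1.x.y": major version fixed at 1, x and y are integers without leading zeros.
VERSION_PATTERN = re.compile(r"1\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def is_valid_release_version(version: str) -> bool:
    """Return True if version has the form 1.x.y (hypothetical helper)."""
    return VERSION_PATTERN.fullmatch(version) is not None


def tag_for(version: str) -> str:
    """Build the git tag name for a release, rejecting versions outside 1.x.y."""
    if not is_valid_release_version(version):
        raise ValueError(f"expected a version like 1.x.y, got {version!r}")
    return "v" + version
```

For example, `tag_for("1.4.2")` yields the tag name to pass to git tag, while a version such as "2.0.0" raises ValueError because a major bump would signal a breaking change.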