Python components for DataONE clients and servers.
See the documentation on ReadTheDocs.
v2 and v1 API
- DataONE Generic Member Node: PyPI – Docs
- DataONE Client Library for Python: PyPI – Docs
- DataONE Common Library for Python: PyPI – Docs
- DataONE Test Utilities: PyPI – Docs
- DataONE Command Line Client (CLI): PyPI – Docs
- DataONE ONEDrive: PyPI – Docs
- DataONE Certificate Extensions: PyPI
- DataONE Gazetteer: PyPI
- DataONE Ticket Generator: PyPI
- Google Foresite Toolkit: PyPI
Pull Requests (PRs) are welcome! Before you start coding, feel free to reach out to us and let us know what you plan to implement. We might be able to point you in the right direction.
We try to follow PEP8, with the main exception being that we use two instead of four spaces per indent.
To help keep the style consistent and commit logs, blame/praise and other code annotations accurate, we use the following
pre-commit hooks to automatically format and check Python scripts before committing to GitHub:
- YAPF - PEP8 formatting with DataONE modifications
- isort - Sort and group imports
- trailing-whitespace - Remove trailing whitespace
- Flake8 - Lint, code and style validation
Configuration files for YAPF (./.style.yapf), isort (./.isort.cfg) and Flake8 (./.flake8) are included, and show the formatting options we use.
Contributors are encouraged to set up the hooks before creating PRs. This can be done automagically with pre-commit, for which a configuration file is also included.
To set up automatic validation and formatting:
$ sudo pip install pre-commit
$ cd <a folder in the Git working tree for the repository>
$ pre-commit autoupdate --bleeding-edge
$ pre-commit install
--bleeding-edge is required as shown above at the time of writing, Oct 2018.
If the formatting hooks, such as trailing-whitespace, modify any of the files being committed, the hooks will show as Failed and the commit is aborted. This provides an opportunity to examine the reformatted files and run the unit and integration tests again in order to make sure the reformat did not break anything. The modified files can then be staged and committed again. If no new modifications have been made, the commit then goes through, with the hooks showing a status of Passed.
A convenient command to "restage" the files modified by pre-commit:
$ git update-index --again
Or, to add a shortcut:
$ git config --global alias.restage "update-index --again"
$ git restage
Flake8 only performs validation, not formatting. If validation fails, the issues should be fixed before committing. The modifications may then trigger a new round of formatting by trailing-whitespace, thus requiring the files to be staged and committed again.
If desired, the number of extra staging and commits caused by reformatting and validation can be reduced with workflow adjustments:
- Trailing whitespace: Use an editor that can strip trailing whitespace on save. E.g., for PyCharm, this setting is at Editor > General > Strip trailing spaces on Save.
- YAPF formatting: Call YAPF manually on the file before commit. YAPF searches from the current directory and up in the tree for configuration files. So, as long as the current directory is the repository root or below, YAPF should pick up and use the configuration that is included in the repository. To call YAPF manually, it can either be installed separately, or an alias can be set up to call the version that pre-commit has installed into its own venv.
- Flake8 validation: The same procedure as for YAPF can be used, as Flake8 searches for its configuration file in the same way. In addition, IDEs can typically do code inspections and tag issues directly in the UI, where they can be handled before commit.
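The upward search for configuration files described above can be pictured with a minimal sketch. The function name `find_config_upward` is hypothetical, and this is not the actual lookup code of YAPF or Flake8. It only illustrates why running the tools from the repository root or any folder below it should pick up the included configuration.

```python
from pathlib import Path


def find_config_upward(start_dir, file_name):
    """Return the first `file_name` found in `start_dir` or a parent, else None.

    Hypothetical helper for illustration; not part of YAPF or Flake8.
    """
    current = Path(start_dir).resolve()
    # Check the starting directory first, then each parent up to the root.
    for directory in [current, *current.parents]:
        candidate = directory / file_name
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_config_upward('.', '.style.yapf')` called from a subfolder of the working tree would return the path of the file in the repository root, provided no closer copy exists.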
Testing is based on the pytest unit test framework.
Most of our tests work by serializing objects generated by the code being tested and comparing them with reference samples stored in files. This allows us to check all properties of generated objects without having to write asserts that check individual properties, eliminating a time consuming and repetitive part of the test writing process.
When writing comparisons manually, one will often select a few properties to check, and when those are determined to be valid, the remaining values are assumed to be correct as well. By comparing complete serialized versions of the objects, we avoid such assumptions.
By storing the expected serialized objects in files instead of in the unit tests themselves, we avoid embedding hard coded documents inside the unit test modules and make it simple to automatically update the expected contents of objects as the code evolves.
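As a rough illustration of the sample-based approach, the sketch below serializes an object and compares it with a reference file. The helper name `assert_matches_sample`, the use of sorted JSON as the serialization, and the `update` flag are all assumptions made for this example and not the project's actual test utilities.

```python
import json
from pathlib import Path


def assert_matches_sample(obj, sample_path, update=False):
    """Compare the serialized form of `obj` with a stored sample file.

    Hypothetical helper. Sorted, indented JSON is assumed as the serialized
    form so that the comparison covers every property of the object.
    """
    actual = json.dumps(obj, indent=2, sort_keys=True)
    sample_path = Path(sample_path)
    if update:
        # Write or overwrite the sample instead of failing.
        sample_path.write_text(actual, encoding='utf-8')
        return
    # A missing sample raises FileNotFoundError here.
    expected = sample_path.read_text(encoding='utf-8')
    assert actual == expected, 'Mismatch with sample: {}'.format(sample_path)
```

A test then reduces to a single call such as `assert_matches_sample(result, 'test_docs/my_result.json')`, with no per-property asserts.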
When unit tests are being run as part of CI or as a normal guard against regressions in a local development environment, any mismatches between actual and expected serialized versions of objects simply trigger test failures. However, when a test is initially created or the serialized version of an object is expected to change, tests can automatically write or update the sample files they use. This function is enabled by starting
pytest with the
--sample-ask switch. When enabled, missing or mismatched sample files will not trigger test failures, instead starting an interactive process where differences are displayed together with yes/no prompts for writing or updating the samples. By default, differences are displayed in a GUI window using
kdiff3, which provides a nice color coded view of the differences.
The normal procedure for writing a sample based unit test is to just write the test as if the sample already exists, then running the test with
--sample-ask and viewing and approving the resulting sample, which is then automatically written to a file. The sample file name is displayed, making it easy to find the file in order to add it to tracking so that it can be committed along with the test module.
When working on large changes that cause many samples to become outdated, reviewing and approving samples can be deferred until the new code approaches stability. This is done by running the tests with
--sample-update, which automatically writes or updates samples to match the current results. Then, view and approve the tests with
--sample-review before committing.
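Custom switches like these are commonly registered through pytest's `pytest_addoption` hook in a `conftest.py` file. The sketch below shows that general pattern only. The help strings are assumptions, and the project's actual `conftest.py` may be organized differently.

```python
def pytest_addoption(parser):
    """Register the sample switches (sketch; help texts are assumptions)."""
    parser.addoption(
        '--sample-ask', action='store_true', default=False,
        help='Show diffs and prompt before writing or updating samples',
    )
    parser.addoption(
        '--sample-update', action='store_true', default=False,
        help='Write or update samples without prompting',
    )
    parser.addoption(
        '--sample-review', action='store_true', default=False,
        help='Review samples before committing',
    )
```

Test code can then read a switch with, e.g., `request.config.getoption('--sample-ask')`.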
Typically, it is not desirable to track generated files in Git. However, although the sample files are generated, they are an integral part of the unit tests, and should be tracked just like the unit tests themselves.
Also implemented is a simple process for cleaning out unused sample files. Sample files are often orphaned when their corresponding tests are removed or refactored. The process is activated with the
--sample-tidy switch. When active, the test session starts by moving all sample files from their default directory,
test_docs, to
test_docs_tidy. As the sample files are accessed by tests, they are automatically moved back to
test_docs, and any files remaining in
test_docs_tidy after a complete test run can be untracked and deleted.
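The tidy process can be sketched as two small steps: move everything out at session start, then move each sample back when a test asks for it. The function names `start_tidy` and `open_sample` are hypothetical and only illustrate the idea; they are not the project's implementation.

```python
import shutil
from pathlib import Path


def start_tidy(sample_dir, tidy_dir):
    """Move every sample file into the tidy directory at session start."""
    sample_dir, tidy_dir = Path(sample_dir), Path(tidy_dir)
    tidy_dir.mkdir(parents=True, exist_ok=True)
    # Materialize the listing first so that moving files does not disturb it.
    for path in list(sample_dir.iterdir()):
        if path.is_file():
            shutil.move(str(path), str(tidy_dir / path.name))


def open_sample(sample_dir, tidy_dir, file_name):
    """Return the sample path, moving the file back from tidy_dir if needed."""
    sample_path = Path(sample_dir) / file_name
    tidy_path = Path(tidy_dir) / file_name
    if not sample_path.exists() and tidy_path.exists():
        shutil.move(str(tidy_path), str(sample_path))
    return sample_path
```

After a complete run, whatever is still in the tidy directory was never requested by a test and is a candidate for removal.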
After making changes in
test_docs, stage the directory so that new files are included and deleted files are removed on the server:
$ git add test_utilities/src/d1_test/test_docs
$ git commit -m 'Update samples'
DataONE Client to Django test adapter
GMN tests are based on an adapter that enables using d1_client with the Django test framework. The adapter mocks Requests to issue requests through the Django test client.
Django includes a test framework with a test client that provides an interface that's similar to that of an HTTP client, but calls Django internals directly. The client enables testing of most functionality of a Django app without actually starting the app as a network service.
For testing GMN's D1 REST interfaces, we want to issue the test requests via the D1 MN client. Without going through the D1 MN client, we would have to reimplement much of what the client does, related to formatting and parsing D1 REST requests and responses.
This module is typically used in tests running under django.test.TestCase and requires an active Django context, such as the one provided by
Command line switches
We have added some custom functionality to pytest which can be enabled by launching pytest with the following switches:
--sample-ask: Enable a mode that displays diffs and, after user confirmation, can automatically update or write new test sample documents on mismatches.
--pycharm: Automatically open files where errors occur and move the cursor to the line of the error. Show syntax highlighted diffs for scripts and data files using PyCharm's powerful diff viewer. Also requires the path to the PyCharm binary to be configured. See
./conftest.py for implementation and notes.
parameterize_dict: Support for parameterizing test functions by adding a dict class member containing parameter sets.
Note: None of these switches can be used when running tests in parallel with xdist (
Debugging tests with PyCharm
By default, the PyCharm
Run context configuration (Ctrl+Shift+F10) will generate test configurations and run the tests under the native unittest framework in Python's standard library. This will cause the tests to fail, as they require pytest. To generate pytest configurations by default, set
Settings > Tools > Python Integrated Tools > Default test runner to pytest. See the documentation for details.
Generate and run a configuration for a specific test by placing the cursor on a test function name and running
Run context configuration (Ctrl+Shift+F10).
After generating the configuration, debug with
If running the tests outside of PyCharm, launching pytest with the
--pycharm switch will cause
pytest to attempt to move the cursor in PyCharm to the location of any test failures as they occur. This should be used with the
Stopping a test that has hit a breakpoint in PyCharm can cause the test database to be left around. On the next run, Django will then prompt the user to type "yes" to remove the database. The prompt appears in the PyCharm debug console output. To disable the prompt, go to
Run / Debug Configurations > Edit Configurations > Defaults > Django tests > Options and add
--noinput. See the question on SO for details.
pytest by default captures
stderr output for the tests and only shows the output for the tests that failed after all tests have been completed. Since a test that hits a breakpoint has not yet failed, this hides any output from tests being debugged and also hides output from the debug console prompt (where Python script can be evaluated in the current context). To see the output while debugging, go to
Run / Debug Configurations > Edit Configurations > Defaults > pytest > Additional Arguments and add
--capture=no. Also add an environment variable
JB_DISABLE_BUFFERING and set it to
--capture=no --exitfirst --verbose. Verbosity can also be increased by adding one or more
Each unit test is implicitly wrapped in a database transaction and I have not found a way around this. The effect is that it's cumbersome to check the current state of the database while at a breakpoint or stepping through tests. PyCharm's database tools will only see the database as it was before the test was started. The only workaround I've found is to manually issue queries from within the current context, using the PyCharm console. While stepping through the test, bring up the console,
View > Tool Windows > Python Console, and click
Show Python Prompt. Then submit queries with, e.g.,
> self.run_django_sql('select count(*) from app_scienceobject'). Write them in the database console to get the code completion and other features, then copy it into a call in the Python console. If an invalid query is submitted, the current database transaction will be lost. If there is no output when running commands in the console, it's due to the output being captured by pytest. See above.
The settings in
settings_test.py are optimized for testing and debugging, while the settings in
settings_template.py are optimized for production. To use
settings_test.py when debugging tests in PyCharm, go to
Run / Debug Configurations > Edit Configurations > Defaults > pytest > Environment variables, add
DJANGO_SETTINGS_MODULEand set it to
Testing of the GMN Django web app is based on pytest and pytest-django.
The tests use
settings_test.py for GMN and Django configuration.
pytest_django/plugin.py. To set
settings.DEBUG, override it close to where it will be read, e.g., with
Django database test fixture
The GMN tests run in the context of a database that has been prepopulated with randomized data. The fixture file for the database is a JSON file stored in
Set up a blank database to be populated with test data:
$ sudo -u postgres dropdb --if-exists gmn_test_db_template
$ sudo -u postgres createdb -E UTF8 --owner=<your user name> gmn_test_db_template
$ ./gmn/src/d1_gmn/manage.py migrate --settings settings_test --database template --run-syncdb
Regenerate the fixture file:
gmn_test_db_template, must match the name of the database that is set up for the dict key
After changing any of the ORM classes in models.py, the database test fixture must be regenerated. This will often cause sample files to have to be updated as well.
Fixtures can be loaded directly into the test database from the JSON files but it's much faster to keep an extra copy of the db as a template and create the test db as needed with Postgres' "create database from template" function. So we only load the fixtures into a template database and reuse the template. This is implemented in
Science object bytes are stored on disk, so they are not captured in the db fixture. If a test needs get(), getChecksum() and replica() to work, it must first create the correct file in GMN's object store or mock object store reads. The bytes are predetermined for a given test PID. See
Setting up the development environment
These instructions are tested on Linux Mint 18 and should also work on close derivatives.
Install packaged dependencies
$ sudo apt update
$ sudo apt -fy dist-upgrade
$ sudo apt install -y python-setuptools libssl-dev postgresql postgresql-server-dev-all git
$ sudo apt install -y python3-dev python3-venv
$ python3 -m venv venv
$ sudo apt install -y python-dev python-virtualenv
Python 3 and Python 2
$ . ./venv/bin/activate
Download the source from GitHub:
$ git clone https://github.com/DataONEorg/d1_python.git
Add the DataONE packages to the Python path, and install their dependencies:
$ cd ~/d1_python
$ sudo ./dev_tools/src/d1_dev/setup-all.py --root . develop
$ sudo apt install --yes postgresql
Set the password of the postgres superuser account:
$ sudo passwd -d postgres
$ sudo su postgres -c passwd
When prompted for the password, enter a new superuser password (and remember it :-).
$ sudo -u postgres createdb -E UTF8 gmn2
$ sudo -u postgres createuser --superuser `whoami`
PyCharm (and other IntelliJ based platforms) are not able to connect to databases over local (UNIX) sockets. Postgres' convenient "peer" authentication type only works over local sockets. A convenient workaround for this is to set Postgres up to trust local connections made over TCP/IP.
$ sudo editor /etc/postgresql/10/main/pg_hba.conf
host all all 127.0.0.1/32 trust
A similar line for MD5 may already be present and, if so, must be commented out.
Run the following commands (all sections), except, change the location for openssl.cnf, so the line that copies it becomes:
$ sudo cp <your_d1_python_path>/d1_mn_generic/src/deployment/openssl.cnf .
Run the tests and verify that they all pass:
Set up credentials for working with the DataONE account on PyPI:
[server-login]
username: dataone
password: <secret>
Creating a new release
TODO: Move from pip to pipenv. https://docs.pipenv.org/
Update all packages managed by pip:
$ cd d1_python
$ ./dev_tools/src/d1_dev/pip-update-all.py
The requirements.txt file contains a list of packages with pinned versions. It designates the exact Python environment in which the unit tests will run in CI builds.
$ pip freeze > requirements.txt
The DataONE Python stack specifies the versions that were tested in CI builds before release as the lowest required versions, and allows any later versions to be installed as part of regular maintenance.
As updating the versions in the
setup.py files manually is time consuming and error prone, a script is included that automates the task. The script updates the version information for the dependencies in the
setup.py files to match the versions of the currently installed dependencies. Run the script with:
$ cd d1_python
$ src-sync-dependencies.py . <version>
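The kind of rewrite such a script performs can be sketched as follows. This is not the project's script: the function name `sync_min_versions`, the assumed `'name >= version'` requirement format, and the dict of installed versions are all assumptions made for illustration.

```python
import re


def sync_min_versions(setup_py_text, installed_versions):
    """Rewrite 'name >= version' requirement strings to the installed versions.

    Hypothetical sketch. `installed_versions` maps lower case package names
    to version strings, e.g. as collected from `pip freeze`.
    """
    pattern = re.compile(
        r"(?P<quote>['\"])(?P<name>[A-Za-z0-9_.\-]+)\s*>=\s*[^'\"]+(?P=quote)"
    )

    def replace(match):
        name = match.group('name')
        version = installed_versions.get(name.lower())
        if version is None:
            # Leave requirements for packages that are not installed untouched.
            return match.group(0)
        quote = match.group('quote')
        return '{}{} >= {}{}'.format(quote, name, version, quote)

    return pattern.sub(replace, setup_py_text)
```

Using `>=` keeps the tested version as the lowest required version while still allowing later versions to be installed.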
The <version> argument specifies what the version will be for the release, e.g.,
"2.3.1". We keep the version numbers in sync between all of the packages in the d1_python Git repository, so only one version string needs to be specified.
Check that there are no package version conflicts:
$ pip check
Commit and push the changes, and check the build on Travis.
Building the release packages
After successful build, clone a fresh copy, which will be used for building the release packages:
$ cd ~
$ rm -rf ~/d1_python_build
$ git clone git@github.com:DataONEorg/d1_python.git d1_python_build
Building the release packages from a fresh clone is a simple way of ensuring that only tracked files are released. It is a workaround for the way setuptools works, which is basically that it vacuums up everything that looks like a Python script in anything that looks like a package, which makes it easy to publish local files by accident.
Build and publish the packages:
$ cd ~/d1_python_build
$ setup-all.py --root . bdist_wheel upload
Building the documentation
When d1_python is pushed to GitHub, a signal is sent by GitHub to ReadTheDocs.org, which automatically retrieves the new version of the project from GitHub, builds the documentation and makes it available at
So it is not absolutely necessary to have a local build environment set up for the documentation, but building locally provides faster feedback when making changes that need to be checked before publishing.
Clear out the installed libraries and reinstall:
$ sudo rm -rf /usr/local/lib/python2.7/dist-packages/d1_*
$ sudo nano /usr/local/lib/python2.7/dist-packages/easy-install.pth
Remove all lines that match dataone.*.egg and all lines that are paths to your d1_python.