Synthetic data generation pipeline leveraging a Differentially Private Variational Auto Encoder assessed using a variety of metrics


Keywords
synthetic, data, privacy, fairness, machine, learning, nhs, pytorch, variational-autoencoder
License
MIT
Install
pip install nhssynth==0.3.0

NHS Synth

About

This repository currently consists of a Python package alongside research and investigative materials covering the effectiveness of the package and synthetic data more generally when applied to NHS use cases. See the internal project description for more information.

Getting Started

Project Structure

  • The main package and codebase are found in src/nhssynth (see Usage below for more information)
  • Accompanying materials are available in the docs folder:
    • The components used to create the GitHub Pages documentation site
    • A report summarising the previous iteration of this project
    • A model card providing more information about the VAE with Differential Privacy
  • Numerous exemplar configurations are found in config
  • Empty data and experiments folders are provided; these are the default locations for inputs and outputs when running the project using the provided CLI module
  • Pre-processing notebooks for specific datasets used to assess the approach and other non-core code can be found in auxiliary

Installation

For general usage, we recommend installing the package via pip install nhssynth in an environment running a supported version of Python. You can then import the package's modules and use them in your projects, or interact with the package directly via the CLI, which is accessed using the nhssynth command (see Usage for more information).

Secure Mode

Note that to train a generator in secure mode (see the documentation for details), you will need to install the PyTorch extension package csprng separately. This package's dependencies are currently incompatible with recent versions of PyTorch (the authors plan to rectify this; watch this space), so you will need to install it manually. We recommend following the instructions below:

git clone git@github.com:pytorch/csprng.git
cd csprng
git branch release "v0.2.2-rc1"
git checkout release
python setup.py install
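
After installing, it can be worth confirming that the extension is importable from the environment you intend to train in. The snippet below is a minimal sketch, not part of nhssynth: module_available is a hypothetical helper, and the module name torchcsprng is an assumption about what the csprng build installs.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located on the current import path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for missing parent packages or malformed module specs.
        return False


# Assumption: the csprng build installs a module called "torchcsprng".
if module_available("torchcsprng"):
    print("csprng extension found")
else:
    print("csprng extension not found; secure mode will likely be unavailable")
```

The check only locates the module and does not import it, so it will not detect a build that is present but incompatible with the installed PyTorch.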

Advanced Installation

If you intend to contribute to or work with the codebase directly, or if you want to reproduce the results of this project, follow the steps below:

  1. Clone the repo

  2. Ensure one of the required versions of Python is installed

  3. Install poetry and either:

    • Skip to step 4 (and let poetry manage the installation's virtual environment in its own way)

    • Change poetry's configuration to manage your own virtual environments:

      poetry config virtualenvs.create false
      poetry config virtualenvs.in-project false

      You can now create a virtual environment in the usual way (e.g. via python -m venv nhssynth) and activate it via source nhssynth/bin/activate before moving to the next step.

  4. Install the project dependencies with poetry install (add optional flags: --with dev when developing and testing the package, --with aux to work with the auxiliary notebooks, --with docs to work with the documentation)

  5. You can then interact with the package in one of two ways:

    • Via the CLI module, which is accessed using the nhssynth command, e.g.

      poetry run nhssynth ...

      Note that you can omit the poetry run part and just type nhssynth if you followed the optional steps above to manage and activate your own virtual environment, or if you have executed poetry shell beforehand.

    • Through directly importing parts of the package to use in an existing project (from nhssynth.modules... import ...).
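
When scripting around the CLI, the choice between nhssynth and poetry run nhssynth described in step 5 can be made automatically. The following is a sketch under stated assumptions: nhssynth_command is a hypothetical helper, not part of the package, and it treats the presence of an nhssynth executable on PATH as meaning an environment is already active.

```python
import shutil


def nhssynth_command(args, which=shutil.which):
    """Build an argv list for the nhssynth CLI.

    Uses the bare `nhssynth` executable when it is on PATH (for example
    inside an activated virtual environment or after `poetry shell`) and
    falls back to `poetry run nhssynth` otherwise.
    """
    if which("nhssynth"):
        prefix = ["nhssynth"]
    else:
        prefix = ["poetry", "run", "nhssynth"]
    return prefix + list(args)


# Show the command that would be used to print the CLI help.
print(nhssynth_command(["--help"]))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to execute the command.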

Usage

CLI

This package comprises a set of modules that can be run using the CLI individually, as part of a pipeline, or via a configuration file. These options are available via the aforementioned (poetry run) nhssynth command:

nhssynth <module name> --<args>
nhssynth pipeline --<args>
nhssynth config -c <name> --<overrides>

To see the modules that are available and their corresponding arguments, run nhssynth --help and nhssynth <module name> --help respectively.

Configuration files can be used to run the pipeline or a chosen set of modules. They should be placed in the config folder and their layout should match that of the examples provided. They can be run as in the last of the three commands above by providing their filename (without the .yaml extension). You can also override any of the arguments provided in the configuration file by passing them as arguments on the command line.
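
Since configurations are referred to by filename without the .yaml extension, a short script can list the names available to pass to -c. This is a minimal sketch: available_configs is a hypothetical helper, and it assumes you run it from the repository root so that the config folder resolves.

```python
from pathlib import Path


def available_configs(config_dir="config"):
    """Return the YAML filenames in `config_dir` with the extension removed."""
    return sorted(path.stem for path in Path(config_dir).glob("*.yaml"))


# Each printed name could be supplied as `nhssynth config -c <name>`.
for name in available_configs():
    print(name)
```

If the folder does not exist, the function returns an empty list rather than raising an error.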

To ensure reproducibility, you should always specify a --seed value and provide the --save-config flag to dump the exact configuration specified or inferred for the run (missing options will be populated in the saved config, so it may be larger than one you would write yourself). This configuration file can then be used in the future to reproduce the exact same run, or shared with others to run the same configuration on their dataset.
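
A reproducible invocation combining these flags can be composed as shown below. This is only a sketch: reproducible_config_command is a hypothetical helper, my_experiment is a placeholder configuration name, and the seed value 42 is arbitrary.

```python
import shlex


def reproducible_config_command(config_name, seed, overrides=()):
    """Compose `nhssynth config -c <name>` with --seed and --save-config."""
    return [
        "nhssynth", "config",
        "-c", config_name,
        "--seed", str(seed),
        "--save-config",
        *overrides,  # any extra command-line overrides, passed through as-is
    ]


cmd = reproducible_config_command("my_experiment", seed=42)
print(shlex.join(cmd))
# prints: nhssynth config -c my_experiment --seed 42 --save-config
```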

Python API

Alternatively, you may want to import parts of the package into your own project or notebook. There is a minimal working example of this in the auxiliary folder. You can learn more about the API and the structure of the package and its modules in the docs, and reuse components as you see fit.

Package Structure

The figure below shows the structure and workflow of the package and its modules.

View a visualisation of the codebase here!

Roadmap

See the open issues for a list of proposed features (and known bugs). Our milestones represent longer term goals for the project.

Contributing

Contributions are welcome! We encourage you to first raise an issue with your proposed contribution to enable discussion with the maintainers. After that, please follow these steps:

  1. Fork the project
  2. Create your branch (git checkout -b <yourusername>/<featurename>)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin <yourusername>/<featurename>)
  5. Open a PR and we will try to get it merged!

See CONTRIBUTING.md for detailed guidance.

Thanks to everyone who has contributed so far!

This codebase builds on previous NHSX Analytics Unit PhD internships contextualising and investigating the potential use of Variational Auto Encoders (VAEs) for synthetic data generation. These were undertaken by Dominic Danks and David Brind.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

This project is under active development by @HarrisonWilde. For feature requests and bugs, please raise an issue; for security concerns, please open a draft security advisory. Alternatively, contact NHS England TDAU.

To find out more about the Analytics Unit visit our project website or get in touch at england.tdau@nhs.net.