HardLink/Deduplication Backups with Python


Keywords: Backup, Hardlink, Windows, Linux
License: GPL-3.0
Install: pip install pyhardlinkbackup==0.13.0

Documentation

pyhardlinkbackup

Hardlink/Deduplication Backups with Python.

  • Backups should be saved as normal files in the filesystem:
    • accessible without any extra software or extra meta files
    • non-proprietary format
  • Create backups with versioning
    • every backup run creates a complete filesystem snapshot tree
    • every snapshot tree can be deleted, without affecting the other snapshots
  • Deduplication with hardlinks:
    • store only changed files; all other files are added via hardlinks
    • find duplicate files everywhere (even if files were renamed or moved)
  • usable under Windows and Linux
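
The hardlink idea from the list above can be sketched in a few lines. This is only an illustration, not the pyhardlinkbackup implementation: the function names, the in-memory `known` index and the choice of SHA-512 are assumptions made for the example.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def file_hash(path, chunk_size=1024 * 1024):
    """Return the SHA-512 hex digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(src, dst, known):
    """Copy src to dst, or hardlink dst to a stored file with the same content.

    `known` maps content hash -> path of an already stored file
    (a hypothetical in-memory stand-in for a real index).
    Returns True if a hardlink was created, False if the file was copied.
    """
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    key = file_hash(src)
    if key in known:
        os.link(known[key], dst)  # new directory entry, same data on disk
        return True
    shutil.copy2(src, dst)  # copy content and timestamps
    known[key] = dst
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "src").mkdir()
        (tmp / "src" / "brief.odt").write_text("hello")
        known = {}
        # First run copies the file, second run only links it (even under a new name).
        print(store(tmp / "src" / "brief.odt", tmp / "run1" / "brief.odt", known))
        print(store(tmp / "src" / "brief.odt", tmp / "run2" / "letter.odt", known))
```

Because the lookup key is the content hash and not the file name, a renamed or moved file is still recognized as a duplicate.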

Requirement: Python 3.6 or newer.

Please: try, fork and contribute! ;)

  • Build status on GitHub: github.com/jedie/pyhardlinkbackup/actions
  • Build status on travis-ci.org: travis-ci.org/jedie/pyhardlinkbackup
  • Build status on appveyor.com: ci.appveyor.com/project/jedie/pyhardlinkbackup
  • Coverage status on coveralls.io: coveralls.io/r/jedie/pyhardlinkbackup
  • Requirements status on requires.io: requires.io/github/jedie/pyhardlinkbackup/requirements/

Example

$ phlb backup ~/my/important/documents
...start backup, some time later...
$ phlb backup ~/my/important/documents
...

This will create deduplication backups like this:

~/pyhardlinkbackups
  └── documents
      ├── 2016-01-07-085247
      │   ├── phlb_config.ini
      │   ├── spreadsheet.ods
      │   ├── brief.odt
      │   └── important_files.ext
      └── 2016-01-07-102310
          ├── phlb_config.ini
          ├── spreadsheet.ods
          ├── brief.odt
          └── important_files.ext
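
To see which files two snapshot trees actually share on disk, a small check with the standard library is enough. This is a sketch that is independent of phlb; the function name `hardlinked_files` is made up for the example.

```python
import os
import tempfile
from pathlib import Path


def hardlinked_files(snapshot_a, snapshot_b):
    """Return relative paths present in both trees that are the same file on disk."""
    snapshot_a, snapshot_b = Path(snapshot_a), Path(snapshot_b)
    shared = []
    for path_a in snapshot_a.rglob("*"):
        if not path_a.is_file():
            continue
        rel = path_a.relative_to(snapshot_a)
        path_b = snapshot_b / rel
        # samefile() compares device and inode, so it is True only for hardlinks
        if path_b.is_file() and os.path.samefile(path_a, path_b):
            shared.append(rel.as_posix())
    return sorted(shared)
```

Called with the two snapshot directories from the tree above, it lists the files that were not stored a second time.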

Installation

Windows

  1. install Python 3: https://www.python.org/downloads/
  2. Download the file boot_pyhardlinkbackup.cmd
  3. call boot_pyhardlinkbackup.cmd as admin (Right-click and use Run as administrator)

If everything works fine, you will get a venv here: %ProgramFiles%\PyHardLinkBackup

After the venv is created, call these scripts to finalize the setup:

  1. %ProgramFiles%\PyHardLinkBackup\phlb_edit_config.cmd - create a config .ini file
  2. %ProgramFiles%\PyHardLinkBackup\phlb_migrate_database.cmd - create database tables

To upgrade pyhardlinkbackup, call:

  1. %ProgramFiles%\PyHardLinkBackup\phlb_upgrade_pyhardlinkbackup.cmd

To start the Django webserver, call:

  1. %ProgramFiles%\PyHardLinkBackup\phlb_run_django_webserver.cmd

Linux

  1. Download the file boot_pyhardlinkbackup.sh
  2. call boot_pyhardlinkbackup.sh

If everything works fine, you will get a venv here: ~/PyHardLinkBackup

After the venv is created, call these scripts to finalize the setup:

  • ~/PyHardLinkBackup/phlb_edit_config.sh - create a config .ini file
  • ~/PyHardLinkBackup/phlb_migrate_database.sh - create database tables

To upgrade pyhardlinkbackup, call:

  • ~/PyHardLinkBackup/phlb_upgrade_pyhardlinkbackup.sh

To start the Django webserver, call:

  • ~/PyHardLinkBackup/phlb_run_django_webserver.sh

Starting a backup run

To start a backup run, use this helper script:

  • Windows batch: %ProgramFiles%\PyHardLinkBackup\pyhardlinkbackup_this_directory.cmd
  • Linux shell script: ~/PyHardLinkBackup/pyhardlinkbackup_this_directory.sh

Copy this file to a location that should be backed up and just call it to run a backup.

Verifying an existing backup

$ cd pyhardlinkbackup/
~/PyHardLinkBackup $ source bin/activate

(PyHardLinkBackup) ~/PyHardLinkBackup $ phlb verify --fast ~/PyHardLinkBackups/documents/2016-01-07-102310

With --fast, the files' contents are not checked. Without it, the hashes of the files' contents are calculated and compared. Every file must then be read completely from the filesystem, so the run takes some time.

A verify run performs these checks:

  • Do all files in the backup exist?
  • Do the file sizes match?
  • Do the hashes from the hash-files match?
  • Do the files' modification timestamps match?
  • Do the hashes calculated from the files' contents match? (skipped if --fast is used)
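
The checks above can be sketched for a single file as follows. This is not the phlb code: the `.sha512` hash-file suffix, the `expected` record (a stand-in for whatever the database stores) and the function names are assumptions for illustration only.

```python
import hashlib
import os
import tempfile
from pathlib import Path

HASH_SUFFIX = ".sha512"  # hypothetical hash-file naming, not taken from phlb


def content_hash(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks and return its SHA-512 hex digest."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected, fast=False):
    """Return a list of problems found for one backup file.

    `expected` is a dict with the keys "size", "mtime_ns" and "hash"
    (a hypothetical record of what was stored at backup time).
    """
    path = Path(path)
    if not path.is_file():
        return ["missing file"]
    problems = []
    stat = path.stat()
    if stat.st_size != expected["size"]:
        problems.append("size mismatch")
    hash_file = path.with_name(path.name + HASH_SUFFIX)
    if not hash_file.is_file() or hash_file.read_text().strip() != expected["hash"]:
        problems.append("hash-file mismatch")
    if stat.st_mtime_ns != expected["mtime_ns"]:
        problems.append("mtime mismatch")
    # The expensive step: only here is the file content actually read.
    if not fast and content_hash(path) != expected["hash"]:
        problems.append("content hash mismatch")
    return problems
```

The sketch also shows the trade-off of a fast run: a file whose content changed while size, timestamp and hash-file stayed the same passes with `fast=True` and is only caught by the full content check.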

Configuration

phlb will use a configuration file named: PyHardLinkBackup.ini

Search order is:

  1. current directory, then every parent directory up to the filesystem root
  2. user directory

E.g. if the current working directory is /foo/bar/my_files/ then the search path will be:

  • /foo/bar/my_files/PyHardLinkBackup.ini
  • /foo/bar/PyHardLinkBackup.ini
  • /foo/PyHardLinkBackup.ini
  • /PyHardLinkBackup.ini
  • ~/PyHardLinkBackup.ini (the user home directory under Windows/Linux)
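
The search order can be reproduced with pathlib. This is a minimal sketch of the lookup described above, not the code phlb uses; `config_candidates` and `find_config` are names chosen for the example.

```python
from pathlib import Path

CONFIG_NAME = "PyHardLinkBackup.ini"


def config_candidates(cwd, home):
    """Return the .ini paths to try, in search order."""
    cwd = Path(cwd)
    # cwd first, then each parent up to the filesystem root
    candidates = [directory / CONFIG_NAME for directory in (cwd, *cwd.parents)]
    # the user home directory is the last fallback
    candidates.append(Path(home) / CONFIG_NAME)
    return candidates


def find_config(cwd=None, home=None):
    """Return the first existing config file, or None if there is none."""
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    for candidate in config_candidates(cwd, home):
        if candidate.is_file():
            return candidate
    return None
```

With this order, an .ini placed in a project directory takes precedence over the one in the home directory.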

Create / edit default .ini

To open the .ini file from the user directory in an editor, run:

(PyHardLinkBackup) ~/PyHardLinkBackup $ phlb config

The defaults are stored here: /phlb/config_defaults.ini

Excluding files/folders from backup:

There are two ways to exclude files/folders from your backup. Use the following settings in your PyHardLinkBackup.ini:

# Directory names that will be recursively excluded from backups (comma separated list!)
SKIP_DIRS= __pycache__, temp

# glob-style patterns to exclude files/folders from backups (used with Path.match(), Comma separated list!)
SKIP_PATTERNS= *.pyc, *.tmp, *.cache

The filesystem scan is divided into two steps:

  1. Just scan the filesystem tree
  2. Filter and load meta data for every directory item

SKIP_DIRS is used in the first step. SKIP_PATTERNS is used in the second step.
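
A rough sketch of such a two-step scan, using the example values from above. It is a simplification and not the phlb scanner: the function names are made up, and only files (not directories) are passed on to the second step.

```python
import os
import tempfile
from pathlib import Path


def parse_list(value):
    """Split a comma separated .ini value into a list of stripped items."""
    return [item.strip() for item in value.split(",") if item.strip()]


SKIP_DIRS = parse_list("__pycache__, temp")
SKIP_PATTERNS = parse_list("*.pyc, *.tmp, *.cache")


def scan_tree(root, skip_dirs=SKIP_DIRS):
    """Step 1: walk the tree; skipped directories are never entered."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [name for name in dirnames if name not in skip_dirs]
        for name in filenames:
            yield Path(dirpath, name)


def filter_items(paths, skip_patterns=SKIP_PATTERNS):
    """Step 2: drop items matching a glob pattern, load meta data for the rest."""
    for path in paths:
        if any(path.match(pattern) for pattern in skip_patterns):
            continue
        yield path, path.stat()
```

Pruning by directory name in the first step is cheap because nothing below a skipped directory is ever listed, while the pattern check runs once per remaining item.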

Upgrading pyhardlinkbackup

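To upgrade to a new version, just start the upgrade helper script:

  • Windows: %ProgramFiles%\PyHardLinkBackup\phlb_upgrade_pyhardlinkbackup.cmd
  • Linux: ~/PyHardLinkBackup/phlb_upgrade_pyhardlinkbackup.sh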

Some notes

What is 'phlb' and 'manage' ?!?

phlb is a CLI.

manage is similar to a normal Django manage.py, but it always uses the pyhardlinkbackup settings.

Why in hell do you use Django?!?

  • Well, just because of the great database ORM and the Admin Site. ;)

How to go into the Django admin?

Just start:

  • Windows: phlb_run_django_webserver.cmd
  • Linux: phlb_run_django_webserver.sh

Then open 'localhost' in your browser. (Note: --noreload is needed for Windows with venv!)

Running the unit tests

Just start: phlb_run_tests.cmd / phlb_run_tests.sh or do this:

$ cd pyhardlinkbackup/
~/PyHardLinkBackup $ source bin/activate
(PyHardLinkBackup) ~/PyHardLinkBackup $ manage test

Using the CLI

$ cd pyhardlinkbackup/
~/PyHardLinkBackup $ source bin/activate
(PyHardLinkBackup) ~/PyHardLinkBackup $ phlb --help
Usage: phlb [OPTIONS] COMMAND [ARGS]...

  pyhardlinkbackup

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  add     Scan all existing backup and add missing ones...
  backup  Start a Backup run
  config  Create/edit .ini config file
  helper  link helper files to given path
  verify  Verify a existing backup

Add missing backups to the database

phlb add can be used in different scenarios:

  • recreate the database
  • add a backup manually

phlb add does this:

  • scan the complete file tree under BACKUP_PATH (default: ~/PyHardLinkBackups)
  • recreate all hash files
  • add all files to database
  • deduplicate with hardlinks, if possible

So it's possible to recreate the complete database:

  • delete the current .sqlite file
  • run phlb add to recreate

Another scenario is e.g.:

  • DSLR images are stored on a network drive.
  • You already have a copy of all files locally.
  • You would like to add the local copy to pyhardlinkbackup.

Do the following steps:

  • move the local files to a subdirectory below BACKUP_PATH
  • e.g.: ~/PyHardLinkBackups/pictures/2015-12-29-000015/
  • Note: the date format in the subdirectory name must match the SUB_DIR_FORMATTER in your config
  • run: phlb add
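
Before running phlb add, it can help to check that a manually created subdirectory name fits the configured format. The sketch below assumes that SUB_DIR_FORMATTER is a strftime-style pattern and that "%Y-%m-%d-%H%M%S" is the value in use (it matches the example name above); check your own .ini for the actual value. The function name is made up for the example.

```python
from datetime import datetime

# Assumed value; use the SUB_DIR_FORMATTER from your own PyHardLinkBackup.ini.
SUB_DIR_FORMATTER = "%Y-%m-%d-%H%M%S"


def matches_formatter(dir_name, formatter=SUB_DIR_FORMATTER):
    """Return True if dir_name is exactly what the formatter would produce."""
    try:
        parsed = datetime.strptime(dir_name, formatter)
    except ValueError:
        return False
    # Round trip to reject names that parse but are not zero-padded as expected.
    return parsed.strftime(formatter) == dir_name
```

For example, `matches_formatter("2015-12-29-000015")` accepts the directory name used above, while a name such as "2015-12-29" is rejected.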

Now you can run phlb backup from your network drive to make a new, up-to-date backup.

Windows Development

Some notes about setting up a development environment on Windows: /dev/WindowsDevelopment.creole

Alternative solutions

See also: https://github.com/restic/others#list-of-backup-software

History

  • dev - compare v0.13.0...master
    • TBC
  • 18.03.2020 - v0.13.0 - compare v0.12.3...v0.13.0
    • Dynamic chunk size
    • replace CHUNK_SIZE in PyHardLinkBackup.ini with MIN_CHUNK_SIZE
    • Fix misleading error msg for dst OSError, bad exception handling #23
    • Fix "run django server doesn't work" #39
    • Fix "add" command
    • Deactivate PyPy tests on Travis CI
  • 17.03.2020 - v0.12.3 - compare v0.12.2...v0.12.3
    • Fix #44 - wrong file size in progress bar
    • use pytest-randomly
    • update requirements
  • 06.03.2020 - v0.12.2 - compare v0.12.1...v0.12.2
  • 05.03.2020 - v0.12.1 - compare v0.12.0...v0.12.1
    • revert lowercase PyHardLinkBackup for environment destination and default backup directory.
  • 05.03.2020 - v0.12.0 - compare v0.11.0...v0.12.0
    • Refactor backup process: Use https://github.com/jedie/IterFilesystem for less RAM usage and faster start on big source trees
    • modernized project/sources:
      • Update to Django v2.2.x LTS
      • use poetry
      • add Makefile
      • run tests with pytest and tox
      • run tests only with python 3.8, 3.7, 3.6
      • run tests on github, too
      • remove support for python 3.5 (os.scandir fallback removed)
    • NOTE: Windows support is not tested, yet! (Help wanted)
  • 03.03.2019 - v0.11.0 - compare v0.10.1...v0.11.0
    • Update boot files
    • Use django v1.11.x
  • 09.09.2016 - v0.10.1 - compare v0.10.0...v0.10.1
  • 26.04.2016 - v0.10.0 - compare v0.9.1...v0.10.0
    • move under Windows the install location from %APPDATA%\PyHardLinkBackup to %ProgramFiles%\PyHardLinkBackup
    • to 'migrate': Just delete %APPDATA%\PyHardLinkBackup and install via boot_pyhardlinkbackup.cmd
  • 26.04.2016 - v0.9.1 - compare v0.9.0...v0.9.1
    • run migrate database in boot process
    • Add missing migrate scripts
  • 10.02.2016 - v0.9.0 - compare v0.8.0...v0.9.0
  • 04.02.2016 - v0.8.0 - compare v0.7.0...v0.8.0
    • New: add all missing backups to database: phlb add (see above)
  • 03.02.2016 - v0.7.0 - compare v0.6.4...v0.7.0
    • New: verify an existing backup
    • IMPORTANT: run database migration is needed!
  • 01.02.2016 - v0.6.4 - compare v0.6.2...v0.6.4
    • Windows: Bugfix temp rename error, because of the Windows API limitation, see: #13
    • Linux: Bugfix scanner if symlink is broken
    • Display local variables on low level errors
  • 29.01.2016 - v0.6.3 - compare v0.6.2...v0.6.3
    • Less verbose and better information about SKIP_DIRS/SKIP_PATTERNS hits
  • 28.01.2016 - v0.6.2 - compare v0.6.1...v0.6.2
    • Handle unexpected errors and continue backup with the next file
    • Better handle interrupt key during execution
  • 28.01.2016 - v0.6.1 - compare v0.6.0...v0.6.1
    • Bugfix #13 by using a better temp rename routine
  • 28.01.2016 - v0.6.0 - compare v0.5.1...v0.6.0
    • New: faster backup by comparing only mtime/size if old backup files exist
  • 27.01.2016 - v0.5.1 - compare v0.5.0...v0.5.1
    • IMPORTANT: run database migration is needed!
    • New .ini setting: LANGUAGE_CODE to change the translation
    • mark whether a backup run finished completely
    • Display information of last backup run
    • Add more information into summary file
  • 27.01.2016 - v0.5.0 - compare v0.4.2...v0.5.0
    • Refactor source tree scan: split into two passes.
    • CHANGE SKIP_FILES in .ini config to: SKIP_PATTERNS
    • Backup from newest files to oldest files.
    • Fix #10:
      • New --name cli option (optional) to force a backup name.
      • Display error message if backup name can't be found (e.g.: backup a root folder)
  • 22.01.2016 - v0.4.2 - compare v0.4.1...v0.4.2
  • 22.01.2016 - v0.4.1 - compare v0.4.0...v0.4.1
    • Skip files that can't be read/written (and try to back up the remaining files)
  • 21.01.2016 - v0.4.0 - compare v0.3.1...v0.4.0
    • Search for PyHardLinkBackup.ini file in every parent directory from the current working dir
    • increase default chunk size to 20MB
    • save summary and log file for every backup run
  • 15.01.2016 - v0.3.1 - compare v0.3.0...v0.3.1
    • fix unittest run under windows
  • 15.01.2016 - v0.3.0 - compare v0.2.0...v0.3.0
    • database migration needed
    • Add 'no_link_source' to database (e.g. Skip source, if 1024 links created under windows)
  • 14.01.2016 - v0.2.0 - compare v0.1.8...v0.2.0
    • good unittests coverage that covers the backup process
  • 08.01.2016 - v0.1.8 - compare v0.1.0alpha0...v0.1.8
    • installable and runnable under Windows
  • 06.01.2016 - v0.1.0alpha0 - d42a5c5
    • first release on PyPI
  • 29.12.2015 - commit 2ce43
    • commit 'Proof of concept'
