shttpfs

Client/server HTTP File Sync Utility


Keywords
binary-data, file-sync, file-synchronization, sync, version-control
License
MIT
Install
pip install shttpfs

Documentation

This is a client-server based file synchronisation tool inspired by Subversion, but considerably simpler in its implementation. I work with hundreds of gigabytes of binary files, mostly camera raw files, and require a means of synchronising these across multiple computers. I have tried numerous tools over the past few years, and Subversion is the only one with the required features that has worked dependably. Unfortunately, Subversion always stores two copies of every file in every checkout for delta compression. When dealing with large volumes of binary files this greatly inflates the storage required. As I mostly add files and rarely change them, delta compression offers little value in my use case. Finally, due to limited outbound bandwidth combined with the large volume of data, I required a tool that is hosted exclusively on my LAN.

* HTTP(S) based sync protocol.

Uses a simple HTTP(S) based sync protocol with public key authentication. The server is implemented using the Flask framework.
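
As a rough illustration of what public key authentication over HTTP can look like, here is a minimal Flask sketch that verifies a signed request body with PyNaCl. The endpoint name, header names and key handling are assumptions made for this example, not shttpfs's actual wire format.

    # server_sketch.py -- illustrative only, not the shttpfs implementation
    import binascii
    from flask import Flask, request, abort
    import nacl.signing
    import nacl.exceptions

    app = Flask(__name__)

    # In a real deployment the client's public key would come from server-side
    # configuration; this hex string is a placeholder.
    AUTHORISED_KEYS = {
        "laptop": nacl.signing.VerifyKey(binascii.unhexlify("aa" * 32)),
    }

    @app.route("/push_file", methods=["POST"])   # hypothetical endpoint name
    def push_file():
        client_id = request.headers.get("X-Client-Id", "")
        signature = binascii.unhexlify(request.headers.get("X-Signature", ""))
        body = request.get_data()

        verify_key = AUTHORISED_KEYS.get(client_id)
        if verify_key is None:
            abort(401)
        try:
            # Accept the request only if the body was signed with the
            # client's private key.
            verify_key.verify(body, signature)
        except nacl.exceptions.BadSignatureError:
            abort(401)

        # ... store the uploaded data and update the server manifest here ...
        return "ok"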


* All files, both on the client and the server, are stored as plain files with their original names.

* Stores limited version history on the server only.

Has limited support for file versioning. As above, all prior file versions are stored as-is in the file system, so they can be easily accessed with any file browser. However, this is not a version control system; file merging, for instance, is not supported. It is generally not possible to usefully merge binary files anyway.


* No web/graphical UI

Designed to perform a single function only, it provides a minimal command line interface.

* No database dependency

Stores file manifest information as regular JSON.
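
As an illustration of the idea, the sketch below builds a manifest of file hashes and modification times and writes it out as plain JSON. The field names and layout are assumptions for this example, not shttpfs's actual manifest format.

    # manifest_sketch.py -- hypothetical manifest layout; field names assumed
    import hashlib
    import json
    import os
    import time

    def hash_file(path):
        """Content hash used to detect changed files between syncs."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(root):
        """Walk a checkout and record every file's hash and modification time."""
        files = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root)
                files[rel] = {"hash": hash_file(path),
                              "mtime": os.path.getmtime(path)}
        return {"generated": time.time(), "files": files}

    with open("manifest.json", "w") as f:
        json.dump(build_manifest("."), f, indent=4)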


* Atomic file system operations through journaling. 

File system operations are journaled and committed atomically in stable states. Errors trigger automatic rollback to the previous good state.
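
The sketch below shows the general journal-then-commit pattern this describes: intended operations are written to a journal before being applied, and applied steps are undone in reverse order if anything fails. It is a simplified illustration, not shttpfs's actual journaling code.

    # journal_sketch.py -- a simplified journal-and-rollback pattern
    import json
    import os
    import shutil

    JOURNAL = "journal.json"   # hypothetical journal file name

    def apply_with_journal(operations):
        """Write the intended operations to a journal, apply them, and roll
        back in reverse order if any step fails. Each operation is a
        ('move', src, dst) tuple in this simplified sketch."""
        with open(JOURNAL, "w") as f:
            json.dump(operations, f)          # record intent before acting

        done = []
        try:
            for op, src, dst in operations:
                if op == "move":
                    shutil.move(src, dst)
                done.append((op, src, dst))
            os.remove(JOURNAL)                # commit: the new state is stable
        except Exception:
            for op, src, dst in reversed(done):
                if op == "move":
                    shutil.move(dst, src)     # undo, restoring the previous state
            os.remove(JOURNAL)
            raise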


* Supports partial checkouts 

As I don't want the entire data set on every machine, partial checkout support was essential. Currently this is implemented using a pull-ignore system: create a file '.pull_ignore' in the checkout root and list any files and directories which are not required; Unix wildcards are supported.
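
The sketch below shows how such wildcard filtering can be done in Python with fnmatch, which implements Unix-style wildcards. The exact semantics assumed here (one pattern per line, matched against checkout-relative paths) are illustrative rather than a specification of shttpfs's behaviour.

    # pull_ignore_sketch.py -- wildcard filtering of a partial checkout
    import fnmatch

    def load_pull_ignore(path=".pull_ignore"):
        """Read one pattern per line, skipping blank lines."""
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    def is_ignored(relative_path, patterns):
        """True if the path matches any Unix-style wildcard pattern."""
        return any(fnmatch.fnmatch(relative_path, pattern) for pattern in patterns)

    patterns = load_pull_ignore()
    for path in ("raw/2019/IMG_0001.CR2", "exports/web/photo.jpg"):
        print(path, "->", "ignored" if is_ignored(path, patterns) else "pulled")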



Notice: This has only been tested on Linux ext4; running on other operating systems or filesystems could cause undefined behaviour.

---------------------------------------------------

Following is a list of tools I have tried up to this point; every one fails in some regard:

- Rsync:
Awesome for what it was designed for, and I use it extensively for other things. However it's only able to do one-way syncing and can't detect conflicts.

- Unison:
Similar to rsync but able to do two-way syncing and conflict detection. I found it awkward to use for syncing more than two endpoints, and change detection sometimes failed and wanted to re-sync the entire file tree, which can take several hours.

- Subversion:
By far the best of anything I've used. Handles large binary files without issue. However, it always stores two copies of every file on the client in order to do delta compression. As my data set is already large I always had to be conscious of this when adding files. I eventually gave up on it for this reason.

- Git, Mercurial, Bazaar:
These do not support partial checkouts, and having the entire history locally is too much overhead. Git in particular does not work well with large binary files and can use an awful lot of memory. These tools work fine for managing source code or other textual content, and I do use Git extensively for other things.

- Syncthing:
Tries to be clever by using broadcast auto-detection rather than simply letting you input an IP or hostname. It could not find the other client on my network, and the UI offered no obvious manual override.

- Owncloud:
The web UI is completely unnecessary for my needs. It repeatedly stopped syncing when adding my file tree for the first time, and provided no useful error messages as to why.

- Seafile:
The web UI is completely unnecessary for my needs. It looked promising to begin with, and accepted my files without issue at first. However, like Owncloud, the client started to randomly desynchronise from the server and required me to manually re-add it. Once again, it provided no useful error messages.

- Git-annex:
A large file manager built on top of git and following its distributed model. It is very complicated. The Android client, while pulling files down, stopped entirely after running into a FAT-incompatible file name and provided no obvious way to continue. All-or-nothing failure is unacceptable to me. Like the two above, the web UI gives little indication of what it is doing; getting error messages required bypassing the web UI and looking at the command line.