BTRFS Deduplication tool
A deduplication tool similar to bedup. I wrote it some time ago because bedup had problems with my volume and its number of snapshots (crashes, database corruption, etc.).
Btrdedup uses far fewer resources, especially when there are many snapshots. The limitation is that it only deduplicates files that start with the same content. Because it inspects the fragmentation before offering files to the kernel for deduplication (using the btrfs deduplication ioctl), data that is already shared is not deduplicated again.
Btrdedup does not maintain state between runs. This makes it less suitable for incremental deduplication. On the other hand, it makes the tool very robust, and because it detects already deduplicated files efficiently, it can easily be scheduled to run, for example, once a month.
Download the latest release:
Make executable using:
chmod +x btrdedup
Typically you want to run the program as root on the complete mounted btrfs pool with a command like this:
nice -n 10 ./btrdedup /mnt 2>dedup.log
Or, to run it in the background and also capture standard output:
nice -n 10 ./btrdedup /mnt >dedup.out 2>dedup.log &
The scanning phase may still take a long time, depending on the number of files. The most expensive part, however, the deduplication itself, is only invoked when necessary.
Btrdedup is very memory efficient and doesn't require a database. It can be instructed to use even less memory by providing the -lowmem option. This may take a few more minutes, but it may also be faster because of reduced memory management. Future versions might default to this option.
Run btrdedup -h for the full list of options.
Under the hood
Btrdedup works by first reading the file tree(s) into memory in an efficient data structure. It then processes these files in three passes:
Pass 1: Read the fragmentation table for each file.
Sort the result on the offset of the first block
Pass 2: Calculate the hash of the first block of each file. Because the files are sorted on the first block offset, any block is only loaded and hashed once.
Sort the result on the hash of the first block
Pass 3: Files that have the first block in common are offered for deduplication. The deduplication phase first checks whether blocks are already shared, so that data is only offered for actual deduplication when necessary.
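The three passes can be sketched roughly as follows. This is a minimal in-memory illustration, not Btrdedup's actual code: the function name `find_candidates`, the `read_block` callback, the `(path, first_block_offset)` tuples and the choice of SHA-1 are all assumptions made for the example.

```python
import hashlib
from itertools import groupby


def find_candidates(files, read_block):
    """Group files whose first block has identical content.

    files      -- list of (path, first_block_offset) tuples (hypothetical
                  stand-in for the fragmentation info read in pass 1)
    read_block -- callable taking an offset and returning the block's bytes
    """
    # Result of pass 1, sorted on the offset of the first block.
    by_offset = sorted(files, key=lambda f: f[1])

    # Pass 2: files that share a first block are adjacent after sorting,
    # so each distinct block is read and hashed only once.
    hashed = []
    for offset, group in groupby(by_offset, key=lambda f: f[1]):
        digest = hashlib.sha1(read_block(offset)).hexdigest()
        hashed.extend((digest, path) for path, _ in group)

    # Sort on the hash so files with an identical first block are adjacent.
    hashed.sort()

    # Pass 3: any group of two or more files is a deduplication candidate.
    candidates = []
    for _, group in groupby(hashed, key=lambda h: h[0]):
        paths = [path for _, path in group]
        if len(paths) > 1:
            candidates.append(paths)
    return candidates
```

In this sketch a file and its snapshot copy report the same first block offset, so the block is read a single time no matter how many snapshots reference it, which matches the reason given above for sorting on the offset first.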
The last pass is still to be improved. Currently only the prefixes of files are deduplicated. As soon as blocks of files differ, the deduplication assumes that the remainder of the files does not share blocks.
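The prefix-only behaviour, combined with the check for blocks that are already shared, can be illustrated with a small sketch. The per-block `(physical_offset, content_hash)` representation and the function name `plan_prefix_dedup` are hypothetical and chosen only for this example; the real tool works with the fragmentation table and the kernel ioctl.

```python
def plan_prefix_dedup(src_blocks, dst_blocks):
    """Return the block indexes that would be offered for deduplication.

    Each argument is a list with one (physical_offset, content_hash) tuple
    per block of the file (a hypothetical representation for this sketch).
    """
    to_offer = []
    for index, (src, dst) in enumerate(zip(src_blocks, dst_blocks)):
        src_offset, src_hash = src
        dst_offset, dst_hash = dst
        if src_hash != dst_hash:
            # First differing block: the remainder of the files is
            # assumed not to share any blocks, so stop here.
            break
        if src_offset != dst_offset:
            # Same content stored at different physical locations:
            # not shared yet, so it is worth offering.
            to_offer.append(index)
        # Equal offsets mean the block is already shared; skip it.
    return to_offer
```

Note that a matching block after the first difference is never offered in this sketch, which is the limitation described above.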
In lowmem mode, the output of each pass is written to an encoded temporary text file which is then sorted using the