Python version support: CPython 2.6, 2.7, 3.2, 3.3, 3.4 and PyPy.
How it works
killdupes scans your filesystem to find duplicate files, partial files
and empty files.
It performs an n-to-n comparison of files using MD5 hashing and heavy use
of dictionaries. Invoke it with a wildcard pattern, or with an input file
listing the file names to check.
- Scan all files and find the smallest one.
- Read read_size bytes from every file into a record, where read_size equals
  the remaining size of the smallest file, capped at CHUNK_SIZE.
- Hash all records and use the hashes as keys into buckets.
- Files in the same bucket are known to be equal up to this offset.
- Continue until at least two files remain that are still equal at all offsets.
- Equal files are either duplicates (if they are the same size), or one is a
  partial copy of the other (if not).
Memory consumption should not exceed files_in_bucket * read_size.
The algorithm adapts to file changes: it reads every file until EOF,
regardless of the file size recorded at startup.
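The bucketing scheme above can be sketched for the exact-duplicate case as
follows. This is a simplified illustration, not killdupes' actual code: names
like CHUNK_SIZE and duplicate_groups are assumptions, files are pre-grouped by
size, and partial-file and empty-file detection are omitted.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 65536  # cap on bytes read per file per round (assumed value)

def duplicate_groups(paths):
    """Find groups of byte-identical files via chunked MD5 bucketing."""
    # Only same-size files can be exact duplicates, so bucket by size first.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    results = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a file alone in its bucket has nothing to match
        buckets = [group]
        offset = 0
        while offset < size and buckets:
            read_size = min(size - offset, CHUNK_SIZE)
            refined = []
            for bucket in buckets:
                # Hash this chunk of each file; equal hashes share a bucket.
                by_hash = defaultdict(list)
                for p in bucket:
                    with open(p, "rb") as f:
                        f.seek(offset)
                        data = f.read(read_size)
                    by_hash[hashlib.md5(data).hexdigest()].append(p)
                # Only buckets holding at least two files can still match.
                refined.extend(b for b in by_hash.values() if len(b) >= 2)
            buckets = refined
            offset += read_size
        results.extend(buckets)
    return results
```

Because a bucket is dropped as soon as it shrinks to one file, files that
differ early are never read to the end, which keeps I/O proportional to how
long the files actually stay equal.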
$ pip install killdupes
$ killdupes.py 'tests/samples/*'
Empty files:
X  0.0 B  14.03.14 17:39:48  tests/samples/empty
Incompletes:
=  2.0 B  14.03.14 18:17:43  tests/samples/full
X  1.0 B  14.03.14 18:17:26  tests/samples/partial
Duplicates:
=  2.0 B  14.03.14 18:17:43  tests/samples/full
X  2.0 B  14.03.14 18:17:37  tests/samples/full2
Kill files? (all/empty/incompletes/duplicates) [a/e/i/d/N]
If there are many files to scan, it displays a progress dashboard while
working:
176.1 KB | Offs 0.0 B | Buck 1/1 | File 193868/600084 | Rs 1.0 B
The dashboard fields:
- Total bytes read
- Current read offset
- Current bucket / total number of buckets
- Current file / total files in this bucket
- Read size at this offset
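The human-readable sizes in the dashboard ("176.1 KB", "2.0 B") can be
produced by a helper along these lines. The function name and exact unit
ladder are assumptions for illustration, not killdupes' actual formatter.

```python
def format_size(num_bytes):
    """Render a byte count like the dashboard does, e.g. '176.1 KB'."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value under 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            return "%.1f %s" % (size, unit)
        size /= 1024
```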