leb.io/dedup

dedup scans files or directories and calculates fingerprint hashes for them based on their contents.

Originally written on a plane from SFO->EWR on 7-23-15 in about an hour. Based on an idea I had been mulling over for years.

Without the -d (directory) switch, dedup recursively scans the supplied directories and files in depth-first order and records each file in a map of slices keyed by the file's hash. After the scan is complete, the resulting map is iterated, and if any slice has a length of more than 1, the files on that slice are all duplicates of each other.

If the -d switch is supplied, the hashes of the files in each directory are themselves recursively hashed, and the resulting fingerprints for each directory (but not the files) are recorded in the map. Again, if the length of any slice is more than 1, the entire directory is duplicated. The -d switch works with more than two directories, but sometimes not as well.

If the -r switch is supplied, the sense of the program is reversed and files or directories that ARE NOT duplicated are printed. When the map is scanned, any slice with a length different from the number of supplied directories is printed, as these represent missing files. This allows directories to be compared easily, including more than two at a time. Even cooler, the program works even if files or directories have been renamed.

Without a switch to print, no output is generated. The -p switch prints the pathnames of duplicated or missing files or directories. The -ps switch prints a summary of the number of files or directories that were duplicated and how much space they take up. The F, S, H, L, and N switches print the fingerprint, size, human-readable size, hash chain length, and number of roots respectively.

Examples

    % dedup -p ~/Desktop
    % dedup -d -p dir1 dir2
    % dedup -d -r -p dir1 dir2

The hash used is the aeshash from the Go runtime. It's fast and passes SMHasher.
The map of slices is not the most memory-efficient representation, and at some point it probably makes sense to switch to a cuckoo hash table.


License
MIT
Install
go get leb.io/dedup