find file duplicates on very large volume

Peter Frings peter.frings at agfa.com
Fri Jul 18 00:23:52 PDT 2003


> From: "R. Welz" <r_welz at gmx.de>
> Date: Wed, 16 Jul 2003 21:35:32 +0200
>
> My question is: When I have the HD with all the CDRs on it, is there a
> program which can handle that and find duplicates? I mean not only duplicate
> names but duplicate content (in text files, for example). And with that
> amount of content. And in reasonable time. Would SpringCleaning do the
> job? How long would it take?

You can do it with the help of two scripts (run from Terminal). The first one
(finddup) steps through all files and calculates a checksum for each. It
pipes the file names and their checksums to the second script (groupdup.awk),
which groups the duplicates on a single line.

Save the scripts as separate files, make sure they have Unix-style line
endings, and make both of them executable (finddup calls groupdup.awk by
name, so groupdup.awk also has to be somewhere on your PATH).
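For example, assuming you saved both files in your home folder (adjust the
paths if you put them elsewhere), something like this in Terminal sets them
up and starts the scan:

    cd "$HOME"
    chmod +x finddup groupdup.awk
    # finddup calls groupdup.awk by name, so it has to be on the PATH:
    PATH="$PATH:$HOME" ./finddup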

Note that the first script ignores small files, as checksumming seemed
reliable only on larger files in my tests. You can experiment a little,
though: the +20 in the find command means "larger than 20 blocks of 512
bytes each", i.e. roughly 10 KB, so change that number to taste.
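If you want to see which files pass the size threshold before running the
whole scan, you can preview with find alone (this only lists candidates, it
doesn't checksum anything):

    # files larger than 20 blocks of 512 bytes, i.e. roughly 10 KB
    find "$HOME" -size +20 \! -type d | head -n 20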

Also note that this script completely ignores file names for the comparison;
it relies only on the checksum. This allows you to find duplicates that are
named differently (e.g. 'My Document' and 'My Document (Copy)').
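As a quick illustration (the file names here are made up): a copy of a file
gets the same checksum and size even though the name differs, so cksum alone
is enough to pair them up:

    cp 'My Document' 'My Document (Copy)'
    cksum 'My Document' 'My Document (Copy)'
    # both lines show the same checksum and size, only the name differs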

Good luck
Peter.

--- this is the 'finddup' file -----
#!/bin/sh
#
# Find all duplicate files in my home directory
#
# Output: a file in which each group of identical
# files is listed on a single line, the names
# separated by spaces.
# The file names are quoted.

dupdir="$HOME"
dupfile="$HOME/dupgrp.idx"

# Print the checksum of each larger file (cksum doesn't seem to
# work reliably on smaller files?), sort, and save everything in
# a temp file. Then take the first two words (checksum and file
# size), keep only the pairs that occur more than once, and look
# those lines up again in the temp file.
# Finally, groupdup.awk prints the groups out nicely.
#
find "$dupdir" -size +20 \! -type d -exec cksum {} \; | sort | tee "/tmp/$$" \
    | cut -f 1,2 -d ' ' | uniq -d | grep -hFf - "/tmp/$$" \
    | groupdup.awk > "$dupfile"
# clean up the temp file
rm -f "/tmp/$$"
# alternatively, just create the (empty) output file:
#     > "$dupfile"
---- end of file ---------
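Once the script has finished, dupgrp.idx has one line per group of identical
files. A quick way to check the result (just a sketch):

    # how many groups of duplicates were found, and a peek at the first few
    wc -l < "$HOME/dupgrp.idx"
    head -n 5 "$HOME/dupgrp.idx"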

--- this is the 'groupdup.awk' file -----
#! /usr/bin/awk -f
#
# Group file names with the same checksum on a single line.
#
# Input: sorted cksum output, one file per line.
# The first word on each line is the checksum, the second
# word (the file size) is ignored, and the rest of the line
# is the file name, so names that contain spaces are
# handled too.
#
# Output: one line per group of identical files, with the
# (quoted) file names separated by spaces.
#
BEGIN {
    chksum = 0;
}

{
    # everything after the first two fields is the file name
    # (this keeps names that contain spaces intact)
    name = $0;
    sub(/^[^ ]+ +[^ ]+ +/, "", name);

    if ( $1 == chksum ) {
        # same checksum -> print on the same line
        printf " '%s'", name;
    }
    else {
        # different checksum (unless start of file) -> start a new line
        if ( chksum != 0 ) {
            printf "\n";
        }
#       printf "%s '%s'", $1, name;
        printf "'%s'", name;
        chksum = $1;
    }
}

END {
    printf "\n";
}
----- end of file -------------
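You can also test groupdup.awk on its own by feeding it a few hand-made
lines in cksum format (checksum, size, name; the values below are made up,
and this assumes you are in the folder where you saved the script):

    printf '1111 100 a.txt\n1111 100 b.txt\n2222 50 c.txt\n' | ./groupdup.awk
    # prints:  'a.txt' 'b.txt'
    #          'c.txt'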