Dealing With Duplicate Files


In the physical world, we encounter much difficulty because it’s hard to create copies of things: objects use finite resources and are expensive to produce, we have to physically repeat tasks over and over to do them multiple times, and so on. Ironically, in the digital world, many problems instead stem from how easy it is to copy things: some people make unauthorized copies of media and anger the distribution companies; others get doxxed and material they never wanted to share with anyone makes its way to millions of people; and all of us end up with four copies of the same set of files wasting our hard drive space and preventing us from remembering where we put things!

Fortunately, when it comes to extra files on our computers, there are tools to help us identify and clean up the duplicates. The computer can’t magically fix your duplicates for you, since they’re a symptom of bad organization rather than a problem in themselves, but it can help out a lot.

I discuss two free tools here, dupeGuru (Windows/Mac/Linux) and rdfind (Mac/Linux). I’m a big fan of rdfind. It is lightning-fast, easily scriptable, and very flexible in terms of specifying what you want to do with duplicated files. However, it does not work on Windows and is slightly irritating to install on a Mac, and it only offers a command-line interface (if typing commands scares you, start with dupeGuru, but also check out why typing commands is like using language and can often be more natural than point-and-click).

If you aren’t interested in scripting, don’t expect to be doing duplicate checks often, don’t want to link duplicate files to one reference copy, and don’t mind manually selecting files to keep when several copies come up, dupeGuru is easier to learn, still quite powerful, and probably a better choice overall. And if you’re using Windows, go with dupeGuru; trying to get rdfind to work on Windows is not worth it.

There are other tools out there. dupeGuru and rdfind both search by file content; whatever tool you choose, don’t waste your time and risk getting confused with a tool that only checks file names for duplicates. Matching on names alone produces vast numbers of both false positives (e.g., 15 files called report.pdf that are totally unrelated) and false negatives (e.g., 50 copies of the same file called something different in each place).
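To make the difference concrete, here is a small sketch that compares files by content using cmp. The folder and file names are made up for the demonstration, and this is only an illustration of the idea, not a description of how dupeGuru or rdfind work internally.

```shell
# Throwaway folder with hypothetical files, so nothing real is touched.
demo=$(mktemp -d)
mkdir "$demo/a" "$demo/b"
printf 'Q1 sales\n' > "$demo/a/report.pdf"
printf 'Q2 sales\n' > "$demo/b/report.pdf"       # same name, different content
cp "$demo/a/report.pdf" "$demo/b/q1-final.pdf"   # different name, same content

same_content() {
    # cmp -s prints nothing and succeeds only if the files are byte-for-byte identical.
    if cmp -s "$1" "$2"; then echo "duplicate"; else echo "different"; fi
}

same_name=$(same_content "$demo/a/report.pdf" "$demo/b/report.pdf")
other_name=$(same_content "$demo/a/report.pdf" "$demo/b/q1-final.pdf")
echo "report.pdf vs report.pdf:   $same_name"
echo "report.pdf vs q1-final.pdf: $other_name"
rm -rf "$demo"
```

A name-only tool would get both of these comparisons wrong.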

What to do with duplicate files

Before we look at our tools, let’s consider what we want to do with them. As alluded to earlier, there is no single answer to how duplicate files should be handled. The right answer depends on why the files are duplicates, so it’s something you’ll have to think through yourself. However, let’s go through what options you have once you’ve identified duplicate files using one of these tools.

Ignore the duplicates

Sometimes you have files that are technically duplicates but aren’t a big deal. For instance, some programs might create configuration files that only contain a few words and might occasionally be the same from program to program or be stored in multiple places. These don’t take up much hard drive space and they aren’t confusing you, and figuring out how to clean them up without breaking the programs could be challenging, so ignoring them is the best option.

Reorganize your filesystem

Sometimes finding duplicates will allow you to realize that you have virtually the same files in multiple locations, and when you investigate the similarities you can merge together or outright delete some folders. This is a great outcome – you’re saving space and making big improvements on your organization.

Delete the files

Maybe you found a 20GB folder of backups from 2006 (I still have an unpleasant number of these lying around!) and it contains hundreds of files you have somewhere else. You don’t need that backup from 2006 as a backup anymore (if you think you might, go make an actual backup!), so you can just plain delete those files.

Replace the files with symbolic links

This can be a handy option if you discover you are storing files in multiple places because they actually make sense in multiple places. For an example, see my system for organizing my photo folders, or the reports example below. However, picking symlinks without actually looking at the files you’re linking is the worst of all possible choices! As I’ve mentioned before, for good results, links should be used only to augment an already decent organizational system rather than substituting for one.

Unfortunately, dupeGuru cannot create links, so if you’re not willing or able to use rdfind or go hunting for another duplicate-finding tool altogether, you’ll have to create any needed links by hand or choose another option.
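If you do end up creating a symbolic link by hand, it might look roughly like the following. This is a sketch in a throwaway folder with hypothetical file names; on a real system you would substitute your own paths and check the files first.

```shell
# Throwaway folder with a hypothetical report and one extra copy of it.
demo=$(mktemp -d)
mkdir "$demo/reports" "$demo/old-job"
printf 'annual report\n' > "$demo/reports/Z.docx"
cp "$demo/reports/Z.docx" "$demo/old-job/Z.docx"

# Only replace the copy if it really is identical to the one we are keeping.
if cmp -s "$demo/reports/Z.docx" "$demo/old-job/Z.docx"; then
    # -s makes a symbolic link, -f replaces the existing duplicate.
    # The target is written relative to the folder that will hold the link.
    ln -sf ../reports/Z.docx "$demo/old-job/Z.docx"
fi

link_target=$(readlink "$demo/old-job/Z.docx")
linked_text=$(cat "$demo/old-job/Z.docx")
echo "old-job/Z.docx -> $link_target"
rm -rf "$demo"
```

Reading old-job/Z.docx still works afterwards, but the content now lives only in reports.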

Replace the files with hard links

This will keep all the files in place as before, but make them all point to the same data on disk. All the usual drawbacks of hard links apply (see the link above). There are good use cases for this (for instance, making backup snapshots or allowing users to upload popular files to a website and store them in their own accounts without wasting space), but unless you know you have one of the classic good use cases, you usually should not choose this option. Again, only rdfind and not dupeGuru can create hard links from your duplicates.
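Here is a rough sketch of what “the same data on disk” means in practice, again in a throwaway folder with made-up file names. It also shows one of the drawbacks: an in-place edit made through one name shows up under the other.

```shell
# Throwaway folder with a hypothetical photo and one extra copy of it.
demo=$(mktemp -d)
printf 'holiday photo\n' > "$demo/original.jpg"
cp "$demo/original.jpg" "$demo/copy.jpg"

# Replace the copy with a hard link: a second name for the same data on disk.
ln -f "$demo/original.jpg" "$demo/copy.jpg"

# -ef is true when both names refer to the same file (same device and inode).
if [ "$demo/original.jpg" -ef "$demo/copy.jpg" ]; then
    same_data="yes"
else
    same_data="no"
fi

# Appending through one name changes what you see through the other name too.
printf 'edited\n' >> "$demo/copy.jpg"
original_lines=$(wc -l < "$demo/original.jpg" | tr -d ' ')
echo "same data: $same_data, lines in original.jpg after editing copy.jpg: $original_lines"
rm -rf "$demo"
```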

Using dupeGuru

You’ll first want to download dupeGuru from its official website, for Windows, Linux, or Mac OS X as appropriate for your computer. After starting dupeGuru, you’ll see something like this:

Screenshot of the opening page of DupeGuru

Click the + button and choose the folder you want to scan for duplicates. You may also want to specify some folders as “references” by clicking the drop-down in the State column, or exclude them from the check entirely. Reference folders are much like rdfind’s concept of an “original” folder; the files in these folders will always be considered the definitive copy and never offered for deletion or other treatment. Here’s an example of scanning my main hard drive (there’s no reason behind my exclude or reference choices, I just put them in as an example):

Screenshot of the opening page of DupeGuru, with my options selected

You can add multiple folders if you need to. Then choose Scan. Once the scan completes, you’ll see a list of duplicated files. The top entry in each grouping is the “reference” file and can’t be selected (according to the documentation, one that’s in a “reference” folder if you specified one earlier, or else whichever file is the largest, or else I guess by black magic if they’re all the same size). The remaining (indented) entries are duplicate copies of that file. You can see that I’ve highlighted here three backups of a wiki page that contain the exact same content:

Screenshot of the dupeGuru results list, with three duplicate backups of a wiki page highlighted

There are a number of fancy ways to select files; for instance, you can use the Filter in the upper-right and then mark all the files that match the search (using Edit → Mark All). Typically, you will just want to run through the list and check off the files you want to delete. When you’re done, choose Actions → Send Marked to Trash. You can also choose to move them or even copy them (if you want to create even more duplicates?). I have to admit I’m kind of confused about the use case for these options, but they’re there if you can think of one.

Note: dupeGuru can also do “fuzzy matching” on music and photos – finding files which are very similar but not quite identical. I haven’t tried this function myself, but I’ve heard mixed reviews on its effectiveness and accuracy. If you want to try it, you can select this option on the initial screen when you start a scan with dupeGuru.

Using rdfind

Installation

On Linux, rdfind is readily available in most package repositories (e.g., on Debian-based distributions like Ubuntu and Linux Mint, run sudo apt install rdfind). If it’s not in yours, you can download from GitHub and compile from source.

On macOS, you’ll want to install rdfind through MacPorts. If you don’t already have it installed, first follow the instructions to install MacPorts. Then open the Terminal app and run the following commands, pressing Enter after each and providing your password on request:

sudo port selfupdate
sudo port install rdfind

Go grab a cup of tea – it can take a few minutes to install everything. When it finishes, you should be able to type rdfind and get a help message.

rdfind basics

The easiest way to find duplicates is to change into the folder you want to search (an easy way: type cd followed by a space, drag and drop the folder out of Finder or your file browser into the terminal window, and hit Enter) and run rdfind . (the dot indicates you want to search the folder you’re in). This will create a text file called results.txt in this folder, listing one file per line. The display looks intimidating at first, but you don’t need to understand most of it. Here’s an example:

# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE 492 3 920 436207633 86712 1 ./.mypy_cache/3.6/urllib/__init__.data.json
DUPTYPE_WITHIN_SAME_TREE -492 4 920 436207633 447327 1 ./actions/.mypy_cache/3.6/urllib/__init__.data.json
DUPTYPE_FIRST_OCCURRENCE 427 6 1097 436207633 86724 1 ./.mypy_cache/3.6/requests/packages/urllib3/packages/__init__.data.json
DUPTYPE_WITHIN_SAME_TREE -427 7 1097 436207633 447339 1 ./actions/.mypy_cache/3.6/requests/packages/urllib3/packages/__init__.data.json
DUPTYPE_FIRST_OCCURRENCE 451 3 1346 436207633 86734 1 ./.mypy_cache/3.6/requests/compat.data.json
DUPTYPE_WITHIN_SAME_TREE -451 4 1346 436207633 447349 1 ./actions/.mypy_cache/3.6/requests/compat.data.json

The first column lists what “type” of file it is. The first file found shows DUPTYPE_FIRST_OCCURRENCE, while all other instances of the same file will indicate different types (exactly which type is usually not important). The last column is the path to this file.
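Since only the first and last columns matter for most purposes, a short awk filter can boil results.txt down to just the extra copies. The sketch below runs against two lines copied from the listing above rather than a real results.txt, and it assumes the path starts at the eighth column, as the header comment suggests.

```shell
# Sample lines copied from the results.txt listing above.
results=$(mktemp)
cat > "$results" <<'EOF'
# Automatically generated
# duptype id depth size device inode priority name
DUPTYPE_FIRST_OCCURRENCE 492 3 920 436207633 86712 1 ./.mypy_cache/3.6/urllib/__init__.data.json
DUPTYPE_WITHIN_SAME_TREE -492 4 920 436207633 447327 1 ./actions/.mypy_cache/3.6/urllib/__init__.data.json
EOF

# Skip comment lines and first occurrences, then print the path.
# The path is rebuilt from field 8 onward so single spaces in names survive
# (runs of several spaces inside a name would be collapsed to one).
extra_copies=$(awk '!/^#/ && $1 != "DUPTYPE_FIRST_OCCURRENCE" {
    path = $8
    for (i = 9; i <= NF; i++) path = path " " $i
    print path
}' "$results")
printf '%s\n' "$extra_copies"
rm -f "$results"
```

To try it on your own scan, point awk at the results.txt that rdfind wrote instead of the temporary file.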

If you have only a few duplicates, or the listing points to problems in your organizational system that you can fix, you can deal with them manually and then rerun the check to confirm you got them all. If you have a bunch, you’ll probably want rdfind to delete them or create links for you. To do so, run rdfind again, including -makesymlinks true, -makehardlinks true, or -deleteduplicates true in the command (e.g., rdfind -deleteduplicates true .). If you’re not sure which to use, review the What to do with duplicate files section earlier on this page.

One place most duplicate-finding tools fall flat is deciding which file to delete when there are multiple copies. You either have to manually specify which to delete, or the tool picks one essentially at random (in the better tools, like dupeGuru, you can at least choose specific folders whose files should not be deleted).

rdfind makes this easy and customizable by introducing the concept of a ranking. Above we gave rdfind just one folder to search, ., but we can actually give it a list of folders separated by spaces, like rdfind folder1 folder2 folder3. We can also use wildcards, like rdfind special-folder * (list special-folder first, then everything in the folder alphabetically – “everything” includes special-folder itself, but rdfind is smart and will ignore it the second time if we list it twice). Files in folders which come first on the list are ranked higher, and files which are higher up in the folder tree are ranked higher (e.g., folder1/file is ranked above folder1/subfolder1/file). Whichever file has the highest ranking is considered the “original”, and all other copies are deleted or replaced with links to it, depending on the option you chose. Thus, you simply make sure the copies you want to keep come first in the list and rdfind does exactly what you need, automatically. This is so much easier than any other tool I’ve ever used that I’m baffled it’s not a more common approach. See below for an example.

Warning: If these rules don’t place at least one copy ahead of the others, the ranking is undefined and rdfind will choose which copy to keep itself (the choice will probably be consistent if you run rdfind multiple times, but which file is chosen depends on the order your operating system decides to list them in).
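The ranking rule can be imitated with sort to get a feel for which copy would win. As I read the rdfind manual, position on the command line is compared first and depth second; treat that ordering, and the made-up candidate list, as assumptions of this sketch. It only mimics the rule and says nothing about how rdfind is actually implemented.

```shell
# Hypothetical copies of one file, described as:
#   <position of its folder on the command line> <depth> <path>
# for an imagined run of: rdfind folder1 folder2
candidates='2 1 folder2/file
1 2 folder1/subfolder1/file
1 1 folder1/file'

# Sort by command-line position, then by depth; the first line is the "original".
original=$(printf '%s\n' "$candidates" | sort -k1,1n -k2,2n | head -n 1 | cut -d' ' -f3-)
echo "kept as original: $original"
```

If two candidates tie on both columns, this sketch has no meaningful way to choose between them, which is the situation the warning above describes.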

Example cleanup with rdfind

While rdfind is more flexible than dupeGuru, it’s a bit harder to learn without a visual interface where you can easily see what’s going on. Let’s take a concrete look at one way we could clean up some duplicates.

Imagine we have the following folder structure (this example is not a perfect folder structure, but bear with me):

|- documents
|  |- budget
|  |- current-job
|  |- old-job
|  |- reports
|- pictures
|  |- backups
|  |- family
|     |- my-favorite-event
|     |- ...
|  |- professional
|     |- another-great-event
|     |- ...

And this results.txt from running our first rdfind . at the top of this tree:

DUPTYPE_FIRST_OCCURRENCE ... documents/old-job/Z.docx
DUPTYPE_WITHIN_SAME_TREE ... documents/reports/2018/Z.docx
... more reports like this
DUPTYPE_FIRST_OCCURRENCE ... pictures/backups/A.jpg
DUPTYPE_WITHIN_SAME_TREE ... pictures/family/my-favorite-event/A.jpg
DUPTYPE_FIRST_OCCURRENCE ... pictures/backups/C.jpg
DUPTYPE_WITHIN_SAME_TREE ... pictures/professional/another-great-event/C.jpg
... more pictures like this

After looking at what duplicates we have above, we’ve reviewed our requirements and decided we want to keep the definitive copy of every document in reports and replace any copies in the other documents folders with symbolic links pointing there. For our pictures, we want to keep those in our family and professional folders and delete anything in the backups folder that we have an identical copy of somewhere else.

Here’s how we could do that. First, we enter the documents folder and run rdfind, setting the reports folder as the definitive copy by placing it first and placing everything else next so any duplicates will be turned into links.

cd documents
rdfind -makesymlinks true -dryrun true reports *

The -dryrun option will cause rdfind to change nothing but instead spit out a list of all the changes it would have made. We double-check that list looks right, then remove the -dryrun true and run again. This might take a little longer, but it takes a lot longer if you mess something up! As the saying goes, measure twice, cut once…
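If you’d like to convince yourself that a dry run really leaves things alone before trusting it with real files, you could try it in a sandbox first. The folder and file names below are hypothetical stand-ins for the example tree, and the rdfind step is skipped if rdfind isn’t installed.

```shell
# Sandbox with a hypothetical report and one extra copy of it.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/reports" "$sandbox/old-job"
printf 'quarterly numbers\n' > "$sandbox/reports/Z.docx"
cp "$sandbox/reports/Z.docx" "$sandbox/old-job/Z.docx"

if command -v rdfind >/dev/null 2>&1; then
    # Same options as in the example above, run inside the sandbox.
    (cd "$sandbox" && rdfind -makesymlinks true -dryrun true reports old-job)
else
    echo "rdfind is not installed; skipping the dry run"
fi

# After a dry run, the copy should still be a regular file rather than a symlink.
if [ -f "$sandbox/old-job/Z.docx" ] && [ ! -L "$sandbox/old-job/Z.docx" ]; then
    dry_run_result="unchanged"
else
    dry_run_result="changed"
fi
echo "old-job/Z.docx is $dry_run_result"
rm -rf "$sandbox"
```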

Now for the pictures.

cd ../pictures
rdfind family professional
rdfind -deleteduplicates true -dryrun true family professional backups
rdfind -deleteduplicates true family professional backups

cd ../pictures moves into the parent folder (that is, the top level of our example filesystem) and then enters the pictures folder.

We then use rdfind on all the folders that are not backups (viz., family and professional). We are hoping to get no duplicates at all here, because our requirement was to delete only things in the backups folder. If there are duplicates here that don’t have copies in backups as well, the ones in professional will be deleted in favor of the first copies in family, since family is the first folder in the list that contains one of the copies. Maybe that’s acceptable, but it’s not what we identified as our goal, so it would be nice to know if we need to adapt our goal before the computer just changes things without telling us! If we did find duplicates, we would want to figure out why we have the same photos in our family and professional folders and sort that out before continuing.
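With hundreds of files, eyeballing results.txt for this check gets tedious, so one option is to count the extra copies and stop if there are any. The results file below is a hypothetical stand-in containing only the two header comments; in practice you would point awk at the results.txt written by rdfind family professional.

```shell
# Hypothetical results file for a scan that found no duplicates at all.
results=$(mktemp)
printf '%s\n' '# Automatically generated' \
    '# duptype id depth size device inode priority name' > "$results"

# Count lines that are not comments, not blank, and not first occurrences.
extra=$(awk '!/^#/ && NF && $1 != "DUPTYPE_FIRST_OCCURRENCE" { n++ } END { print n + 0 }' "$results")

if [ "$extra" -eq 0 ]; then
    verdict="safe to continue"
else
    verdict="stop and investigate $extra extra copies"
fi
echo "$verdict"
rm -f "$results"
```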

Note: We don’t technically have to do this step in this case, as there were only 6 duplicates in our example list and we can plainly see that the only duplicates are between backups and another folder. However, we would definitely want to double-check if there were hundreds of files. In fact, we should have done this double-check when we symlinked the reports earlier as well; I just didn’t want to overcomplicate the first one.

Finally, we actually delete the duplicates, doing a dry-run step first to double-check our work as before. Here we had to list out all the folders rather than using * – if we had put * first, backups would have been put before family and professional since * lists in alphabetical order.