De-Duping old backups
- Daniel Netwal
- Mar 12
- 3 min read

So I finally have time to get around to cleaning up my old drives and backups... There's about 10 years' worth of stuff floating out there on Dropbox, OneDrive, and old physical drives pulled from retired computers.
None of it is organized or structured in any way, shape, or form, so how am I supposed to start cleaning it up?
Well, the first thing was to get everything copied off the physical media and onto my desktop. 500+ GB later, I have millions of files ready to be reviewed. Yikes.

Ok, right off the bat I can clean things up by removing most of the junk. Anything from C:\Windows can go right out the, well, window. :D Same goes for file extensions I'm never going to need: .ini files, thumbnail databases, pretty much anything that resembles a system file.
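To give you an idea of the kind of filter I mean, here's a rough PowerShell sketch. The root path and the extension list are made-up examples, not the actual script Copilot and I ended up with:

```powershell
# Sketch: prune obvious junk before de-duping. Paths and extensions below are examples only.
$root = 'D:\BackupDump'   # hypothetical folder holding everything copied off the old drives
$junkExtensions = @('.ini', '.db', '.dll', '.sys', '.tmp', '.log')

Get-ChildItem -Path $root -Recurse -File |
    Where-Object {
        $_.FullName -match '\\Windows\\' -or               # anything that came from C:\Windows
        $junkExtensions -contains $_.Extension.ToLower()   # extensions I never want to keep
    } |
    Remove-Item -WhatIf   # -WhatIf previews the deletions; drop it once you trust the filter
```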
That brings me down to around 200k files to compare. Here's where it got tricky. The way my phone/camera saved pictures back in the day didn't use unique file names, so lots of pictures have the same name but are different images. Date and time aren't great either, because nested inside these old backups are other old backups with the same picture copied and duplicated, but with different created dates. So that got me thinking about file hashes.

You know who else was thinking about hashes? My CPU, for the next several hours, as it chewed through the script :D
But before I got there, I had to start at the beginning. I'm not a scripter or code writer; I barely know enough to get a script I find online to run. (PowerShell ISE, anyone?) That's where Visual Studio Code and GitHub Copilot come in.
If you haven't used an AI tool in the past three months, you're really missing out. It was insane how easy it was to write natural-language prompts and work through a couple of different versions of the script to get to what you see here.

The first version was very boring and didn't provide any details on what it was doing, which made it appear that the script had hung, when in fact it was hard at work calculating the hashes of thousands of files.

Don't mind the antimalware; it wouldn't be a true backup without a few unsavory service utilities in there, now would it?
But still, things were slow... How does a hash value work, anyway?
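The short version, as I understand it: a hash is a fixed-length fingerprint calculated from a file's contents, so two files with identical bytes produce identical hashes no matter what they're named or when they were created. You can see it for yourself with PowerShell's built-in Get-FileHash (the file names here are just for illustration):

```powershell
# Two copies of the same photo, different names and timestamps, still hash identically.
Copy-Item -Path 'IMG_0001.jpg' -Destination 'IMG_0001 - Copy.jpg'   # hypothetical files

Get-FileHash -Path 'IMG_0001.jpg', 'IMG_0001 - Copy.jpg' -Algorithm SHA256 |
    Format-Table Hash, Path
# Matching Hash values mean identical bytes, i.e. duplicates, regardless of name or dates.
```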

Ok, great, but I still couldn't see what was going on... so I asked Copilot to add some verbose logging to the script. It did, easily. It still looked like the script had hung because it took so long to build the duplicate list and then compare, so I asked for another change: please break the script down into multiple functions so it generates a list of duplicate file names first, then only computes and compares the hash values for that list. It did it.
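The idea looks roughly like this. This is my own reconstruction of the approach, not the exact script Copilot produced, and the root folder is made up:

```powershell
# Phase 1: find file names that show up more than once (cheap - no hashing yet).
function Get-DuplicateNameGroups {
    [CmdletBinding()]
    param([string]$Root)
    Get-ChildItem -Path $Root -Recurse -File |
        Group-Object Name |
        Where-Object { $_.Count -gt 1 }
}

# Phase 2: hash only those candidates, then group by hash to find true duplicates.
function Get-TrueDuplicates {
    [CmdletBinding()]
    param([object[]]$NameGroups)
    $NameGroups.Group |
        ForEach-Object {
            Write-Verbose "Hashing $($_.FullName)"
            Get-FileHash -Path $_.FullName -Algorithm SHA256
        } |
        Group-Object Hash |
        Where-Object { $_.Count -gt 1 }
}

$groups     = Get-DuplicateNameGroups -Root 'D:\BackupDump'    # hypothetical root folder
$duplicates = Get-TrueDuplicates -NameGroups $groups -Verbose  # -Verbose shows each file as it's hashed
```

Hashing only the name collisions instead of all 200k files is what keeps the runtime from being even worse.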

Now we were really cooking, but it still didn't display much about what was going on. So I asked it to work through batches of 10 name matches at a time and spit out the file names as it compared their hashes. This is where my terminal window blew up, and it has been spewing out progress information for the last 2 hours.
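For the curious, the batching looks something like this sketch. The batch size and the $groups variable carry over from the sketch above, and none of this is the verbatim Copilot output:

```powershell
# Sketch: walk the duplicate-name groups 10 at a time, printing each file as it gets hashed.
$batchSize = 10
for ($i = 0; $i -lt $groups.Count; $i += $batchSize) {
    $batch = $groups[$i..([Math]::Min($i + $batchSize, $groups.Count) - 1)]
    Write-Host "Batch $(($i / $batchSize) + 1): comparing $($batch.Count) name matches"

    foreach ($group in $batch) {
        foreach ($file in $group.Group) {
            Write-Host "  Hashing $($file.FullName)"
            $null = Get-FileHash -Path $file.FullName -Algorithm SHA256
        }
    }
}
```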
I know it's not elegant, and I know I could have cleaned up my folders a little better before feeding them through the dedupe wood chipper... But for someone who barely knows how to write PowerShell, and who only spent 30 minutes co-authoring this script with GitHub Copilot... I think it's great.
I've got thousands of iPhoto images to deal with next... but I kid you not, as I sit here typing, the script finally finished! Down to only 181 GB on the disk. Now I need to write a script to delete all the empty folders left behind...
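I'll probably ask Copilot for that one too, but here's roughly what I expect it to look like. An untested sketch, using the same hypothetical root folder as above:

```powershell
# Sketch: remove folders that contain no files, deepest paths first so parents empty out too.
$root = 'D:\BackupDump'   # hypothetical root folder

Get-ChildItem -Path $root -Recurse -Directory |
    Sort-Object { $_.FullName.Length } -Descending |
    Where-Object { -not (Get-ChildItem -Path $_.FullName -Recurse -File) } |
    Remove-Item -WhatIf   # preview first; drop -WhatIf to actually delete
```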

Hey Copilot!! I've got another question!

