Bash Scripts
A few random scripts that saved my life.
Detecting Duplicates and Replacing Them with Hardlinks
Six months after downloading terabytes of media, I realized that Sonarr and Radarr were copying them into my Plex library instead of creating hardlinks. This happens due to a counterintuitive mechanism: if you mount multiple folders in Sonarr/Radarr, it sees them as different filesystems and thus cannot create hardlinks, because a hardlink can only exist within a single filesystem. That’s why you should mount only one parent folder containing all child folders (like downloads, movies, tvseries inside a media parent folder).
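A quick way to check whether hardlinks are possible between two folders, as seen from inside the application, is simply to try one. The sketch below is an illustration only: can_hardlink is a hypothetical helper name, and it assumes write access to both directories and a mktemp that accepts a template path.

```bash
# Hypothetical helper: returns 0 if a hardlink can be created from a file
# in "$1" to a name in "$2", and non-zero otherwise.
can_hardlink() {
    local src_dir="$1" dst_dir="$2" probe link

    # Create an empty probe file in the source directory.
    probe=$(mktemp "$src_dir/.hardlink-probe.XXXXXX") || return 2
    link="$dst_dir/$(basename "$probe")"

    # Try the link; clean up both names whatever the outcome.
    if ln "$probe" "$link" 2>/dev/null; then
        rm -f "$probe" "$link"
        return 0
    fi
    rm -f "$probe"
    return 1
}
```

For example, `can_hardlink /media/seedbox /media/movies && echo "hardlinks OK"` would confirm that the two folders from this post can share hardlinks.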
So I restructured my directories and manually updated every path in qBittorrent, Plex, and the others. The last challenge was finding a way to detect existing duplicates, delete them, and automatically create hardlinks in their place to save space.
My directory structure:
.
└── media
    ├── seedbox
    ├── radarr
    │   └── tv-radarr
    ├── movies
    └── tvseries
The originals are in seedbox and must not be modified, so that they keep seeding. The copies (duplicates) are in movies and tvseries. To complicate things, there are also unique originals in movies and tvseries. And within those, there can be subfolders, sub-subfolders, etc.
So the idea is to:
- list the originals in seedbox
- list files in movies and tvseries
- compare both lists and isolate duplicates
- delete the duplicates
- hardlink the originals to the deleted duplicate paths
Yes, I asked ChatGPT and Qwen3 (which I host on a dedicated AI machine). Naturally, they suggested tools like rdfind, fdupes, jdupes, rmlint... But hashing 30TB of media would take days, so I gave up quickly.
In the end, I only needed to find .mkv files, and duplicates have the exact same name as the originals, which simplifies things a lot. A simple Bash script would do the job.
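Matching on the name alone could in principle pair two different files, for example an episode1.mkv from two different series. A cheap extra guard, which reads only metadata and hashes nothing, is to also compare byte sizes. This is a sketch under assumptions: likely_same_file is a hypothetical helper, it relies on GNU stat (`-c %s`), and equal name plus equal size is a strong hint rather than proof of identical content.

```bash
# Hypothetical helper: returns 0 when both paths share a basename
# and have the same size in bytes (GNU stat assumed).
likely_same_file() {
    local a="$1" b="$2"

    # Names must match exactly.
    [ "$(basename -- "$a")" = "$(basename -- "$b")" ] || return 1

    # Sizes must match exactly; this reads the inode only, not the content.
    [ "$(stat -c %s -- "$a")" = "$(stat -c %s -- "$b")" ]
}
```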
I’ll spare you the endless Q&A with ChatGPT: I was disappointed. Qwen3 was much cleaner. ChatGPT kept pushing awk-based solutions, which split fields on whitespace by default and therefore break on paths with spaces. With Qwen’s help and after dropping awk, the results improved significantly.
To test, I first asked for a script that only lists and compares:
#!/bin/bash
# Create an associative array to store duplicates
declare -A seen
# Find all .mkv files only (exclude directories)
find /media/seedbox /media/movies /media/tvseries -type f -name "*.mkv" -print0 | \
while IFS= read -r -d '' file; do
    # Get the file's inode and name
    inode=$(stat --format="%i" "$file")
    filename=$(basename "$file")
    # If the filename has been seen before
    if [[ -n "${seen[$filename]}" ]]; then
        # Check if the inode is different from the previous one
        if [[ "${seen[$filename]}" != "$inode" ]]; then
            # Output the duplicates with full paths
            echo "Duplicates for \"$filename\":"
            echo "${seen["$filename"]} ${seen["$filename:full_path"]}"
            echo "$inode $file"
            echo
        fi
    else
        seen[$filename]="$inode"
        seen["$filename:full_path"]="$file"
    fi
done
This gave me outputs like:
Duplicates for "episode1.mkv":
1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv
2345678 /media/tvseries/Serie 1/Season1/episode1.mkv
With awk, it would’ve stopped at /media/seedbox/sonarr/Serie. I’m far from an expert, but Qwen3 performed better and explained everything clearly.
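To make the awk problem concrete, here is a small sketch using the sample line from the output above. It only illustrates default whitespace splitting; it is not the script ChatGPT actually produced, and the variable names are mine.

```bash
line='1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv'

# awk splits on every run of whitespace, so field 2 ends at the first
# space inside the path.
awk_path=$(printf '%s\n' "$line" | awk '{print $2}')

# Splitting only at the first space keeps the rest of the line intact.
inode=${line%% *}   # everything before the first space
path=${line#* }     # everything after the first space

printf 'awk  : %s\n' "$awk_path"
printf 'shell: %s\n' "$path"
```

The first line printed ends at Serie, while the second keeps the full path including "Serie 1".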
Once I verified the output, I asked for a complete script: compare, delete duplicates, create hardlinks.
Again, ChatGPT disappointed. Despite my requests, it created hardlinks before deleting the duplicates, effectively linking and then deleting the link (though the original is kept). Not helpful.
A quick stopover at Qwen3, RTX 5090 in overdrive, and bam: a much better result. Yes, it kept ChatGPT-style emojis, but here it is:
#!/bin/bash
echo "🔍 Step 1: Indexing original files in /media/seedbox..."
declare -A seen
# Index all .mkv files in seedbox
while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    seen["$filename"]="$file"
done < <(find /media/seedbox -type f -name "*.mkv" -print0)
echo "📦 Step 2: Automatically replacing duplicates..."
total_doublons=0
total_ko_saved=0
while IFS= read -r -d '' file; do
    filename=$(basename "$file")
    original="${seen[$filename]}"
    if [[ -n "$original" && "$original" != "$file" ]]; then
        inode_orig=$(stat -c %i "$original")
        inode_dupe=$(stat -c %i "$file")
        if [[ "$inode_orig" != "$inode_dupe" ]]; then
            size_kb=$(du -k "$file" | cut -f1)
            echo "🔁 Replacing:"
            echo " Duplicate : $file"
            echo " Original : $original"
            echo " Size : ${size_kb} KB"
            rm "$file" && ln "$original" "$file" && echo "✅ Hardlink created."
            total_doublons=$((total_doublons + 1))
            total_ko_saved=$((total_ko_saved + size_kb))
        fi
    fi
done < <(find /media/movies /media/tvseries -type f -name "*.mkv" -print0)
echo ""
echo "🧾 Summary:"
echo " 🔗 Duplicates replaced by hardlink: $total_doublons"
echo " 💾 Approx. disk space saved: ${total_ko_saved} KB (~$((total_ko_saved / 1024)) MB)"
echo "✅ Done."
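One caveat with the rm-then-ln line: if ln fails for any reason, the copy is already gone from the library (the original in seedbox is still safe). A possible variation, not what Qwen3 gave me, is to create the link under a temporary name next to the duplicate and then rename it over the duplicate. In the sketch below, replace_with_hardlink and the DRY_RUN variable are hypothetical names, and DRY_RUN defaults to 1 so nothing changes unless you explicitly set DRY_RUN=0.

```bash
# Hypothetical helper: replace "$2" (duplicate) with a hardlink to "$1" (original).
# Set DRY_RUN=0 to actually modify files; any other value only prints.
replace_with_hardlink() {
    local original="$1" duplicate="$2"
    local tmp="${duplicate}.hardlink-tmp.$$"

    # Same device and inode already: nothing to do.
    if [ "$original" -ef "$duplicate" ]; then
        return 0
    fi

    if [ "${DRY_RUN:-1}" != 0 ]; then
        printf 'would replace: %s -> %s\n' "$duplicate" "$original"
        return 0
    fi

    # Create the link under a temporary name first; if this fails,
    # the duplicate has not been touched.
    ln -- "$original" "$tmp" || return 1

    # Rename the temporary link over the duplicate; remove it if the rename fails.
    mv -f -- "$tmp" "$duplicate" || { rm -f -- "$tmp"; return 1; }
}
```

In the loop above, the `rm ... && ln ...` line could then become `replace_with_hardlink "$original" "$file"`, run first with the default dry-run setting to review what would change.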
So, in conclusion, I:
- Learned many Bash subtleties
- Learned never to blindly copy-paste a ChatGPT script without understanding and dry-running it
- Learned that Qwen on an RTX 5090 is more coherent than ChatGPT-4o on server farms (not even mentioning “normal” ChatGPT)
- Learned that even with 100TB of storage, monitoring it would’ve alerted me much earlier to the 12TB of duplicates lying around
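On the monitoring point, one rough indicator would be the total size of .mkv files in the library that have a link count of 1, meaning they are not hardlinked to anything. This is only a sketch: unlinked_mkv_bytes is a hypothetical helper, it assumes GNU find (`-printf`), and since unique originals also have a single link, the figure is an upper bound on wasted space rather than a duplicate count.

```bash
# Hypothetical helper: print the total size in bytes of .mkv files under the
# given directories whose link count is 1 (GNU find assumed).
unlinked_mkv_bytes() {
    find "$@" -type f -name '*.mkv' -links 1 -printf '%s\n' |
        awk '{ total += $1 } END { printf "%.0f\n", total }'
}
```

Something like `unlinked_mkv_bytes /media/movies /media/tvseries` run periodically would show whether that number keeps growing.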
Catch you next time for more exciting adventures.