Servarr duplicates corrector
Six months after downloading terabytes of media, I realized that Sonarr and Radarr were copying files into my Plex library instead of creating hardlinks. This happens due to a counterintuitive mechanism: if you mount multiple folders into Sonarr/Radarr, it sees them as separate filesystems, and hardlinks cannot cross filesystem boundaries. That's why you should mount a single parent folder containing all the child folders (e.g. downloads, movies, tvseries inside a media parent folder).
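You can check the constraint yourself: hardlinks only work within a single filesystem/mount. A minimal sketch (the /downloads and /movies mounts are hypothetical examples of the broken multi-mount layout, and test.mkv is a hypothetical file):

# Compare device IDs: a different first number per line means different
# filesystems, and hardlinks between them are impossible.
stat -c '%d %m %n' /downloads /movies

# A manual attempt across mounts fails outright, which is why
# Sonarr/Radarr silently falls back to copying:
ln /downloads/test.mkv /movies/test.mkv
# ln: failed to create hard link ...: Invalid cross-device link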
So I restructured my directories and manually updated every path in qBittorrent, Plex, and the rest. The last challenge was finding a way to detect the existing duplicates, delete them, and automatically create hardlinks in their place, to save space.
My directory structure:
.
└── media
    ├── seedbox
    │   ├── radarr
    │   └── tv-radarr
    ├── movies
    └── tvseries
The originals are in seedbox and must not be modified to keep seeding. The copies (duplicates) are in movies and tvseries. To complicate things, there are also unique originals in movies and tvseries. And within those, there can be subfolders, sub-subfolders, etc.
So the idea is to:
- list the originals in seedbox
- list files in movies and tvseries
- compare both lists and isolate duplicates
- delete the duplicates
- hardlink the originals to the deleted duplicate paths
Yes, I asked ChatGPT and Qwen3 (which I host on a dedicated AI machine). Naturally, they suggested tools like rdfind, fdupes, jdupes, rmlint... But hashing 30TB of media would take days, so I quickly gave up on that.
In the end, I only needed to find .mkv files, and duplicates have the exact same name as the originals, which simplifies things a lot. A simple Bash script would do the job.
I'll spare you the endless Q&A with ChatGPT; I was disappointed. Qwen3 was much cleaner. ChatGPT kept pushing awk-based solutions, which fail on paths containing spaces. With Qwen's help, and after dropping awk, the results improved significantly.
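To see why awk trips over spaces, here is a minimal reproduction; the NUL-delimited find/read pattern shown next to it is what both scripts below rely on:

# awk splits fields on whitespace by default, so a path containing
# spaces is truncated at the first space:
printf '%s\n' '/media/tvseries/Serie 1/Season1/episode1.mkv' | awk '{print $1}'
# -> /media/tvseries/Serie

# NUL-delimited output plus read -d '' handles any filename safely:
find /media/tvseries -type f -name '*.mkv' -print0 |
while IFS= read -r -d '' f; do
    printf '%s\n' "$f"
done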
To test, I first asked for a script that only lists and compares:
#!/bin/bash

# Associative array mapping each filename to the first inode/path seen
declare -A seen

# Find all .mkv files only (exclude directories)
find /media/seedbox /media/movies /media/tvseries -type f -name "*.mkv" -print0 |
while IFS= read -r -d '' file; do
    # Get the file's inode and name
    inode=$(stat --format="%i" "$file")
    filename=$(basename "$file")

    # If the filename has been seen before...
    if [[ -n "${seen[$filename]}" ]]; then
        # ...and the inode differs from the one recorded first, report it
        if [[ "${seen[$filename]}" != "$inode" ]]; then
            # Output the duplicates with full paths
            echo "Duplicates for \"$filename\":"
            echo "${seen["$filename"]} ${seen["$filename:full_path"]}"
            echo "$inode $file"
            echo
        fi
    else
        seen[$filename]="$inode"
        seen["$filename:full_path"]="$file"
    fi
done
This gave me outputs like:
Duplicates for "episode1.mkv":
1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv
2345678 /media/tvseries/Serie 1/Season1/episode1.mkv
With awk, it would've stopped at /media/seedbox/sonarr/Serie. I'm far from an expert, but Qwen3 performed better and explained everything clearly.
Once I verified the output, I asked for a complete script: compare, delete duplicates, create hardlinks.
Again, ChatGPT disappointed. Despite my requests, it created the hardlinks before deleting the duplicates, effectively linking and then immediately deleting the link (the original is kept, at least). Not helpful.
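To make the ordering problem concrete, here is the difference, with hypothetical paths:

# Broken order (what ChatGPT kept producing): force-link, then "clean up"
ln -f /media/seedbox/x.mkv /media/movies/x.mkv  # replaces the duplicate with a hardlink
rm /media/movies/x.mkv                          # ...then deletes that very hardlink

# Correct order: delete the duplicate first, then hardlink the original
# into the freed path
rm /media/movies/x.mkv
ln /media/seedbox/x.mkv /media/movies/x.mkv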
A quick stopover at Qwen3, RTX 5090 in overdrive, and bam: a much better result. Yes, it kept the ChatGPT-style emojis, but here it is:
#!/bin/bash
echo "๐ Step 1: Indexing original files in /media/seedbox..."
declare -A seen
# Index all .mkv files in seedbox
while IFS= read -r -d '' file; do
filename=$(basename "$file")
seen["$filename"]="$file"
done < <(find /media/seedbox -type f -name "*.mkv" -print0)
echo "๐ฆ Step 2: Automatically replacing duplicates..."
total_doublons=0
total_ko_saved=0
while IFS= read -r -d '' file; do
filename=$(basename "$file")
original="${seen[$filename]}"
if [[ -n "$original" && "$original" != "$file" ]]; then
inode_orig=$(stat -c %i "$original")
inode_dupe=$(stat -c %i "$file")
if [[ "$inode_orig" != "$inode_dupe" ]]; then
size_kb=$(du -k "$file" | cut -f1)
echo "๐ Replacing:"
echo " Duplicate : $file"
echo " Original : $original"
echo " Size : ${size_kb} KB"
rm "$file" && ln "$original" "$file" && echo "โ
Hardlink created."
total_doublons=$((total_doublons + 1))
total_ko_saved=$((total_ko_saved + size_kb))
fi
fi
done < <(find /media/movies /media/tvseries -type f -name "*.mkv" -print0)
echo ""
echo "๐งพ Summary:"
echo " ๐ Duplicates replaced by hardlink: $total_doublons"
echo " ๐พ Approx. disk space saved: ${total_ko_saved} KB (~$((total_ko_saved / 1024)) MB)"
echo "โ
Done."
So, in conclusion, I:
- Learned many Bash subtleties
- Learned never to blindly copy-paste a ChatGPT script without understanding and dry-running it
- Learned that Qwen on an RTX 5090 is more coherent than ChatGPT-4o on server farms (not even mentioning "normal" ChatGPT)
- Learned that even with 100TB of storage, monitoring it would've alerted me much earlier to the 12TB of duplicates lying around
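A dead-simple version of the monitoring that would have caught this earlier, as a cron-friendly sketch (the threshold, the /media mount point, and the mail alert are assumptions; swap in whatever notification channel you use):

#!/bin/bash
# Alert when the media pool crosses a usage threshold; run daily from cron.
THRESHOLD=80  # percent used before alerting

usage=$(df --output=pcent /media | tail -1 | tr -dc '0-9')
if (( usage > THRESHOLD )); then
    echo "Warning: /media is at ${usage}% capacity" |
        mail -s "Disk usage alert" admin@example.com  # hypothetical recipient
fi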