Servarr duplicates corrector
Six months after downloading terabytes of media, I realized that Sonarr and Radarr were copying files into my Plex library instead of creating hardlinks. This happens due to a counterintuitive mechanism: if you mount multiple folders into Sonarr/Radarr, it sees them as separate filesystems, and hardlinks cannot cross filesystem boundaries. That's why you should mount a single parent folder containing all the child folders (e.g. downloads, movies, tvseries inside a media parent folder).
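You can check the constraint yourself: hardlinks only work within a single filesystem/mount. A minimal sketch (the /downloads and /movies mounts are hypothetical examples of the broken multi-mount layout, and test.mkv is a hypothetical file):

# Compare device IDs: a different first number per line means different
# filesystems, and hardlinks between them are impossible.
stat -c '%d %m %n' /downloads /movies

# A manual attempt across mounts fails outright, which is why
# Sonarr/Radarr silently falls back to copying:
ln /downloads/test.mkv /movies/test.mkv
# ln: failed to create hard link ...: Invalid cross-device link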
So I restructured my directories and manually updated every path in qBittorrent, Plex, and the rest. The last challenge was finding a way to detect the existing duplicates, delete them, and automatically create hardlinks in their place, to save space.
My directory structure:
.
└── media
    ├── seedbox
    │   ├── radarr
    │   └── tv-radarr
    ├── movies
    └── tvseries
The originals are in seedbox and must not be modified to keep seeding. The copies (duplicates) are in movies and tvseries. To complicate things, there are also unique originals in movies and tvseries. And within those, there can be subfolders, sub-subfolders, etc.
So the idea is to:
- list the originals in seedbox
- list files in movies and tvseries
- compare both lists and isolate duplicates
- delete the duplicates
- hardlink the originals to the deleted duplicate paths
Yes, I asked ChatGPT and Qwen3 (which I host on a dedicated AI machine). Naturally, they suggested tools like rdfind, fdupes, jdupes, rmlint... But hashing 30TB of media would take days, so I quickly gave up on that.
In the end, I only needed to find .mkv files, and duplicates have the exact same name as the originals, which simplifies things a lot. A simple Bash script would do the job.
I'll spare you the endless Q&A with ChatGPT; I was disappointed. Qwen3 was much cleaner. ChatGPT kept pushing awk-based solutions, which fail on paths containing spaces. With Qwen's help, and after dropping awk, the results improved significantly.
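To see why awk trips over spaces, here is a minimal reproduction; the NUL-delimited find/read pattern shown next to it is what both scripts below rely on:

# awk splits fields on whitespace by default, so a path containing
# spaces is truncated at the first space:
printf '%s\n' '/media/tvseries/Serie 1/Season1/episode1.mkv' | awk '{print $1}'
# -> /media/tvseries/Serie

# NUL-delimited output plus read -d '' handles any filename safely:
find /media/tvseries -type f -name '*.mkv' -print0 |
while IFS= read -r -d '' f; do
    printf '%s\n' "$f"
done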
To test, I first asked for a script that only lists and compares:
#!/bin/bash

# Associative array mapping each filename to the first inode/path seen
declare -A seen

# Find all .mkv files only (exclude directories)
find /media/seedbox /media/movies /media/tvseries -type f -name "*.mkv" -print0 |
while IFS= read -r -d '' file; do
    # Get the file's inode and name
    inode=$(stat --format="%i" "$file")
    filename=$(basename "$file")

    # If the filename has been seen before...
    if [[ -n "${seen[$filename]}" ]]; then
        # ...and the inode differs from the one recorded first, report it
        if [[ "${seen[$filename]}" != "$inode" ]]; then
            # Output the duplicates with full paths
            echo "Duplicates for \"$filename\":"
            echo "${seen["$filename"]} ${seen["$filename:full_path"]}"
            echo "$inode $file"
            echo
        fi
    else
        seen[$filename]="$inode"
        seen["$filename:full_path"]="$file"
    fi
done
This gave me outputs like:
Duplicates for "episode1.mkv":
1234567 /media/seedbox/sonarr/Serie 1/Season1/episode1.mkv
2345678 /media/tvseries/Serie 1/Season1/episode1.mkv
With awk, it would've stopped at /media/seedbox/sonarr/Serie. I'm far from an expert, but Qwen3 performed better and explained everything clearly.
Once I verified the output, I asked for a complete script: compare, delete duplicates, create hardlinks.
Again, ChatGPT disappointed. Despite my requests, it created the hardlinks before deleting the duplicates, effectively linking and then immediately deleting the link (the original is kept, at least). Not helpful.
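To make the ordering problem concrete, here is the difference, with hypothetical paths:

# Broken order (what ChatGPT kept producing): force-link, then "clean up"
ln -f /media/seedbox/x.mkv /media/movies/x.mkv  # replaces the duplicate with a hardlink
rm /media/movies/x.mkv                          # ...then deletes that very hardlink

# Correct order: delete the duplicate first, then hardlink the original
# into the freed path
rm /media/movies/x.mkv
ln /media/seedbox/x.mkv /media/movies/x.mkv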
A quick stopover at Qwen3, RTX 5090 in overdrive, and bam: a much better result. Yes, it kept the ChatGPT-style emojis, but here it is:
#!/bin/bash
echo "๐ Step 1: Indexing original files in /media/seedbox..."
declare -A seen
# Index all .mkv files in seedbox
while IFS= read -r -d '' file; do
filename=$(basename "$file")
seen["$filename"]="$file"
done < <(find /media/seedbox -type f -name "*.mkv" -print0)
echo "๐ฆ Step 2: Automatically replacing duplicates..."
total_doublons=0
total_ko_saved=0
while IFS= read -r -d '' file; do
filename=$(basename "$file")
original="${seen[$filename]}"
if [[ -n "$original" && "$original" != "$file" ]]; then
inode_orig=$(stat -c %i "$original")
inode_dupe=$(stat -c %i "$file")
if [[ "$inode_orig" != "$inode_dupe" ]]; then
size_kb=$(du -k "$file" | cut -f1)
echo "๐ Replacing:"
echo " Duplicate : $file"
echo " Original : $original"
echo " Size : ${size_kb} KB"
rm "$file" && ln "$original" "$file" && echo "โ
Hardlink created."
total_doublons=$((total_doublons + 1))
total_ko_saved=$((total_ko_saved + size_kb))
fi
fi
done < <(find /media/movies /media/tvseries -type f -name "*.mkv" -print0)
echo ""
echo "๐งพ Summary:"
echo " ๐ Duplicates replaced by hardlink: $total_doublons"
echo " ๐พ Approx. disk space saved: ${total_ko_saved} KB (~$((total_ko_saved / 1024)) MB)"
echo "โ
Done."
So, in conclusion, I:
- Learned many Bash subtleties
- Learned never to blindly copy-paste a ChatGPT script without understanding and dry-running it
- Learned that Qwen on an RTX 5090 is more coherent than ChatGPT-4o on server farms (not even mentioning "normal" ChatGPT)
- Learned that even with 100TB of storage, monitoring it would've alerted me much earlier to the 12TB of duplicates lying around
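A dead-simple version of the monitoring that would have caught this earlier, as a cron-friendly sketch (the threshold, the /media mount point, and the mail alert are assumptions; swap in whatever notification channel you use):

#!/bin/bash
# Alert when the media pool crosses a usage threshold; run daily from cron.
THRESHOLD=80  # percent used before alerting

usage=$(df --output=pcent /media | tail -1 | tr -dc '0-9')
if (( usage > THRESHOLD )); then
    echo "Warning: /media is at ${usage}% capacity" |
        mail -s "Disk usage alert" admin@example.com  # hypothetical recipient
fi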