2013-08-27 52 views
2

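How to compare text files and delete duplicates (Linux terminal command)

Suppose I have two directories, named dir_one and dir_two. Each directory contains a text file named data.txt. In other words, there are two files in two separate directories: /dir_one/data.txt and /dir_two/data.txt. Despite the identical file names, the two text files may or may not have the same content.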

What I am trying to do is this:

  1. Compare the contents of the text files ./dir_one/data.txt and ./dir_two/data.txt.
  2. If the contents are identical, delete one of the text files.

I entered the following at the command terminal:

diff -qrs ./dir_one/data.txt ./dir_two/data.txt 

and received the following message:

Files ./dir_one/data.txt and ./dir_two/data.txt are identical

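Now that I know the two text files are identical, I can delete one of them with the rm command. So far, so good. However...

The problem is that I want to automate the deletion. I don't want to type rm at the command line. Is there a way to do this, for example in a script?

I would also like to know how to compare a large set of text files in one directory against a large set of text files in another directory. Again, for any files found to be identical, one of the duplicates should be deleted. Is this possible?

I have found similar questions, but none about automatically deleting one of the duplicate files. Note that I am using Ubuntu 12.04.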

Answers

4

You need fdupes.

fdupes -r /some/directory/path > /some/directory/path/fdupes.log 

Enjoy!
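The command above only writes the list of duplicate sets to a log file; it deletes nothing. For the automatic deletion the question asks about, fdupes also has the -d (--delete) and -N (--noprompt) options, which together keep the first file of each duplicate set and delete the rest without asking. A minimal sketch, assuming fdupes is installed (on Ubuntu, `sudo apt-get install fdupes`); the wrapper name `remove_duplicates` is made up for illustration, and which copy counts as "first" is decided by fdupes, so run the preview before deleting:

```shell
#!/bin/bash
# remove_duplicates DIR...  (hypothetical wrapper, not part of fdupes)
# Deletes duplicate files found under the given directories,
# keeping one file from each set of identical files.
remove_duplicates() {
    # -r  recurse into subdirectories
    # -d  delete duplicates
    # -N  with -d: do not prompt, keep the first file of each set
    fdupes -r -d -N "$@"
}

# Usage, with the directory names from the question:
#   fdupes -r ./dir_one ./dir_two          # preview only, nothing is deleted
#   remove_duplicates ./dir_one ./dir_two  # delete all but one copy
```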

+0

Thanks, @UberDoyle. This is very useful information. Cheers.

1

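diff returns exit status 0 if the files are identical, 1 if they differ, and 2 if there was an error. You can use that to decide whether to run the rm command: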

diff file1 file2 && rm file2 
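In a script, the same idea can be spelled out with each exit status handled separately. This is a sketch rather than a definitive implementation: the function name `delete_if_identical` is made up for illustration, and cmp -s is swapped in for diff because it compares byte by byte and prints nothing, while following the same 0/1/2 exit-status convention.

```shell
#!/bin/bash
# delete_if_identical KEEP CANDIDATE  (hypothetical helper name)
# Removes CANDIDATE only when its contents match KEEP byte for byte.
delete_if_identical() {
    local keep=$1 candidate=$2
    # Refuse to act when both paths name the same file; removing the
    # "duplicate" would otherwise destroy the only copy.
    if [ "$keep" -ef "$candidate" ]; then
        echo "same file, skipping: $candidate" >&2
        return 2
    fi
    cmp -s -- "$keep" "$candidate"
    case $? in
        0) rm -- "$candidate" ;;    # identical: delete the copy
        1) return 1 ;;              # different: keep both files
        *) echo "cannot compare: $keep $candidate" >&2   # missing or unreadable
           return 2 ;;
    esac
}

# Usage: delete_if_identical ./dir_one/data.txt ./dir_two/data.txt
```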
+0

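Thanks, @Jim Garrison. This looks like what I'm after. Can you tell me whether this would also work when recursively comparing files in subdirectories, i.e. with the -r option? Cheers.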

+1

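You would need to run each comparison in its own process. I suspect that if you run it with '-r', the result will be zero if _all_ files are equal and non-zero if any of them differ.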
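One way to run each comparison in its own process is to loop over the files in one directory and compare each with the file at the same relative path in the other. A bash sketch under stated assumptions: the function name `remove_matching_copies` is hypothetical, only files with matching relative paths are compared, and the copy under the second directory is the one that gets deleted.

```shell
#!/bin/bash
# remove_matching_copies KEEP_DIR PRUNE_DIR  (hypothetical helper)
# For every regular file under KEEP_DIR, delete the file at the same
# relative path under PRUNE_DIR if the two have identical contents.
remove_matching_copies() {
    local keep_dir=${1%/} prune_dir=${2%/} file rel
    # Guard: comparing a directory with itself would delete every file.
    if [ "$keep_dir" -ef "$prune_dir" ]; then
        echo "refusing: both arguments are the same directory" >&2
        return 2
    fi
    # -print0 with read -d '' keeps names containing spaces or newlines intact.
    while IFS= read -r -d '' file; do
        rel=${file#"$keep_dir"/}
        # cmp -s prints nothing and exits 0 only when the contents match.
        if [ -f "$prune_dir/$rel" ] && cmp -s -- "$file" "$prune_dir/$rel"; then
            rm -- "$prune_dir/$rel"
            echo "removed $prune_dir/$rel"
        fi
    done < <(find "$keep_dir" -type f -print0)
}

# Usage: remove_matching_copies ./dir_one ./dir_two
```

Files that exist in only one directory, or whose contents differ, are left untouched.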

0

Here is a script I wrote recently. Run it from the directory you want to deduplicate. It moves all duplicates into a directory outside the one being cleaned:

#!/bin/bash 

# this script walks through all files in the current directory, 
# checks if there are duplicates (it compares only files with 
# the same size) and moves duplicates to $duplicates_dir. 
# 
# options: 
# -H deduplicate hidden files (and files in hidden folders) as well
# -n dry-run: show duplicates, but don't remove them 
# -z deduplicate empty files as well 

while getopts "Hnz" opts; do 
    case $opts in 
    H) 
     remove_hidden="yes";; 
    n) 
     dry_run="yes";; 
    z) 
     remove_empty="yes";; 
    esac 
done 

# support filenames with spaces: 
IFS=$(echo -en "\n\b") 

working_dir="$PWD" 
working_dir_name=$(echo $working_dir | sed 's|.*/||') 

# prepare some temp directories: 
filelist_dir="$working_dir/../$working_dir_name-filelist/" 
duplicates_dir="$working_dir/../$working_dir_name-duplicates/" 
if [[ -d $filelist_dir || -d $duplicates_dir ]]; then 
    echo "ERROR! Directories:" 
    echo " $filelist_dir" 
    echo "and/or" 
    echo " $duplicates_dir" 
    echo "already exist! Aborting." 
    exit 1 
fi 
mkdir $filelist_dir 
mkdir $duplicates_dir 

# get information about files: 
find . -type f -print0 | xargs -0 stat -c "%s %n" | \
    sort -nr > $filelist_dir/filelist.txt

if [[ "$remove_hidden" != "yes" ]]; then 
    grep -v "/\." $filelist_dir/filelist.txt > $filelist_dir/no-hidden.txt 
    mv $filelist_dir/no-hidden.txt $filelist_dir/filelist.txt 
fi 

echo "$(wc -l < $filelist_dir/filelist.txt)" \
    "files to compare in directory $working_dir"
echo "Creating file list..." 

# divide the list of files into sublists with files of the same size 
while read -r string; do
    # each line looks like "<size> ./path/to/file"
    number=${string%% *}     # size in bytes: everything before the first space
    filename=${string#* }    # path: everything after the first space
    echo "$filename" >> "$filelist_dir/size-$number.txt"
done < "$filelist_dir/filelist.txt"

# plough through the files 
for filesize in $(find $filelist_dir -type f | grep "size-"); do 
    if [[ -z $remove_empty && $filesize == *"size-0.txt" ]]; then 
     continue 
    fi 

    filecount=$(cat $filesize | wc -l) 
    # there are more than 1 file of particular size -> 
    # these may be duplicates 
    if [ $filecount -gt 1 ]; then 
     if [ $filecount -gt 200 ]; then 
      echo "" 
      echo "Warning: more than 200 files with filesize" \
       $(echo $filesize | sed 's|.*/||' | \
       sed 's/size-//' | sed 's/\.txt//') \
       "bytes."
      echo "Since every file needs to be compared with" 
      echo "every other file, this may take a long time." 
     fi 

     for fileA in $(cat $filesize); do 
      if [ -f "$fileA" ]; then 
       for fileB in $(cat $filesize); do 
        if [ -f "$fileB" ] && [ "$fileB" != "$fileA" ]; then 
         # diff will exit with 0 iff files are the same. 
         diff -q "$fileA" "$fileB" 2> /dev/null > /dev/null 
         if [[ $? == 0 ]]; then 
          # detect if one filename is a substring of another 
          # so that in case of foo.txt and foo(copy).txt 
          # the script will remove foo(copy).txt 
          # supports filenames with no extension. 

          fileA_name=$(echo $fileA | sed 's|.*/||') 
          fileB_name=$(echo $fileB | sed 's|.*/||') 
          fileA_ext=$(echo $fileA_name | sed 's/.[^.]*//' | sed 's/.*\./\./') 
          fileB_ext=$(echo $fileB_name | sed 's/.[^.]*//' | sed 's/.*\./\./') 
          fileA_name="${fileA_name%%$fileA_ext}" 
          fileB_name="${fileB_name%%$fileB_ext}" 

          if [[ $fileB_name == *$fileA_name* ]]; then 
           echo " $(echo $fileB | sed 's|\./||')" \
            "is a duplicate of" \
            "$(echo $fileA | sed 's|\./||')"
           if [ "$dry_run" != "yes" ]; then
            mv --backup=t "$fileB" $duplicates_dir
           fi
          else
           echo " $(echo $fileA | sed 's|\./||')" \
            "is a duplicate of" \
            "$(echo $fileB | sed 's|\./||')"
           if [ "$dry_run" != "yes" ]; then 
            mv --backup=t "$fileA" $duplicates_dir 
           fi 
          fi 
         fi 
        fi 
       done 
      fi 
     done 
    fi 
done 

rm -r $filelist_dir 

if [ "$dry_run" != "yes" ]; then 
    echo "Duplicates moved to $duplicates_dir." 
fi
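
The script compares every pair of same-size files, which is why it warns once a group exceeds 200 files. An alternative, which is not what the script above does, is to hash each file once and treat files with equal checksums as duplicates. A sketch assuming bash 4 or later (for associative arrays) and GNU sha256sum and sort; the function name `list_duplicates` is hypothetical, and it only reports duplicates without moving or deleting anything. Note that all empty files share one checksum, so they are reported as duplicates of each other.

```shell
#!/bin/bash
# list_duplicates DIR  (hypothetical helper)
# Prints "FILE is a duplicate of ORIGINAL" for each file under DIR whose
# SHA-256 checksum matches that of a file seen earlier in sorted order.
list_duplicates() {
    local dir=$1 file sum
    local -A first_seen=()   # checksum -> first file found with that checksum
    while IFS= read -r -d '' file; do
        sum=$(sha256sum < "$file") || continue   # skip unreadable files
        sum=${sum%% *}                           # keep only the hex digest
        if [[ -n ${first_seen[$sum]+x} ]]; then
            echo "$file is a duplicate of ${first_seen[$sum]}"
        else
            first_seen[$sum]=$file
        fi
    done < <(find "$dir" -type f -print0 | sort -z)
}

# Usage: list_duplicates .
```

Each file is read once, so the cost grows linearly with the number of files instead of quadratically. Before deleting anything based on this output, a final cmp of the two files confirms they really are identical.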