
I have a folder containing *.txt files. I would like to check these files periodically for duplicate URLs.

Actually, I save my bookmarks in these files, and each entry always has at least two lines, like:

www.domain.com 
Quite a popular domain name 

It happens that I have saved the same URL with a different description, like:

www.domain.com 
I should buy this domain 
Whenever I happen to have enough money for this 

All entries are separated by single blank lines. And sometimes the URLs are in markdown format:

[domain.com](www.domain.com) 

How would I crawl the folder for duplicate URLs?

The only solution I have found so far is combining cat with a pipe through sort and uniq:

cat folder/* | sort | uniq > dupefree.txt 

The problems with this are:

  1. It only checks for completely identical lines, so markdown-formatted URLs are not recognized and the attached comments are lost
  2. I don't want to output a cleaned-up text file, I only want to be told which URLs are duplicates

How can I do a proper duplicate check?

Answer


Here is the source file I created from your description:

cat file 

www.domain.com 
Quite a popular domain name 

www.domain.com 
I should buy this domain 
Whenever I happen to have enough money for this 

[domain.com](www.domain.com) 

Use awk to find the duplicate domains:

awk 'BEGIN{FS="\n";RS=""}                              # blank-line-separated records, one line per field
{ if ($1~/\[/) { split($1,a,"[)(]"); domain[a[2]]++}   # markdown entry: take the URL inside the parentheses
    else {domain[$1]++}                                # plain entry: the first line is the URL
} 
END{ for (i in domain) 
     if (domain[i]>1) print "Duplicate domain found: ",i 
    }' file 

Duplicate domain found: www.domain.com
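
The above only looks at a single file. Since the bookmarks are spread across a folder, the same approach can be run over all the text files at once. The following is just a sketch, assuming the files live under folder/ as in the question; it also uses awk's built-in FILENAME variable to report which files each duplicate appears in:

awk 'BEGIN{FS="\n";RS=""}                                  # one record per blank-line-separated entry 
{ url = $1 
  if (url ~ /\[/) { split(url, a, "[)(]"); url = a[2] }    # unwrap [text](url) markdown links 
  count[url]++ 
  seen[url] = seen[url] " " FILENAME                       # remember every file the URL shows up in 
} 
END{ for (u in count) 
      if (count[u] > 1) print "Duplicate URL found:", u, "(in" seen[u] ")" 
    }' folder/*.txt 

Note that a file name appears in the list once per occurrence, so a URL bookmarked twice in the same file will list that file twice.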