
I have a folder containing *.txt files. I would like to check these files periodically for duplicate URLs.

Actually, I save my bookmarks in these files, and each entry always has at least two lines, like:

www.domain.com 
Quite a popular domain name 

It happens that I have saved the same URL with a different description, like:

www.domain.com 
I should buy this domain 
Whenever I happen to have enough money for this 

All entries are separated by single blank lines. And sometimes the URLs are in markdown format:

[domain.com](www.domain.com) 

How would I crawl the folder for duplicate URLs?

The only solution I have found so far is combining cat with a pipe through sort and uniq:

cat folder/* | sort | uniq > dupefree.txt 

The problems with this are:

  1. It only checks for completely identical lines, so markdown-formatted URLs are not recognized and the attached comments are lost
  2. I don't want to output a cleaned-up text file, I only want to be told which URLs are duplicates

How can I do a proper duplicate check?

Answer


Here is the source file I created from your description:

cat file 

www.domain.com 
Quite a popular domain name 

www.domain.com 
I should buy this domain 
Whenever I happen to have enough money for this 

[domain.com](www.domain.com) 

Use awk to find the duplicate domains:

awk 'BEGIN{FS="\n";RS=""}                              # blank-line-separated records, one line per field
{ if ($1~/\[/) { split($1,a,"[)(]"); domain[a[2]]++}   # markdown entry: take the URL inside the parentheses
    else {domain[$1]++}                                # plain entry: the first line is the URL
} 
END{ for (i in domain) 
     if (domain[i]>1) print "Duplicate domain found: ",i 
    }' file 

Duplicate domain found: www.domain.com
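
The above only looks at a single file. Since the bookmarks are spread across a folder, the same approach can be run over all the text files at once. The following is just a sketch, assuming the files live under folder/ as in the question; it also uses awk's built-in FILENAME variable to report which files each duplicate appears in:

awk 'BEGIN{FS="\n";RS=""}                                  # one record per blank-line-separated entry 
{ url = $1 
  if (url ~ /\[/) { split(url, a, "[)(]"); url = a[2] }    # unwrap [text](url) markdown links 
  count[url]++ 
  seen[url] = seen[url] " " FILENAME                       # remember every file the URL shows up in 
} 
END{ for (u in count) 
      if (count[u] > 1) print "Duplicate URL found:", u, "(in" seen[u] ")" 
    }' folder/*.txt 

Note that a file name appears in the list once per occurrence, so a URL bookmarked twice in the same file will list that file twice.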