Removing duplicate emails

I want to use a regular expression in Scrapy to find all email addresses on a page.

I am using this code:

item["email"] = re.findall('[\w\.-]+@[\w\.-]+', response.body)

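It works almost perfectly: it grabs all the emails and gives them to me. However, what I want is for each address to appear only once before the item is actually parsed, even when the same email address occurs several times on the page.

I am getting a response like this (which is correct):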

{'email': ['[email protected]', 
      '[email protected]', 
      '[email protected]', 
      '[email protected]', 
      '[email protected]']}

However, I would like to show only the unique addresses, which would be:

{'email': ['[email protected]', 
      '[email protected]', 
      '[email protected]']}

Additionally, if you want to throw in how to collect only the emails, and not things like

'[email protected]'

that would be helpful too.

Thank you all!

Source

2016-04-15 Max Uland

Why are you using a regex to parse the response? It seems like it might be better suited to XPath or CSS selectors. Parsing HTML with a regex is usually a bad idea.

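Because this is used in a broad crawler, the data will be stored in different places on each site, so no single XPath will work.

Here is how you can get rid of the dupes and of things like '[email protected]' in your output: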

import re
p = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['[email protected]',\n   '[email protected]',\n   '[email protected]',\n   '[email protected]',\n   '[email protected]']}" 
print(set(p.findall(test_str)))

See the Python demo.

The regex looks like

[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^

See the demo.

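The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) rejects every match whose part after the @ ends in an extension such as png or jpg (\b is a word boundary, and in this case it is a trailing word boundary).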
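To see the lookahead at work, here is a minimal sketch run against a made-up string. The addresses and the file name `logo@2x.png` are hypothetical sample data, not taken from the crawled page.

```python
import re

# Same pattern as in the answer: the negative lookahead right after "@"
# rejects candidates whose remainder ends in an image extension.
EMAIL_RE = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')

# Hypothetical sample text: two address-like strings (one repeated)
# and one retina image file name that merely contains "@".
sample = "jane@example.com logo@2x.png bob@example.org jane@example.com"

matches = EMAIL_RE.findall(sample)  # the image name is skipped, the repeat is kept
unique = set(matches)               # the set drops the repeated address

print(matches)
print(sorted(unique))
```

The list still contains the repeated address; only the conversion to a set removes it.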

The dupes can be removed easily with a set; that is the least troublesome part.

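Final solution:

item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))

Source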

2016-04-15 23:40:20

Nice touch with the '(?:png|jpe?g|gif)' – idjaw

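Not sure why, but when I use this code it does not give me any emails; removing the duplicates only works with item["email"] = set(re.findall('[\w\.-]+@[\w\.-]+', response.body)). I would be very interested to know why nothing shows up in my results, since I followed that demo page (AWESOME BTW) and it works as expected there :/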

Sorry, I have added the 'r' prefix to mark the string as a raw string literal. Now '\b' is treated as a word boundary rather than a backspace character. Use 'item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))'
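The raw-string point can be checked directly. The following is a small sketch with a hypothetical sample string and a simplified pattern, chosen only to isolate the effect of the `r` prefix on `\b`.

```python
import re

text = "write to bob@example.org today"  # hypothetical sample text

# Without the r prefix, Python turns "\b" into a backspace character (0x08)
# before the regex engine ever sees the pattern.
print(len('\b'), len(r'\b'))

no_raw = re.findall('@[a-z.]+\b', text)     # needs a literal backspace after the domain
with_raw = re.findall(r'@[a-z.]+\b', text)  # \b is a word boundary

print(no_raw)
print(with_raw)
```

Since the sample text contains no backspace character, the first call finds nothing, which matches the "no emails at all" symptom described in the comment above.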

item["email"] = set(re.findall('[\w\.-]+@[\w\.-]+', response.body))

Source

2016-04-15 23:38:27

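Extra brownie points for ignoring 'footer-stanford-logo@2x.png'. :) +1 though – idjaw

There is no need to escape the '.' inside a character class. And it really won't help with those PNGs. If this answer or Thomas' is accepted, this question will be a duplicate of [Returning unique matches using regex in Python](http://stackoverflow.com/questions/32083145/returning-unique-matches-using-regex-in-python). @idjaw: check my answer, I suggest a way to ignore the PNGs.

Thank you Wiktor. If it is a dupe I am very sorry. I don't fully understand regex, so if this was already answered I apologize; I must not have understood it.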

Couldn't you just use a set instead of a list?

item["email"] = set(re.findall('[\w\.-]+@[\w\.-]+', response.body))

If you really want a list, then:

item["email"] = list(set(re.findall('[\w\.-]+@[\w\.-]+', response.body)))

Source
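One thing to keep in mind is that a set does not preserve the order in which the addresses were found. If the order matters, a possible alternative (not part of the answer above) is `dict.fromkeys`, which keeps the first occurrence of each key on Python 3.7 and later. The page text below is hypothetical sample data.

```python
import re

# Hypothetical page text with one repeated address.
body = "a@example.com b@example.com a@example.com c@example.com"

found = re.findall(r'[\w.-]+@[\w.-]+', body)

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping first-seen order.
unique_in_order = list(dict.fromkeys(found))

print(unique_in_order)
```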

2016-04-15 23:38:57

Set is correct!
