Extracting emails from a data dump using Python

I have a data dump from which I am trying to extract all the email addresses.

Here is the code I have written so far, using BeautifulSoup:

import urllib2 
import re 
from bs4 import BeautifulSoup 
url = urllib2.urlopen("file:///users/home/Desktop/emails.html").read() 
soup = BeautifulSoup(url) 
email = raw_input(soup) 
match = re.findall(r'&lt;(.*?)&gt;', email) 
if match: 
    print match

Sample data dump

<tr><td><a href="http://abc.gov.com/comments/24-April/file.html">for educational purposes only</a></td> 
<td>7418681641 &lt;[email protected]&gt;</td> 
<td>[email protected]</td> 
<td nowrap="">24-04-2015 10.31</td> 
<td align="center">&nbsp;</td></tr> 
<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">no_subject</a></td> 
<td>John &lt;[email protected]&gt;</td> 
<td>[email protected]</td> 
<td nowrap="">24-04-2015 11.28</td> 
<td align="center">&nbsp;</td></tr> 
<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">something</a></td> 
<td>Mark &lt;[email protected]&gt;</td> 
<td>[email protected]</td> 
<td nowrap="">24-04-2015 11.28</td> 
<td align="center">&nbsp;</td></tr> 
<tr><td><a href="http://abc.gov.com/comments/24-April/abc.html">some data</a></td>

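I can clearly see that each email address sits between a < and a >. I am trying to use a regular expression to identify all the addresses and print them. However, when I run the code, the whole file is printed instead of just the extracted addresses (one per line).

How can I fix this?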

2016-06-07 Piyush

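I don't understand your code at all. Why are you using `urllib2` to open a local file? Just use `with open("/path/to/file.html") as f: soup = BeautifulSoup(f)`. Next, what do you expect `raw_input(soup)` to do? Finally, when you have already started using an HTML parser, why run a regex search over the text? – MattDMo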

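@MattDMo: Ah yes, you are right, sir. The file can simply be opened. I didn't know that raw_input takes input from the user; I assumed it would convert the soup variable into a string. Without the raw_input line I get an error saying that re.findall expects a string as its second argument. – Piyush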
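A minimal sketch of what the comments above suggest, written for Python 3 (the question's code is Python 2): reading the local file with `open` already yields a string that `re.findall` accepts, so neither `urllib2` nor `raw_input` is needed. The temporary file stands in for `emails.html`, and the addresses `john@example.com` and `mark@example.com` are invented, because the addresses in the sample dump are obfuscated.

```python
import os
import re
import tempfile

# Hypothetical stand-in for emails.html; the addresses are invented.
SAMPLE = (
    '<tr><td>John &lt;john@example.com&gt;</td>'
    '<td nowrap="">24-04-2015 11.28</td></tr>\n'
    '<tr><td>Mark &lt;mark@example.com&gt;</td>'
    '<td nowrap="">24-04-2015 11.28</td></tr>\n'
)


def extract_bracketed(path):
    """Return everything found between &lt; and &gt; in the raw file text."""
    with open(path, encoding="utf-8") as f:
        text = f.read()  # a plain str, which is what re.findall expects
    return re.findall(r'&lt;(.*?)&gt;', text)


# Write the sample to a temporary file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(SAMPLE)

emails = extract_bracketed(tmp.name)
os.remove(tmp.name)

for email in emails:
    print(email)  # one address per line
```

In Python 2, `raw_input(soup)` uses its argument as the prompt, so it writes the whole document to the screen and then waits for typed input, which would explain the output described in the question.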

-1

You can use BeautifulSoup's find_all method to narrow the parse down to the tags you are looking for. Here is the code. (I saved the sample as the file a.html.)

from bs4 import BeautifulSoup
url = open("a.html", 'r').read()
soup = BeautifulSoup(url, "html.parser")
rows = soup.find_all('tr')  # find all rows using tag 'tr'
for row in rows:
    cols = row.find_all('td')  # find all columns using 'td' tag
    if len(cols) > 1:
        # get the text of the second element of the list (contains the email id element)
        email_id_string = cols[1].text
        # get only the email id between < and >
        email_id = email_id_string[email_id_string.find("<") + 1 : email_id_string.find(">")]
        print(email_id)

2016-06-07 16:44:37 Tanu

This doesn't work, because there are many `<td>` elements that do not contain an email address. – MattDMo

If the email ID exists, it is present in the second column, which is why I check it with an `if` condition. – Tanu

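No, you don't. You only check whether there is more than one `<td>` element per `<tr>`; if so, you take the second td and assume it contains an email address, which is not a valid assumption for arbitrary HTML. The OP posted a very simple example, but I imagine the real data is not as well structured as this. Your solution needs to be *much* more robust than it currently is. – MattDMo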
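One way to address the objection above is to look inside every cell for something that resembles an address instead of trusting a column position. The sketch below is only an illustration under several assumptions: it swaps in the standard-library `HTMLParser` rather than BeautifulSoup so that it runs without extra packages, the class name `CellEmailParser` and the pattern `EMAIL_RE` are made up for this example, and the addresses are invented. It also collects every address in any cell, which may be more than is wanted if a row lists both sender and recipient.

```python
import re
from html.parser import HTMLParser

# Deliberately simple pattern for illustration; not a full address validator.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


class CellEmailParser(HTMLParser):
    """Collect email-like strings found inside any <td> cell."""

    def __init__(self):
        super().__init__()  # character references such as &lt; are decoded
        self.in_cell = False
        self.cell_text = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cell_text = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            # Keep only what looks like an address; cells without one add nothing.
            self.emails.extend(EMAIL_RE.findall("".join(self.cell_text)))


# Hypothetical rows: one with a bracketed address, one without any address,
# and one where the address is not in the second column.
SAMPLE = (
    '<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">no_subject</a></td>'
    '<td>John &lt;john@example.com&gt;</td>'
    '<td nowrap="">24-04-2015 11.28</td>'
    '<td align="center">&nbsp;</td></tr>'
    '<tr><td>no address in this row</td><td>just text</td></tr>'
    '<tr><td>mark@example.com</td></tr>'
)

parser = CellEmailParser()
parser.feed(SAMPLE)
parser.close()
found = parser.emails

for email in found:
    print(email)
```

The same per-cell check could be applied to the `cols` list in the answer's loop; the point is only that the decision is based on the cell's content, not its index.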

Your example actually works:

re.findall(r'\&lt;(.*?)\&gt;',your_data_bump)= 
['[email protected]', '[email protected]', '[email protected]']

2016-06-08 08:39:08 marmouset

Thanks, this actually works. I just changed that line to match = re.findall(r'<(.*?)>', str(email)) and then printed the values. – Piyush
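The two patterns mentioned in this thread target two forms of the same text, and which one applies depends on whether the character entities have been decoded yet. The sketch below illustrates the difference under stated assumptions: `html.unescape` stands in for whatever decoding a parser performs when it returns cell text, and the address is invented.

```python
import html
import re

# Raw markup as it appears in the file; the address is invented.
raw = '<td>John &lt;john@example.com&gt;</td>'

# On raw markup the brackets are still entities, so match the entities.
from_raw = re.findall(r'&lt;(.*?)&gt;', raw)

# Once the cell text is decoded, it contains literal < and >.
decoded = html.unescape('John &lt;john@example.com&gt;')
from_decoded = re.findall(r'<(.*?)>', decoded)

# The literal pattern applied to raw markup picks up the tags instead.
tags = re.findall(r'<(.*?)>', raw)

print(from_raw)
print(from_decoded)
print(tags)
```

If the literal pattern returns tag names rather than addresses, the input is probably still raw markup and the entity pattern is the one to use.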
