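The answers above rely on your text looking very much like your example. This code is a little more flexible and can match any number of emails in the text. I haven't documented it thoroughly, but...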
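harvest_emails takes a string of newline-separated lines, each comma-separated as in your example (date, message_string, identifier), and returns a generator that yields 3-tuples of the form (date, comma-sep-emails, identifier). It extracts any number of emails from the message text and matches any email of the form x@x.(com|org|net), where x is a non-empty run of non-whitespace characters.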

    import re

    def harvest_emails(target):
        """Take a string, split it on newlines, then yield each line as:
        (datecode, emails, identifier)
        """
        for line in target.splitlines():
            t = line.split(",")
            # Everything between the first and last field is message text.
            message = ''.join(t[1:-1]).strip()
            # Non-whitespace run, "@", non-whitespace run, then .com/.org/.net.
            emails = re.findall(r"\S+@\S+\.(?:com|org|net)", message, re.I)
            yield (t[0].strip(), ','.join(emails), t[-1].strip())

Example usage:

    >>> messages = """04:34:03 +0000 2013,Email me for tickets email me at [email protected],1708824644
    ... Tue Dec 17 04:33:58 +0000 2013,@musclepotential ok, man. you can email [email protected],25016561
    ... Tue Dec 17 04:34:03 +0000 2013, [email protected], [email protected],1708824644
    ... Tue Dec 17 04:33:58 +0000 2013, [email protected],25016561"""
    >>> data = list()
    >>> for line in harvest_emails(messages):
    ...     d = dict()
    ...     d["date"], d["emails"], d["id"] = line[0], line[1].split(','), line[2]
    ...     data.append(d)
    ...
    >>> for value in data:
    ...     print(value)
    ...
    {'emails': ['[email protected]'], 'date': '04:34:03 +0000 2013', 'id': '1708824644'}
    {'emails': ['[email protected]'], 'date': 'Tue Dec 17 04:33:58 +0000 2013', 'id': '25016561'}
    {'emails': ['[email protected]', '[email protected]'], 'date': 'Tue Dec 17 04:34:03 +0000 2013', 'id': '1708824644'}
    {'emails': ['[email protected]'], 'date': 'Tue Dec 17 04:33:58 +0000 2013', 'id': '25016561'}

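To see what the pattern does on its own, here is a minimal sketch. The sample text and its example.com addresses are made up for illustration, and the pattern is written as I read the description above (a non-whitespace run on each side of the @). A \w+ variant is included only for comparison, since \w does not match the dot in a name like joe.smith.

```python
import re

# Assumed pattern: non-whitespace run, "@", non-whitespace run, then .com/.org/.net.
EMAIL_RE = re.compile(r"\S+@\S+\.(?:com|org|net)", re.I)

# Comparison pattern using \w+, which stops at dots in the local part.
WORD_RE = re.compile(r"\w+@\w+\.(?:com|org|net)", re.I)

# Hypothetical sample text; every address here is invented for the example.
text = "write to joe.smith@example.com or ADMIN@EXAMPLE.ORG, not bob@example.io"

found = EMAIL_RE.findall(text)
truncated = WORD_RE.findall(text)

print(found)      # ['joe.smith@example.com', 'ADMIN@EXAMPLE.ORG']
print(truncated)  # ['smith@example.com', 'ADMIN@EXAMPLE.ORG']
```

Because \S+ is greedy and accepts any non-whitespace character, it will also pick up surrounding punctuation such as an opening bracket, so treat this as a rough harvester rather than a validator.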
What does the input look like? – inspectorG4dget
I'm pretty sure '\w+' isn't good enough. What about 'joe.smith@gmail.com'? – mgilson