Extracting URLs from an email in Python

Thank you for your submission to ourdirectory.com
URL: http://myurlok.us
Please click the link below to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076

Once we receive your comfirmation, your site will be included for process! 
regards, 

http://www.ourdirectory.com 

Thank you!

It should be obvious which URL I need to extract.

2009-11-24 Demon Labs

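I'd love to know why you're doing this. What is it for? – 2009-11-24 19:32:00

Surely an auto-submission bot for some external site. – 2009-11-24 19:33:14

Yes, I just love having thousands of emails that need confirming. – 2009-11-24 19:33:52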

This method only works if the source is not HTML.

def extractURL(self, fileName):
    # written as a method; drop `self` to use it as a plain function
    urlList = []

    # open the file containing the email; `with` closes it when done
    with open(fileName) as file:
        for line in file:
            # split() with no argument splits on any whitespace,
            # so the trailing newline never ends up inside a word
            for word in line.split():
                # split on the first ":" only, so URLs that contain
                # further colons (e.g. a port number) are still found
                tempWord = word.split(":", 1)
                # check whether the word is a URL
                if len(tempWord) == 2 and tempWord[0] in ("http", "https"):
                    urlList.append(word)

    return urlList

2009-11-25 09:12:00 apocolyp4

Not easy. One suggestion (taken from the RegexBuddy library):

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

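This will match URLs (but not mailto: links; say so if you want those), even when they are enclosed in parentheses. It also matches URLs that lack http:// or ftp:// as long as they start with www. or ftp.

A simpler version: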

\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

It all depends on what your needs are and what your input looks like.
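Both patterns list only uppercase letters (A-Z) in their character classes, so in Python they need to be compiled case-insensitively. Here is a minimal sketch of applying the simpler pattern with the standard `re` module; the names `SIMPLE_URL` and `find_urls` and the sample text are made up for illustration.

```python
import re

# The simpler pattern from above. Its character classes only list A-Z,
# so re.IGNORECASE is needed to match lowercase URLs.
SIMPLE_URL = re.compile(
    r"\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

def find_urls(text):
    """Return every substring of text that the pattern matches, in order."""
    return SIMPLE_URL.findall(text)

# Sample input modelled on the email in the question.
sample = (
    "URL: http://myurlok.us\n"
    "Please click the link below to confirm your submission.\n"
    "http://www.ourdirectory.com/confirm.aspx?id=1247778154270076\n"
)
for url in find_urls(sample):
    print(url)
```

Because the last character class leaves out punctuation such as `.` and `,`, a URL that ends a sentence is returned without the trailing full stop.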

2009-11-24 19:33:33

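I don't think you need to get that fancy. I think he wants to parse a very specific email from a very specific source, so he can match the exact string "http://www.ourdirectory.com/confirm.aspx?id=" followed by digits and the end of the line. – 2009-11-24 19:35:24

Probably. There is another URL in there, though (myurlok.us). Who knows what else might turn up; he wasn't very specific. – 2009-11-24 19:37:21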

Regex:

"http://www\.ourdirectory\.com/confirm\.aspx\?id=[0-9]+$"

Or, without a regex, parse the email line by line and test whether the line contains "http://www.ourdirectory.com/confirm.aspx?id="; if it does, that's your URL.

Of course, if your input is actually the HTML source rather than the text you posted, all of this goes out the window.
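Both suggestions can be sketched in a few lines of Python. This assumes the email is plain text with the confirmation link at the end of a line, as in the question; the function names and the `PREFIX` constant are made up for this example. `re.MULTILINE` makes `$` match at the end of each line rather than only at the end of the whole message.

```python
import re

PREFIX = "http://www.ourdirectory.com/confirm.aspx?id="

# re.escape() escapes the dots and the "?" so the prefix is matched literally.
CONFIRM_RE = re.compile(re.escape(PREFIX) + r"[0-9]+$", re.MULTILINE)

def confirm_url_regex(text):
    """Regex version: return the confirmation URL, or None if there is none."""
    match = CONFIRM_RE.search(text)
    return match.group(0) if match else None

def confirm_url_plain(text):
    """No-regex version: scan line by line for the known prefix."""
    for line in text.splitlines():
        start = line.find(PREFIX)
        if start != -1:
            # keep everything from the prefix to the end of the line
            return line[start:].strip()
    return None
```

One difference worth knowing: if a line carries trailing whitespace after the digits, the `$` anchor stops the regex version from matching, while the plain version strips the whitespace and still returns the URL.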

2009-11-24 19:39:44

If it's an HTML email with hyperlinks, you can use the HTMLParser library as a shortcut.

from html.parser import HTMLParser  # Python 3; in Python 2 the module is named HTMLParser

class parseLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # only anchor tags carry the links we want
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)                     # the link target
                    print(self.get_starttag_text())  # the full <a ...> tag

someHtmlContainingLinks = ""
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)

2009-11-24 19:42:22

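His sample document doesn't look like HTML to me :-) – Suppressingfire 2009-11-25 00:54:27

Everyone else had provided non-HTML solutions, and the OP's question history suggests he is pulling this out of gmail, which supports HTML. Given the vagueness of the question, I think this is a valid response. – 2009-11-30 14:45:31

@OP, if your emails always follow the same standard format,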

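with open("emailfile") as f:
    for line in f:
        if "confirm your submission" in line:
            # the confirmation URL is on the line that follows;
            # the "" default avoids an error if this was the last line
            print(next(f, "").strip())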

2009-11-25 00:51:10 ghostdog74

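Take a look at this.

I wrote a post on the same topic. The code in that post can extract URLs from an email file whether the content type is plain text or HTML, and whether the transfer encoding is quoted-printable, base64, or 7bit.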

Python - How to extract URLs (plain/html, quote-printable/base64/7bit) from an email file
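The linked post's code is not reproduced here. As a rough sketch of the same idea, and not the post's actual code, Python 3's standard `email` package can undo the transfer encoding before the text is searched for URLs. The function name `urls_from_email` and the deliberately loose `URL_RE` pattern are assumptions made for this example.

```python
import email
import re
from email import policy

# Loose illustrative pattern: a URL runs until whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def urls_from_email(raw_bytes):
    """Return the URLs found in the text/plain and text/html parts of a message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    urls = []
    for part in msg.walk():
        if part.get_content_type() in ("text/plain", "text/html"):
            # get_content() reverses the transfer encoding (base64,
            # quoted-printable, 7bit) and decodes the charset to str
            text = part.get_content()
            for url in URL_RE.findall(text):
                if url not in urls:  # keep first occurrence only
                    urls.append(url)
    return urls
```

Typical use would be `urls_from_email(open("emailfile", "rb").read())`, where `emailfile` is a raw message saved to disk. For HTML parts this simply scans the markup as text; it does not parse the tags.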

2015-10-28 00:16:27 gixxer
