Extracting URLs from XML in Python


I read the thread about extracting URLs from a string. https://stackoverflow.com/a/840014/326905 is really nice; with it I got all the URLs from an XML document containing http://www.blabla.com using:

>>> import re
>>> s = '''<link href="http://www.blabla.com/blah" />
...      <link href="http://www.blabla.com" />'''
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']

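But I can't figure out how to adjust the regex so that the double quote at the end of each URL is left out.

At first I thought this was the answer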

re.findall(r'(https?://\S+\")', s)

or this

re.findall(r'(https?://\S+\Z")', s)

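but neither is.

Can someone help me and tell me how to omit the double quote at the end?

Btw, the question mark after the "s" in https means that the "s" may or may not occur. Am I right?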


2013-03-21 surfi

Never ever ever ever parse HTML with regex: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – That1Guy 2013-03-21 14:40:49

You should also read the thread [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Abhijit 2013-03-21 14:41:42

If you use an HTML parser such as BeautifulSoup, this problem is easier to solve than with regular expressions. – 2013-03-21 14:41:47

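You are already using a character class (albeit a shorthand one). I would suggest modifying the character class slightly so that you don't need a lookahead. Just make the quote part of the character class: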

re.findall(r'(https?://[^\s"]+)', s)

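This still says "one or more characters that are not whitespace", but additionally excludes the double quote. So the overall expression is "one or more characters that are neither whitespace nor a double quote".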
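To make the difference concrete, here is a small sketch that runs both patterns on the sample from the question. The variable names `with_quote` and `without_quote` are only illustrative and do not appear in the thread.

```python
import re

# Sample input taken from the question.
s = '''<link href="http://www.blabla.com/blah" />
<link href="http://www.blabla.com" />'''

# \S+ stops only at whitespace, so the closing quote is swallowed.
with_quote = re.findall(r'(https?://\S+)', s)

# [^\s"]+ stops at whitespace or at a double quote, whichever comes first.
without_quote = re.findall(r'(https?://[^\s"]+)', s)

print(with_quote)     # ['http://www.blabla.com/blah"', 'http://www.blabla.com"']
print(without_quote)  # ['http://www.blabla.com/blah', 'http://www.blabla.com']
```

If the attributes might use single quotes as well, the class could presumably be widened to something like `[^\s"']+`; that is an assumption about the input, not something the thread covers.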


2013-03-21 15:06:50

You want the double quote to be matched as a lookahead:

re.findall(r'(https?://\S+)(?=\")', s)

This way the quotes won't be included in the match. Also, yes, the ? means the preceding character is optional.

See the example here: http://regexr.com?347nk
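A quick way to check both points locally is the sketch below. The input string is a variation of the question's sample in which the second URL is changed to https purely for illustration.

```python
import re

# Hypothetical input: same shape as the question's sample, second URL uses https.
s = '<link href="http://www.blabla.com/blah" /> <link href="https://www.blabla.com" />'

# (?=") asserts that a double quote follows, without consuming it,
# so \S+ backtracks until the next character is the quote.
urls = re.findall(r'(https?://\S+)(?=")', s)

# "s?" matches zero or one "s", so both "http" and "https" are found.
schemes = re.findall(r'https?', 'http https htt')

print(urls)     # ['http://www.blabla.com/blah', 'https://www.blabla.com']
print(schemes)  # ['http', 'https']
```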


2013-03-21 14:42:49 Daedalus

Thanks. I just read this: https://stackoverflow.com/a/13057368/326905

and checked that the following also works.

re.findall(r'"(https?://\S+)"', urls)
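One caveat worth knowing: `\S+` is greedy, so if another double quote follows the URL with no whitespace in between, the capture can run past the attribute value. The sketch below uses a made-up input (`tricky`) to show this, and a lazy quantifier `\S+?` as one possible way around it.

```python
import re

# Hypothetical input: quoted text follows the tag with no whitespace in between.
tricky = '<a href="http://a.example/x">"quoted"</a>'

# Greedy: \S+ runs to the end and backtracks only to the LAST double quote.
greedy = re.findall(r'"(https?://\S+)"', tricky)

# Lazy: \S+? stops at the first double quote after the URL.
lazy = re.findall(r'"(https?://\S+?)"', tricky)

print(greedy)  # ['http://a.example/x">"quoted']
print(lazy)    # ['http://a.example/x']
```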


2013-03-21 14:46:24 surfi

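Yes, but if the text contains a URL followed by other characters such as ">", this will not work properly. For example, for the text "asd http://www.blabla.com> asdf" it returns ['http://www.blabla.com>'], which is wrong! – 2013-03-21 14:57:05

That won't happen. This is valid XML, but thanks. – surfi 2013-03-21 15:19:44

I use this code to extract URLs from text: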

import re

url_rgx = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# `text` is the input string; convert it to lower case
text = text.lower()
# findall returns one tuple per match; the full URL is the first group
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missing
urls = ['http://%s' % url[0] if not url[0].startswith('http') else url[0] for url in matches]
print(urls)

It works great!


2013-03-21 14:46:31

>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
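If installing lxml is not an option, the same idea can be sketched with the standard library's `xml.etree.ElementTree`. This assumes the input is a well-formed XML document with a single root; the `<feed>` wrapper and the names `xml_text` and `hrefs` below are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document: the two <link> elements from the
# question wrapped in an assumed <feed> root element.
xml_text = '''<feed>
  <link href="http://www.blabla.com/blah" />
  <link href="http://www.blabla.com" />
</feed>'''

root = ET.fromstring(xml_text)

# iter('link') walks every <link> element at any depth;
# get('href') returns None when the attribute is absent, so filter those out.
hrefs = [el.get('href') for el in root.iter('link') if el.get('href')]

print(hrefs)  # ['http://www.blabla.com/blah', 'http://www.blabla.com']
```

Because the parser handles the quoting, no regex tuning for the trailing double quote is needed. If the real document uses XML namespaces, the tag name passed to `iter` would need the namespace prefix in `{uri}link` form.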


2013-03-21 15:09:25 Drover
