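Extract URLs from a block of text?

I have a large block of text and want to parse out all the URLs in it, returning a list of the URLs that follow this pattern: https://www.facebook.com/.*$

Here is a sample of the text I want to parse: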

<abbr title="Monday xxxx" data-utime="xx" class="timestamp">over a year ago</abbr></div></div></div></div></div></li><li class="fbProfileBrowserListItem"><div class="clearfix _5qo4"><a class="_8o _8t lfloat" href="https://www.facebook.com/xxxxx?fref=pb&amp;hc_location=profile_browser" tabindex="-1" aria-hidden="true" data-hovercard="/ajax/hovercard/user.php?id=xxxx&amp;extragetparams=%7B%22hc_location%22%3A%22profile_browser%22%7D"><img class="_s0 _rw img" src="https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg" alt=""></a><div class="clearfix _42ef"><div class="_6a rfloat"><div class="_6a _6b" style="height:50px"></div><div class="_6a _6b"><div class="_5t4x"><div class="FriendButton" id="u_2h_1w"><button class="_42ft _4jy0 FriendRequestAdd addButton _4jy3 _517h" type="button">

And I want to get "https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser".

This is what I tried:

from bs4 import BeautifulSoup

html = open('full_page_firefox.html')

def getLinks(html):
    soup = BeautifulSoup(html)
    anchors = soup.findAll('a')
    links = []
    for a in anchors:
        links.append(a['href'])
    return links

print(getLinks(html))

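Splitting does not seem to work either, because it does not keep the pattern. So if I use something like "https://www.facebook.com/.*$" with re.split() or similar to get the URLs, it does not work.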


2013-11-25 goldisfine

Hopefully this blog post helps with this: http://samranga.blogspot.com/2015/08/web-scraping-beginner-python.html

Your code works here. Check your input file and make sure Beautiful Soup can parse it.
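
As one way to check the input file independently of Beautiful Soup, here is a minimal Python 3 sketch that uses only the standard library's html.parser to list the href of every <a> tag. HrefCollector and collect_hrefs are hypothetical names chosen for illustration, and the sample string is a shortened copy of the snippet from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; entity references such as
        # &amp; in attribute values are already decoded by the parser.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def collect_hrefs(markup):
    """Return the href values of all <a> tags in the given markup string."""
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


# Shortened copy of the markup from the question (illustrative only).
sample = (
    '<li class="fbProfileBrowserListItem"><a class="_8o _8t lfloat" '
    'href="https://www.facebook.com/xxxxx?fref=pb&amp;hc_location=profile_browser" '
    'tabindex="-1"><img class="_s0 _rw img" '
    'src="https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg" alt=""></a></li>'
)

print(collect_hrefs(sample))
```

If this lists the expected link for the sample but nothing for the contents of full_page_firefox.html, the saved file may not contain the markup you expect.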

By the way, you could also consider using lxml:

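from lxml import etree
# HTMLParser tolerates unclosed tags such as <img>, which the default XML parser rejects.
tree = etree.parse('full_page_firefox.html', etree.HTMLParser())
print(tree.xpath('//a/@href | //img/@src'))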

['https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser', 
'https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg']
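
The list above contains the image source as well as the link, so it still has to be narrowed down to the www.facebook.com URLs the question asks for. A minimal Python 3 sketch of one way to do that, comparing the parsed scheme and host instead of using a regular expression; keep_host is a hypothetical helper name, not part of lxml:

```python
from urllib.parse import urlparse


def keep_host(urls, host="www.facebook.com"):
    """Return only the URLs whose scheme is https and whose host equals `host`."""
    kept = []
    for url in urls:
        parts = urlparse(url)
        if parts.scheme == "https" and parts.netloc == host:
            kept.append(url)
    return kept


# The two values shown in the output above.
urls = [
    "https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser",
    "https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg",
]

print(keep_host(urls))
```

Comparing the host exactly also rejects look-alike hosts that merely start with www.facebook.com, which a plain prefix test would let through.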


2013-11-25 02:55:05

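Your function works. I copied the HTML snippet you provided into an HTML file and added <html> and <body> tags for good measure.

Then I tried:

with open('C:/users/brian/desktop/html.html') as html:
    print(getLinks(html))

in the Python interpreter and got the following output:

[u'https://www.facebook.com/xxxxx?fref=pb&hc_location=profile_browser']

Call str() on each item and you're good.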


2013-11-25 03:01:01 Totem

You can check each URL against this pattern after parsing with BeautifulSoup, as follows:

from bs4 import BeautifulSoup
import re

# Escape the dots so they match a literal '.' rather than any character.
FACEBOOK_URL = re.compile(r'https://www\.facebook\.com/.*$')

def getLinks(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips anchors that have no href attribute.
    for a in soup.find_all('a', href=True):
        if FACEBOOK_URL.match(a['href']) is not None:
            links.append(a['href'])
    return links

with open('full_page_firefox.html') as html:
    print(getLinks(html))

Notes: 1. There is no space between '/' and '.' in the pattern. 2. '$' matches the end of the string, so use it with care.
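
On the re.split() attempt from the question: re.split() returns the text between the matches and drops the matches themselves (unless the pattern has capturing groups), which is why the URLs disappeared. If you want to search the raw text directly, re.findall() returns the matches instead. A minimal Python 3 sketch, assuming the URLs never contain quotes, whitespace or angle brackets; FACEBOOK_URL and find_facebook_urls are hypothetical names:

```python
import html
import re

# Dots are escaped so they match a literal "."; the character class stops at
# a quote, whitespace or angle bracket instead of relying on "$".
FACEBOOK_URL = re.compile(r'https://www\.facebook\.com/[^\s"\'<>]*')


def find_facebook_urls(text):
    """Return every matching URL in `text`, with HTML entities decoded."""
    return [html.unescape(match) for match in FACEBOOK_URL.findall(text)]


# Shortened copy of the markup from the question (illustrative only).
sample = (
    '<a class="_8o _8t lfloat" '
    'href="https://www.facebook.com/xxxxx?fref=pb&amp;hc_location=profile_browser" '
    'tabindex="-1"><img src="https://fbcdn-profile-xxxxxxxx.net/hprofile-ak-ash2/xxxxxx.jpg" '
    'alt=""></a>'
)

print(find_facebook_urls(sample))
```

html.unescape() turns the &amp; found in the raw markup back into &, so the result has the form the question asks for. Unlike the parser-based version, this also picks up matching URLs that appear outside href attributes.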


2013-11-25 03:12:08
