BeautifulSoup does not find all links when extracting links from a website (link identification)

-1

I am using the code from here (retrieve links from web page using python and BeautifulSoup) to extract all links from a website.

import httplib2 
from BeautifulSoup import BeautifulSoup, SoupStrainer 

http = httplib2.Http() 
status, response = http.request('http://www.bestwestern.com.au') 

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')): 
    if link.has_attr('href'): 
        print link['href']

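I am using the site http://www.bestwestern.com.au as a test. Unfortunately, I noticed that the code does not extract some links, for example http://www.bestwestern.com.au/about-us/careers/. I don't know why. This is what I found in the page source: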

<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>

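I think the extractor should normally pick it up. In the BeautifulSoup documentation I can read: "The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or html5lib." So I installed html5lib, but I still get the same behavior.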

Thank you for your help.

Source

2016-09-19 BND

I don't actually see a "Careers" link on this page. Are we looking at the same page? – alecxe

You can see the "Careers" link by looking at the sitemap here: http://www.bestwestern.com.au/sitemap/ – BND

One problem is that you are using BeautifulSoup version 3, which is no longer maintained. You need to upgrade to BeautifulSoup version 4:

pip install beautifulsoup4

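The other problem is that there is no "Careers" link on the main page, but there is one on the "sitemap" page. Request that page and parse it with the default html.parser parser, and you will see the "Careers" link printed among the others: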

import requests 
from bs4 import BeautifulSoup, SoupStrainer 

response = requests.get('http://www.bestwestern.com.au/sitemap/') 

for link in BeautifulSoup(response.content, "html.parser", parse_only=SoupStrainer('a', href=True)): 
    print(link['href'])

Note how I added the "must have an href" rule to the soup strainer.
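To see what the href=True rule changes, here is a small offline sketch. The sample markup and the helper name `extract_hrefs` are made up for illustration; only the `SoupStrainer('a', href=True)` call is taken from the answer above.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up markup for illustration: the second anchor has no href attribute.
SAMPLE_HTML = """
<ul>
  <li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>
  <li><a name="top">No href here</a></li>
  <li><a href="/sitemap/">Sitemap</a></li>
</ul>
"""

def extract_hrefs(html):
    """Return the href of every <a> tag that actually has one."""
    only_links = SoupStrainer('a', href=True)
    soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)
    return [a['href'] for a in soup.find_all('a')]

if __name__ == '__main__':
    for href in extract_hrefs(SAMPLE_HTML):
        print(href)
```

Because the strainer drops anchors without an href while parsing, the loop body no longer needs a separate has_attr('href') check.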

Source

2016-09-19 22:04:52 alecxe

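I have version 4 of BeautifulSoup, but it still can't find the link. I don't know whether the default parser is Python's built-in HTML parser, but I think the problem might come from there. – BND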

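"This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or html5lib." So I installed html5lib, but I still get the same behavior. – BND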

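@BND no, no. As I said, there is no "careers" link on the main page, but there is one on the "sitemap" page. I have updated the code in the answer; it works for me as is and prints the "careers" link. – alecxe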
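One way to settle the parser question from these comments is to feed the snippet quoted in the question to each parser by name. This is only a sketch: `hrefs_by_parser` is a hypothetical helper, and it assumes the code imports from `bs4` (version 4), since `from BeautifulSoup import ...` loads the old version 3 package regardless of what else is installed.

```python
import bs4
from bs4 import BeautifulSoup

# The list item quoted in the question.
SNIPPET = '<li><a href="http://www.bestwestern.com.au/about-us/careers/">Careers</a></li>'

def hrefs_by_parser(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each installed parser name to the hrefs it finds in html."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except bs4.FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        results[name] = [a['href'] for a in soup.find_all('a', href=True)]
    return results

if __name__ == '__main__':
    print('bs4 version:', bs4.__version__)
    for name, hrefs in hrefs_by_parser(SNIPPET).items():
        print(name, hrefs)
```

If every installed parser reports the careers URL for this snippet, the parser is unlikely to be the cause, which fits the explanation that the link is simply not on the page that was fetched.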

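Okay, this is an old question, but I stumbled on it in my search, and it seems like it should be relatively simple to accomplish. I did switch from httplib2 to requests.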

import requests
from bs4 import BeautifulSoup, SoupStrainer

baseurl = 'http://www.bestwestern.com.au'

SEEN_URLS = set()  # pages that have already been crawled

def get_links(url):
    SEEN_URLS.add(url)
    response = requests.get(url)
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a', href=True)):
        href = link['href']
        print(href)
        # Follow only same-site links that have not been crawled yet.
        if baseurl in href and href not in SEEN_URLS:
            get_links(href)

if __name__ == '__main__':
    get_links(baseurl)
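A recursive crawl like the one above can run into Python's recursion limit on a large site, and the `baseurl in ...` test never matches relative links such as /sitemap/. Below is a rough iterative sketch of the same idea. The names `crawl` and `fetch_html`, the `max_pages` cap, and the 10-second timeout are my own assumptions, not part of the answer.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def fetch_html(url):
    """Download a page; kept separate so it can be swapped out in tests."""
    return requests.get(url, timeout=10).content

def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of one site; returns the set of links found."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()  # pages already downloaded
    found = set()    # every link seen on any downloaded page
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        soup = BeautifulSoup(fetch(url), 'html.parser')
        for a in soup.find_all('a', href=True):
            link = urljoin(url, a['href'])  # resolve relative links
            found.add(link)
            # Queue only links on the same host that are not downloaded yet.
            if urlparse(link).netloc == site and link not in visited:
                queue.append(link)
    return found

if __name__ == '__main__':
    for link in sorted(crawl('http://www.bestwestern.com.au')):
        print(link)
```

Passing the download function in as `fetch` makes it possible to try the crawl logic on canned HTML without touching the network.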

Source

2017-01-04 13:52:35 StoneyD
