Extracting links from an HTML page with BeautifulSoup

I need to extract some articles from the Biography website.

From the page http://www.biography.com/people I need all of the sub-links. For example:

/people/ryan-seacrest-21095899 
/people/edgar-allan-poe-9443160

But I have two problems:

1 - When I try to find all the <a> tags, I can't find the hrefs I need.

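from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup       # BeautifulSoup 4 package

url = "http://www.biography.com/people"
text = urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")
# Print every <a> tag present in the static HTML
links = soup.find_all('a')
for link in links:
    print(link)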

2 - There is a "See More" button. How can I get the links for all the people on the site, not just the ones that appear on the first page?


2017-05-03 user1927468

You have to use Selenium for this –

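Part of the content shown on the site is generated with Angular and JS. BeautifulSoup does not execute JS. You need to use http://selenium-python.readthedocs.io/ or a similar tool. Alternatively, you can inspect the AJAX GET (or possibly POST) request the page makes and fetch the data through it directly.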
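As a rough illustration of the second option, here is a minimal sketch that pages through a JSON endpoint using only the standard library. The endpoint URL, the `page` and `size` parameters, and the `items`/`url` field names are all assumptions made up for this example; the real request has to be read from the browser's network tab while clicking "See More".

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# HYPOTHETICAL placeholder: replace with the request URL seen in the
# browser's network tab. The real endpoint is not known here.
API_URL = "https://example.com/api/people"


def build_page_url(base_url, page, page_size=20):
    """Build the URL for one page; 'page' and 'size' are assumed parameter names."""
    query = urlencode({"page": page, "size": page_size})
    return f"{base_url}?{query}"


def links_from_payload(payload):
    """Collect the link of each item; 'items' and 'url' are assumed field names."""
    return [item["url"] for item in payload.get("items", []) if "url" in item]


def fetch_all_links(base_url, max_pages=50):
    """Request page after page until one comes back without any links."""
    links = []
    for page in range(1, max_pages + 1):
        request = Request(build_page_url(base_url, page),
                          headers={"Accept": "application/json"})
        with urlopen(request) as response:
            payload = json.load(response)
        page_links = links_from_payload(payload)
        if not page_links:
            break
        links.extend(page_links)
    return links
```

Calling `fetch_all_links(API_URL)` would then return every link the endpoint exposes, provided the response really has the assumed shape. The `max_pages` cap is only a safety net against an endpoint that never returns an empty page.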


2017-05-03 10:49:22

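Exactly what I needed. Thanks :) – user1927468

Great post. I suggest using PhantomJS with Selenium. You can also parse the page that Selenium has rendered with BeautifulSoup by passing it driver.page_source, and use .execute_script to run specific JS or use .get_attribute( –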
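The combination described in this comment might look roughly like the sketch below. It uses headless Chrome instead of PhantomJS, since recent Selenium releases no longer ship a PhantomJS driver. The helper names, the XPath passed in for the "See More" button, the click limit, and the fixed pause are assumptions for illustration, not values taken from the site.

```python
import re
import time

from bs4 import BeautifulSoup

# Matches hrefs shaped like /people/edgar-allan-poe-9443160 (as in the question).
PEOPLE_HREF = re.compile(r"^/people/[a-z0-9-]+-\d+$")


def extract_people_links(html):
    """Return the unique /people/... hrefs in an HTML string, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=PEOPLE_HREF):
        href = a["href"]
        if href not in links:
            links.append(href)
    return links


def fetch_rendered_html(url, see_more_xpath, max_clicks=5, pause=2.0):
    """Load url in headless Chrome, click the button a few times, return the DOM."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        for _ in range(max_clicks):
            # see_more_xpath is supplied by the caller; the real selector is unknown.
            buttons = driver.find_elements(By.XPATH, see_more_xpath)
            if not buttons:
                break
            # Click via JS so an overlapping element cannot block the click.
            driver.execute_script("arguments[0].click();", buttons[0])
            time.sleep(pause)  # crude fixed wait for the new entries to load
        return driver.page_source
    finally:
        driver.quit()
```

A call such as `extract_people_links(fetch_rendered_html("http://www.biography.com/people", "//button[contains(., 'See More')]"))` would then return the links, assuming that XPath actually matches the button on the live page. An explicit wait would be more robust than the fixed pause.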
