Pull links and scrape those pages in Python

I want to scrape some links from this page.

http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html

This gets the links I want.

import re
import urllib2
from bs4 import BeautifulSoup
boxurl = urllib2.urlopen(url).read()  # url is the team page above
soup = BeautifulSoup(boxurl)
boxscores = soup.findAll('a', href=re.compile('boxscore'))

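I want to scrape every boxscore linked from that page. I have already written the code that scrapes a boxscore, but I don't know how to get to each of them.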

Edit

I think this is better, because it strips out the HTML tags. I still need to know how to open the links.

for link in soup.find_all('a', href=re.compile('boxscore')): 
    print(link.get('href'))
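
One way to turn those `href` values into URLs that can be opened is to resolve each one against the site root. This is only a sketch: it assumes the hrefs are site-relative paths (not confirmed here), the sample href is made up from the boxscore URL quoted later in this question, and `to_absolute` is a hypothetical helper name.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, matching the urllib2 code

BASE = 'http://www.covers.com'

def to_absolute(hrefs, base=BASE):
    """Resolve each href against base; already-absolute URLs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]

# Illustrative input only; on the real page this would be
# [link.get('href') for link in soup.find_all('a', href=re.compile('boxscore'))]
sample_hrefs = ['/pageLoader/pageLoader.aspx?page=/data/wnba/results/2012/boxscore841602.html']
urls = to_absolute(sample_hrefs)
```

Each entry in `urls` could then be passed to `urlopen` in the same way as the single URL in the question.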

EDIT2: This is how I scrape some data from the first link on the page.

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/results/2012/boxscore841602.html' 


boxurl = urllib2.urlopen(url).read() 
soup = BeautifulSoup(boxurl) 
def _unpack(row, kind='td'): 
    return [val.text for val in row.findAll(kind)] 

tables = soup('table') 
linescore = tables[1] 
linescore_rows = linescore.findAll('tr') 
roadteamQ1 = float(_unpack(linescore_rows[1])[1]) 
roadteamQ2 = float(_unpack(linescore_rows[1])[2]) 
roadteamQ3 = float(_unpack(linescore_rows[1])[3]) 
roadteamQ4 = float(_unpack(linescore_rows[1])[4]) 

print roadteamQ1, roadteamQ2, roadteamQ3, roadteamQ4
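
The four near-identical lines above could be folded into one helper that takes the page HTML and returns the quarter scores. This is a sketch under assumptions: `quarter_scores` is a hypothetical name, the inline HTML is a made-up stand-in for the real boxscore page (whose markup is not shown here), and it assumes the `bs4` package with Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

def _unpack(row, kind='td'):
    """Return the text of every cell of the given kind in a table row."""
    return [val.text for val in row.find_all(kind)]

def quarter_scores(html, table_index=1, row_index=1):
    """Return the Q1..Q4 cells of one linescore row as floats."""
    soup = BeautifulSoup(html, 'html.parser')
    linescore = soup('table')[table_index]       # second table by default
    row = linescore.find_all('tr')[row_index]    # second row by default
    return [float(cell) for cell in _unpack(row)[1:5]]

# Made-up markup: the first table is filler, the second is the linescore.
sample_html = """
<table><tr><td>filler</td></tr></table>
<table>
  <tr><td>Team</td><td>1</td><td>2</td><td>3</td><td>4</td><td>T</td></tr>
  <tr><td>Road</td><td>20</td><td>18</td><td>22</td><td>15</td><td>75</td></tr>
  <tr><td>Home</td><td>19</td><td>21</td><td>17</td><td>25</td><td>82</td></tr>
</table>
"""
road_q1, road_q2, road_q3, road_q4 = quarter_scores(sample_html)
```

Because the helper takes an HTML string rather than a URL, it can be reused on any page once that page has been downloaded.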

However, when I try this:

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'
boxurl = urllib2.urlopen(url).read() 
soup = BeautifulSoup(boxurl) 

tables = pages[0]('table') 
linescore = tables[1] 
linescore_rows = linescore.findAll('tr') 
roadteamQ1 = float(_unpack(linescore_rows[1])[1]) 
roadteamQ2 = float(_unpack(linescore_rows[1])[2]) 
roadteamQ3 = float(_unpack(linescore_rows[1])[3]) 
roadteamQ4 = float(_unpack(linescore_rows[1])[4])

I get this error:

tables = pages[0]('table')
TypeError: 'str' object is not callable

print pages[0]

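prints all of the HTML of the first link, as expected. I hope this is not too confusing. To sum up: I can now get the links, but I still can't scrape from them.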

Source

2013-05-12 user2333196

If you are using the page as the basis for a "crawl", you may want to take a look at [scrapy](http://scrapy.org) – 2013-05-13 13:27:21

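Now that your question is clearer, you might want to look at a previous answer of mine: http://stackoverflow.com/questions/15866297/匹配特定表格內的html-beautifulsoup/15866957#15866957 All you need to do is work through it systematically. It isn't hard, just tedious! – Vorsprung 2013-05-13 18:30:02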

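I can usually find and scrape the table I want. In this case it is the second table on the page, so tables = soup('table') followed by linescore = tables[1] selects the table I want. Where I run into trouble is opening the pages from the links and then selecting the table. – user2333196 2013-05-13 19:20:30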

Pull all the pages for the links you found into an array like this, so the first page is pages[0], the second is pages[1], and so on.

boxscores = soup.findAll('a', href=re.compile('boxscore')) 
basepath = "http://www.covers.com" 
pages=[] 
for a in boxscores: 
    pages.append(urllib2.urlopen(basepath + a['href']).read())
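
Each entry in `pages` is the raw HTML text returned by `.read()`, not a parsed document, so it has to go through `BeautifulSoup` before any tables can be selected from it. A minimal sketch, assuming the `bs4` package with Python's built-in `html.parser`; the two inline strings are made-up stand-ins for the fetched pages, and `soups` is a hypothetical name:

```python
from bs4 import BeautifulSoup

# Stand-ins for the strings collected by urlopen(...).read() above.
pages = [
    "<table><tr><td>a</td></tr></table><table><tr><td>first page</td></tr></table>",
    "<table><tr><td>b</td></tr></table><table><tr><td>second page</td></tr></table>",
]

# A plain string cannot be called like soup('table'); parse it first.
soups = [BeautifulSoup(html, 'html.parser') for html in pages]

# soup('table') is shorthand for soup.find_all('table').
second_tables = [soup('table')[1] for soup in soups]
labels = [table.find('td').text for table in second_tables]
```

From there, the linescore code from the question can run on `soups[0]` in place of `pages[0]`.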

Source

2013-05-13 13:57:10 Vorsprung

Yes, this works, but I can't scrape from it. I will explain the problem in more detail in the original question. – user2333196 2013-05-13 17:30:48
