Looping a scraper

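I want to scrape data for every game of the regular season, starting from http://www.basketball-reference.com/boxscores/201112250DAL.html. All my other data-gathering functions work fine; the problem I'm having is with looping the scraper. Below is the test code I use to get the URL of the next page. I could use it to get the data for all 66 regular-season games, but doing it this way means a lot of typing. What is the simplest way to automate it?

Thanks!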

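# Imports assumed (Python 2, BeautifulSoup 3); the original snippet omitted them.
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

URL = "http://www.basketball-reference.com/boxscores/201112250DAL.html"

html = urlopen(URL).read()
soup = BeautifulSoup(html)

def getLink(html, soup):
    # The navigation links carry the class "bold_text"; which one points to
    # the next game depends on how many of them the page has.
    links = soup.findAll('a', attrs={'class': 'bold_text'})
    if len(links) == 2:
        a = links[0]
    elif len(links) == 3:
        a = links[1]
    elif len(links) == 4:
        a = links[3]
    else:
        return None  # unexpected page layout
    # Slice the game id out of the serialized tag, e.g. "/201112250DAL."
    return str(a)[37:51]

print getLink(html, soup)
URL1 = "http://www.basketball-reference.com/boxscores" + getLink(html, soup) + "html"
print URL1
html1 = urlopen(URL1).read()
soup1 = BeautifulSoup(html1)

print getLink(html1, soup1)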

Source

2012-12-17 user1851527

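If you just want to solve the URL problem, why not fetch http://www.basketball-reference.com/teams/DAL/2012_games.html and literally pull out the strings that look like "/boxscores/*.html"? That will give you the 66 regular-season games plus the playoffs. – tanantish

This is just a test for walking through the URLs; the main code has functions that pull the data of interest from each game. I just want to know how to get through the process as quickly as possible. – user1851527

I was thinking of using the /DAL/2012_games.html page as an index. You can retrieve it once, easily pull out the 66 URLs you need, put them in a list, and iterate over that. This avoids going through an entire box-score page just to find the right "next game" link (since I don't see any simple pattern to match there). – tanantish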
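The index-page idea from the comments above can be sketched as follows. This is only an illustration in Python 3 using the standard library: the sample HTML is a made-up excerpt standing in for the real 2012_games.html page (the second game id is invented), and the names `boxscore_urls`, `BOXSCORE_HREF` and `SAMPLE_INDEX_HTML` are hypothetical.

```python
import re

# Made-up excerpt standing in for the real index page (assumption).
SAMPLE_INDEX_HTML = """
<tr><td><a href="/boxscores/201112250DAL.html">Box Score</a></td></tr>
<tr><td><a href="/boxscores/201112260DAL.html">Box Score</a></td></tr>
<tr><td><a href="/teams/DAL/2012.html">Roster</a></td></tr>
<tr><td><a href="/boxscores/201112250DAL.html">Box Score</a></td></tr>
"""

BASE_URL = "http://www.basketball-reference.com"

# Matches href values of the form /boxscores/<something>.html
BOXSCORE_HREF = re.compile(r'href="(/boxscores/[^"]+\.html)"')


def boxscore_urls(index_html):
    """Return absolute box-score URLs in page order, without duplicates."""
    seen = set()
    urls = []
    for path in BOXSCORE_HREF.findall(index_html):
        if path not in seen:
            seen.add(path)
            urls.append(BASE_URL + path)
    return urls


game_urls = boxscore_urls(SAMPLE_INDEX_HTML)
for url in game_urls:
    print(url)
```

With the real page, `index_html` would be the text downloaded once from the index URL; the resulting list can then be fed to whatever per-game functions already exist.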

The simplest approach is to go to http://www.basketball-reference.com/teams/DAL/2012_games.html and do something like this:

URL = 'http://www.basketball-reference.com/teams/DAL/2012_games.html' 
html = urllib.urlopen(URL).read() 
soup = BeautifulSoup(html) 

links = soup.findAll('a',text='Box Score')

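This returns a list of the matches for the "Box Score" links (with BeautifulSoup 3 these are the text nodes inside the <a> tags, which is why the loop reads the href from link.parent). Test it with: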

for link in links: 
    print link.parent['href'] 
    page_url = 'http://www.basketball-reference.com' + link.parent['href']

From here, send another request to page_url and carry on with your own code.

Here is the entire code I used, and it works perfectly for me:
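As a rough sketch of that step, the loop over the collected URLs might look like this in Python 3. The function `scrape_games` and its parameters are hypothetical; `fetch` and `parse` are passed in so that the asker's existing per-game functions can be plugged in, and the demonstration uses stand-in pages instead of the network.

```python
import time


def scrape_games(page_urls, fetch, parse, delay_seconds=1.0):
    """Fetch each URL in turn and return a dict mapping url -> parse(html).

    fetch: callable taking a URL and returning the page's HTML.
    parse: callable taking that HTML and returning the data of interest.
    """
    results = {}
    for i, url in enumerate(page_urls):
        if i > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)  # small pause between requests
        results[url] = parse(fetch(url))
    return results


# Demonstration with stand-in pages (made-up content, no network access).
fake_pages = {
    "http://www.basketball-reference.com/boxscores/201112250DAL.html": "<html>game 1</html>",
    "http://www.basketball-reference.com/boxscores/201112260DAL.html": "<html>game two</html>",
}
demo = scrape_games(list(fake_pages), fake_pages.get, len, delay_seconds=0)
print(demo)
```

For real use, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()` and `parse` one of the existing data functions; the delay is simply a courtesy to the server, not something the site is known to require.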

from BeautifulSoup import BeautifulSoup 
import urllib 


url = 'http://www.basketball-reference.com/teams/DAL/2012_games.html' 
file_pointer = urllib.urlopen(url) 
soup = BeautifulSoup(file_pointer) 

links = soup.findAll('a',text='Box Score') 
for link in links: 
    print link.parent['href']

Source

2012-12-17 20:03:46 That1Guy

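Thanks, I never thought of trying it that way. I gave your idea a quick test and it didn't work; I got a KeyError: 'href'. – user1851527

Are you sure you want the parent's href? In that case you will get a KeyError, because the parent of the A tag will be a TD, which has no href. If you instead access the link's own href attribute (`link['href']` rather than `link.parent['href']`), – tanantish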
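The point made in the comment above, reading the href from the <a> tag itself rather than from its parent, can be illustrated without depending on any particular BeautifulSoup version. The following is a sketch in Python 3 using only the standard library's `html.parser`; the class name `BoxScoreLinks` and the sample markup are invented for the example, and this is not the answerer's code.

```python
from html.parser import HTMLParser


class BoxScoreLinks(HTMLParser):
    """Collect the href of every <a> whose text is exactly 'Box Score'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._href = None  # href of the <a> currently being read, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # The href lives on the <a> tag itself, not on the enclosing <td>.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            text = "".join(self._text).strip()
            if self._href is not None and text == "Box Score":
                self.hrefs.append(self._href)
            self._href = None


# Made-up table row standing in for one line of the schedule page.
SAMPLE_ROW = (
    '<table><tr>'
    '<td><a href="/boxscores/201112250DAL.html">Box Score</a></td>'
    '<td><a href="/teams/MIA/2012.html">Miami Heat</a></td>'
    '</tr></table>'
)

parser = BoxScoreLinks()
parser.feed(SAMPLE_ROW)
parser.close()
print(parser.hrefs)
```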

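The simplest way of all is to use Scrapy. It follows links automatically.

It lets you easily create complex rules for which URLs to follow and which to ignore. Scrapy will then follow any URL that matches your rules. It does require you to learn how Scrapy works, but the project provides an excellent quick tutorial on how to get started.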
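To make the idea of rule-based link following concrete, here is a small plain-Python sketch. It is explicitly not Scrapy's API (in Scrapy this job is done by a `CrawlSpider` with `Rule` and `LinkExtractor` objects); the function `crawl`, its `allow`/`deny` parameters and the fake site below are all invented for illustration.

```python
import re
from collections import deque

HREF = re.compile(r'href="([^"]+)"')


def crawl(start, fetch, allow, deny=None):
    """Breadth-first crawl from `start`.

    Follows only links that match the `allow` regex and do not match the
    `deny` regex. Returns the visited paths in visiting order.
    """
    allow_re = re.compile(allow)
    deny_re = re.compile(deny) if deny else None
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        visited.append(path)
        for link in HREF.findall(fetch(path)):
            if link in seen:
                continue  # already queued or visited
            if not allow_re.search(link):
                continue  # not covered by the "follow" rule
            if deny_re is not None and deny_re.search(link):
                continue  # explicitly excluded
            seen.add(link)
            queue.append(link)
    return visited


# Made-up three-page site; paths and markup are placeholders.
FAKE_SITE = {
    "/teams/DAL/2012_games.html": (
        '<a href="/boxscores/A.html">Box Score</a> '
        '<a href="/boxscores/B.html">Box Score</a> '
        '<a href="/players/x.html">x</a>'
    ),
    "/boxscores/A.html": (
        '<a href="/boxscores/B.html">Next</a> '
        '<a href="/boxscores/shot-chart/A.html">Shots</a>'
    ),
    "/boxscores/B.html": '<a href="/boxscores/A.html">Prev</a>',
}

order = crawl(
    "/teams/DAL/2012_games.html",
    lambda p: FAKE_SITE.get(p, ""),
    allow=r"^/boxscores/",
    deny=r"shot-chart",
)
print(order)
```

A real crawler would also need URL normalization, error handling and politeness settings, which is a large part of what a framework like Scrapy provides.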

Source

2012-12-18 14:42:52 dm03514
