
I am importing the boxscore links from the page below with BeautifulSoup. How can I automate this import?

http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html 

This is how I am doing it now. I get the links from the first page.

import re
import urllib2
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (BS3)

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'

boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)

# Grab every link whose href contains "boxscore" and fetch each linked page
boxscores = soup.findAll('a', href=re.compile('boxscore'))
basepath = "http://www.covers.com"
pages = []
for a in boxscores:
    pages.append(urllib2.urlopen(basepath + a['href']).read())
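
A side note on the URL building: string concatenation works here because every href on this page is a site-absolute path, but urlparse.urljoin from the standard library also handles relative hrefs correctly. A minimal sketch of the same fetch loop using it:

import urlparse

# Build absolute URLs first so they can be inspected before fetching
links = [urlparse.urljoin(basepath, a['href']) for a in boxscores]
pages = [urllib2.urlopen(link).read() for link in links]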

Then, in a new window, I do this.

import pandas as pd
from dateutil.parser import parse

newsoup = pages[1]  # I am manually changing this index every time

soup = BeautifulSoup(newsoup)

def _unpack(row, kind='td'):
    # Return the text of every cell of the given kind in a table row
    return [val.text for val in row.findAll(kind)]

tables = soup('table')

# Quarter-by-quarter line score: row 1 is the road team, row 2 the home team
linescore_rows = tables[1].findAll('tr')
road_cells = _unpack(linescore_rows[1])
home_cells = _unpack(linescore_rows[2])
roadteamQ1, roadteamQ2, roadteamQ3, roadteamQ4 = [float(x) for x in road_cells[1:5]]  # add OT columns here if needed
roadteamFinal = float(road_cells[-3])
hometeamQ1, hometeamQ2, hometeamQ3, hometeamQ4 = [float(x) for x in home_cells[1:5]]  # add OT columns here if needed
hometeamFinal = float(home_cells[-3])

# Team names come from the header row of the misc-stats table
misc_stats_rows = tables[5].findAll('tr')
roadteam = str(_unpack(misc_stats_rows[0])[0]).strip()
hometeam = str(_unpack(misc_stats_rows[0])[1]).strip()

# The game date sits in the next table; the page omits the year, so force 2012
datefinder_rows = tables[6].findAll('tr')
date = str(_unpack(datefinder_rows[0])[0]).strip()
year = 2012
parsedDate = parse(date).replace(year=year)
month = parsedDate.month
day = parsedDate.day
modDate = str(day) + str(month) + str(year)  # no zero padding, so e.g. 1/11 and 11/1 collide
gameid = modDate + roadteam + hometeam

data = {'roadteam': [roadteam],
        'hometeam': [hometeam],
        'roadQ1': [roadteamQ1],
        'roadQ2': [roadteamQ2],
        'roadQ3': [roadteamQ3],
        'roadQ4': [roadteamQ4],
        'homeQ1': [hometeamQ1],
        'homeQ2': [hometeamQ2],
        'homeQ3': [hometeamQ3],
        'homeQ4': [hometeamQ4]}

# Stashing each game in a global variable is fragile; a dict keyed by gameid would be safer
globals()["%s" % gameid] = pd.DataFrame(data)
df = pd.DataFrame.load('df')  # load()/save() are the old pandas pickle helpers
df = pd.concat([df, globals()["%s" % gameid]])
df.save('df')
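
To remove the manual newsoup = pages[1] step, all of the parsing above can be wrapped in a function that takes one page's HTML and returns one row of data. This is only a sketch: the name parse_boxscore is mine, and it assumes every boxscore page uses exactly the same table layout as the one dissected above.

def parse_boxscore(page, year=2012):
    # Hypothetical helper; reuses _unpack, parse, pd, and BeautifulSoup from above
    soup = BeautifulSoup(page)
    tables = soup('table')

    rows = tables[1].findAll('tr')
    road = [float(x) for x in _unpack(rows[1])[1:5]]
    home = [float(x) for x in _unpack(rows[2])[1:5]]

    header = _unpack(tables[5].findAll('tr')[0])
    roadteam, hometeam = str(header[0]).strip(), str(header[1]).strip()

    gamedate = parse(str(_unpack(tables[6].findAll('tr')[0])[0]).strip()).replace(year=year)
    gameid = str(gamedate.day) + str(gamedate.month) + str(year) + roadteam + hometeam

    data = {'roadteam': [roadteam], 'hometeam': [hometeam],
            'roadQ1': [road[0]], 'roadQ2': [road[1]],
            'roadQ3': [road[2]], 'roadQ4': [road[3]],
            'homeQ1': [home[0]], 'homeQ2': [home[1]],
            'homeQ3': [home[2]], 'homeQ4': [home[3]]}
    return gameid, pd.DataFrame(data)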

How can I automate this, so that I don't have to manually change newsoup = pages[1] each time and can scrape all of the boxscores linked from the first URL in one go? I am very new to Python and lack some understanding of the basics.


Why do you have to change it manually? So something like pages[2], pages[3], ...? –


I only know how to import them one at a time. – user2333196

Answer


So in the first code block you collect pages.

So, if I understand correctly, you have to loop over them in the second piece of code:

for page in pages: 
    soup = BeautifulSoup(page) 
    # rest of the code here 

I'll try that. Do I need to pause in between? If so, how do I do that? – user2333196


Pause? I don't know why you would need to do that. But if you want to, you can use 'raw_input('some prompt: ')', which will wait until you press Enter. –
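
If the "pause" the asker has in mind is spacing out the HTTP requests rather than waiting for keyboard input, time.sleep from the standard library is the usual tool. A minimal sketch of the fetch loop with an arbitrary two-second delay:

import time

pages = []
for a in boxscores:
    pages.append(urllib2.urlopen(basepath + a['href']).read())
    time.sleep(2)  # arbitrary polite delay between requests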