I want to parse basketball statistics from pages like http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html. I am using Python 2.7.6 and BeautifulSoup 4-4.3.2. I search a school's gamelog page for links to boxscore pages like the one above, then look for the "sortable" class that wraps the raw stat tables. I am only interested in each team's "Basic Stats" table.
However, the HTML that BeautifulSoup returns is not what I expect. Instead, I get a listing of historical team records and data for every school that has ever played. I don't have enough reputation to post a second link here, or I would.
Basically, there are four tables of class "sortable" on a boxscore page. When I ask BS to find them by the only attribute I can think of that distinguishes them from the rest of the page, it returns completely unrelated data, and I can't even tell where that data comes from.
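One parser-independent sanity check (a sketch; the sample HTML below is made up) is to count occurrences of `class="sortable"` in the raw downloaded text with a plain regex, to see whether the tables BeautifulSoup should be finding are actually present in what urllib2 fetched:

```python
import re

def count_sortable_tables(html):
    # Count <table> tags whose class attribute contains "sortable",
    # independently of any HTML parser, as a cross-check against BeautifulSoup.
    return len(re.findall(r'<table[^>]*class="[^"]*sortable[^"]*"', html))

# Hypothetical sample standing in for a downloaded boxscore page:
sample = (
    '<table class="sortable stats_table"><tr><td>1</td></tr></table>'
    '<table class="sortable"><tr><td>2</td></tr></table>'
    '<table class="teams"><tr><td>3</td></tr></table>'
)
count_sortable_tables(sample)  # 2 of the 3 sample tables match
```

If this count disagrees with what `findAll(class_="sortable")` returns on the same string, the problem is on the parsing side; if the count is zero, the fetched page is not the one you think it is.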
Here is the code:
import urllib2
import re
import sys
from bs4 import BeautifulSoup

class Gamelogs():
    def __init__(self):
        # the base page that has all boxscore links
        self.teamPageSoup = BeautifulSoup(urllib2.urlopen(
            'http://www.sports-reference.com/cbb/schools/' + school +
            '/2015-gamelogs.html'))
        # use regex to only find links with score data
        self.statusPageLinks = self.teamPageSoup.findAll(href=re.compile(
            "boxscores"))

def scoredata(links, school):
    # for each link in the school's season
    for l in links:
        gameSoup = BeautifulSoup(urllib2.urlopen(l))
        # remove extra link formatting to get just the filename alone
        l = l[59 + len(school):]
        # open a local file with that filename to store the results
        fo = open(str(l), "w")
        # create a list that will hold the box score data only
        output = gameSoup.findAll(class_="sortable")
        # write it line by line to the file that was just opened
        for o in output:
            fo.write(str(o) + '\n')
        fo.close()

def getlinks(school):
    gamelogs = Gamelogs()
    # open a new file to store the output
    fo = open(school + '.txt', "w")
    # remove extraneous links
    gamelogs.statusPageLinks = gamelogs.statusPageLinks[2:]
    # create the list that will hold each school's season-long boxscores
    boxlinks = list()
    for s in gamelogs.statusPageLinks:
        # make the list element a string so it can be sliced
        string = str(s)
        # remove extra link formatting
        string = string[9:]
        string = string[:-16]
        # create the full list of games per school
        boxlinks.insert(0, 'http://www.sports-reference.com/cbb/schools/'
                        + school + string)
    scoredata(boxlinks, school)

if __name__ == '__main__':
    # for each school given as a command-line argument
    for arg in sys.argv[1:]:
        school = arg
        getlinks(school)
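One thing worth checking in `getlinks()` (a sketch, not part of the original post): the boxscore hrefs on the gamelog page may already be site-absolute paths like `/cbb/boxscores/...`, in which case prepending `/cbb/schools/<school>` by hand builds a different URL than the site's own link points to. Joining each href against the page URL with the standard library's `urljoin` sidesteps the hand-assembled prefix entirely:

```python
try:
    from urlparse import urljoin       # Python 2, as used in the question
except ImportError:
    from urllib.parse import urljoin   # Python 3

base = 'http://www.sports-reference.com/cbb/schools/kentucky/2015-gamelogs.html'
href = '/cbb/boxscores/2014-11-14-kentucky.html'   # assumed shape of the site's href

# urljoin resolves a site-absolute path against the page it appeared on:
full = urljoin(base, href)
# full is 'http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html'
```

Printing both `full` and the string your concatenation produces, for the same link, would show immediately whether the two agree.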
Is the problem with BS, with my code, or with the site?
Most of us here would like to see more debugging info — a few prints along the way would go a long way toward answering your question. If you can't update the OP (or create a new post with all the relevant info), please post an answer. – KevinDTimm
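Following that advice, one cheap thing to print is the link produced by the fixed `[9:]` / `[:-16]` slice in `getlinks()` next to the href attribute itself (with bs4 that is simply `s['href']`). The sample anchor below is hypothetical, but it shows that the fixed slice is only correct for one particular link-text length:

```python
import re

def sliced_link(tag_string):
    # Reproduces the slicing from getlinks(): drop the first 9 characters
    # ('<a href="') and the last 16 from str(s).
    return tag_string[9:][:-16]

def href_of(tag_string):
    # Reads the href value directly instead, so the anchor's text length
    # no longer matters (with bs4, s['href'] does the same thing).
    m = re.search(r'href="([^"]*)"', tag_string)
    return m.group(1) if m else None

# Hypothetical anchor of the shape str(s) produces:
tag = '<a href="/cbb/boxscores/2014-11-14-kentucky.html">Gamelog</a>'
# The tail '">Gamelog</a>' is only 13 characters, so [:-16] cuts 3
# characters off the end of the path; the two functions disagree here.
```

A slice of 16 from the end is only exact when the link text is 10 characters long (`'">' + text + '</a>'` is then 16 characters), which would silently corrupt every other URL.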