BeautifulSoup returns irrelevant HTML

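I want to parse basketball statistics from pages like http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html. I'm using Python 2.7.6 and BeautifulSoup 4-4.3.2. I search a school's game logs for pages like the one above and look for the "sortable" class, which marks the tables that hold the raw stat data. I'm only interested in the "Basic Stats" for each team.

However, the HTML that BeautifulSoup returns is not what I expect. Instead, I get a list of all-time team records and data for every school that has ever played. I don't have enough reputation to post a second link here, or I would.

Basically, there are four tables of class "sortable" on the boxscore page. When I ask BS to find them in the only way I can think of that distinguishes them from the other data, it returns completely irrelevant data, and I can't even tell where the returned data comes from.

Code below: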

import urllib2
import re
import sys
from bs4 import BeautifulSoup

class Gamelogs():

    def __init__(self):
        # the base page that has all boxscore links
        # (note: relies on the module-level name `school` set in __main__)
        self.teamPageSoup = BeautifulSoup(urllib2.urlopen(
            'http://www.sports-reference.com/cbb/schools/' + school +
            '/2015-gamelogs.html'))
        # use regex to only find links with score data
        self.statusPageLinks = self.teamPageSoup.findAll(href=re.compile(
            "boxscores"))

def scoredata(links, school):
    # for each link in the school's season
    for l in links:
        gameSoup = BeautifulSoup(urllib2.urlopen(l))
        # remove extra link formatting to get just filename alone
        l = l[59+len(school):]
        # open a local file with that filename to store the results
        fo = open(str(l), "w")
        # create a list that will hold the box score data only
        output = gameSoup.findAll(class_="sortable")
        # write it line by line to the file that was just opened
        for o in output:
            fo.write(str(o) + '\n')
        fo.close()

def getlinks(school):
    gamelogs = Gamelogs()
    # open a new file to store the output
    fo = open(school + '.txt', "w")
    # remove extraneous links
    gamelogs.statusPageLinks = gamelogs.statusPageLinks[2:]
    # create the list that will hold each school's seasonlong boxscores
    boxlinks = list()
    for s in gamelogs.statusPageLinks:
        # make the list element a string so it can be sliced
        string = str(s)
        # remove extra link formatting
        string = string[9:]
        string = string[:-16]
        # create the full list of games per school
        boxlinks.insert(0, 'http://www.sports-reference.com/cbb/schools/'
                        + school + string)
    scoredata(boxlinks, school)

if __name__ == '__main__':
    # for each school as a commandline argument
    for arg in sys.argv[1:]:
        school = arg
        getlinks(school)

Is this a problem with BS, my code, or the website?

Asked 2015-04-28 by hanasu

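Most of us here will want to see more debugging information - a few prints along the way would go a long way toward answering your question. If you can't update the OP (or create a new post with all the relevant information), please post an answer. – KevinDTimm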

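It looks like this is a problem with your code. The page you are getting back sounds like this one: http://www.sports-reference.com/cbb/schools/?redir

Whenever I enter an invalid school name, I get redirected to a page that shows stats for 477 different teams. FYI: the team name in the URL is also case-sensitive.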
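One way to catch this kind of silent redirect is to compare the URL that was requested with the URL the response actually came from. The sketch below is only an illustration: `was_redirected` is a hypothetical helper that is not part of the original script, and the two sample URLs are taken from this thread rather than fetched.

```python
try:
    from urlparse import urlsplit        # Python 2, as used in the question
except ImportError:
    from urllib.parse import urlsplit    # Python 3

def was_redirected(requested_url, final_url):
    """Return True when the response came from a different path or query
    string than the one requested (hypothetical helper for illustration)."""
    req, fin = urlsplit(requested_url), urlsplit(final_url)
    # Compare only path and query so an http -> https hop is not flagged.
    return (req.path.rstrip('/'), req.query) != (fin.path.rstrip('/'), fin.query)

# Sample values only: a requested game-log URL and the redirect target
# mentioned in this answer.
requested = 'http://www.sports-reference.com/cbb/schools/Kentucky/2015-gamelogs.html'
final = 'http://www.sports-reference.com/cbb/schools/?redir'
print(was_redirected(requested, final))
```

In the script from the question this could be called as `was_redirected(l, response.geturl())` right after `response = urllib2.urlopen(l)`, before the response is handed to BeautifulSoup.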

Answered 2015-04-28 16:19:04 by afinit

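I'll look into that, thanks. – hanasu

That was the problem. When I inserted each link into 'boxlinks', my formatting was wrong because I had misread what the correct link looks like. I only needed to remove '/cbb/schools/' and 'school' from that statement. I was so focused on why I was getting such seemingly random data that I didn't debug where it was actually coming from. Thanks. – hanasu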
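A minimal sketch of the fix described in the comment above. It assumes the links on the game-log page are site-relative `href` values such as `/cbb/boxscores/2014-11-14-kentucky.html`; that is inferred from the box-score URL in the question and has not been checked against the live site. Instead of slicing `str(s)` and prepending a prefix by hand, each `href` is resolved against the page URL with `urljoin`. `boxscore_urls` is a hypothetical helper name.

```python
try:
    from urlparse import urljoin        # Python 2, as used in the question
except ImportError:
    from urllib.parse import urljoin    # Python 3

def boxscore_urls(page_url, hrefs):
    """Resolve each href against the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]

page = 'http://www.sports-reference.com/cbb/schools/kentucky/2015-gamelogs.html'
hrefs = ['/cbb/boxscores/2014-11-14-kentucky.html']   # assumed sample value
for url in boxscore_urls(page, hrefs):
    print(url)
```

With BeautifulSoup the `href` values can be read straight from the tags, for example `[a['href'] for a in soup.findAll(href=re.compile('boxscores'))]`, which avoids the fixed-offset slicing altogether.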
