Parsing an HTML table with BS4

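I've been trying different ways to scrape data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't get any of them to work. I tried playing with the indexes, but that didn't work either. I think I've tried too many things at this point, so if someone could point me in the right direction, I'd really appreciate it.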

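I'd like to pull all the information and export it to a .csv file, but for now I'm just trying to get the name and position to print, as a starting point.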

Here is my code:

import urllib2 
from bs4 import BeautifulSoup 
import re 

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=') 

page = urllib2.urlopen(url).read() 

soup = BeautifulSoup(page) 
table = soup.find('table') 

for row in table.findAll('tr')[0:]: 
    col = row.findAll('tr') 
    name = col[1].string 
    position = col[3].string 
    player = (name, position) 
    print "|".join(player)

Here is the error I get: line 14, in name = col[1].string IndexError: list index out of range.

--UPDATE--

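OK, I've made a little progress. It now runs from start to finish, but I have to tell it how many rows are in the table. How can I make it run through to the end on its own? Updated code: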

import urllib2 
from bs4 import BeautifulSoup 
import re 

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=') 

page = urllib2.urlopen(url).read() 

soup = BeautifulSoup(page) 
table = soup.find('table') 


for row in table.findAll('tr')[1:250]: 
    col = row.findAll('td') 
    name = col[1].getText() 
    position = col[3].getText() 
    player = (name, position) 
    print "|".join(player)


2014-02-27 ISuckAtLife

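It only took me about 8 hours to figure it out. Learning is fun. Thanks for the help, Kevin! It now includes code to write the scraped data to a csv file. Next up is taking that data and filtering out certain positions....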

Here is my code:

import urllib2 
from bs4 import BeautifulSoup 
import csv 

url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=') 

page = urllib2.urlopen(url).read() 

soup = BeautifulSoup(page) 
table = soup.find('table') 

f = csv.writer(open("2000scrape.csv", "w")) 
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"]) 
# number of <tr> rows minus one, used as the slice end below 
x = (len(table.findAll('tr')) - 1) 
# skip the header row (index 0); the slice [1:x] also stops before the last row 
for row in table.findAll('tr')[1:x]: 
    col = row.findAll('td') 
    name = col[1].getText() 
    position = col[3].getText() 
    height = col[4].getText() 
    weight = col[5].getText() 
    forty = col[7].getText() 
    bench = col[8].getText() 
    vertical = col[9].getText() 
    broad = col[10].getText() 
    shuttle = col[11].getText() 
    threecone = col[12].getText() 
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone,) 
    f.writerow(player)
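
For anyone on Python 3, here is a rough sketch of the same idea that does not need the row count. It assumes that any row with too few cells can simply be skipped, which may be what triggers the IndexError on this table. The sample HTML, the player names, and the helper names extract_rows and write_csv are all made up for illustration; the real page would be fetched with urllib.request instead of using an inline string.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <tr><td>Year</td><td>Name</td><td>College</td><td>POS</td></tr>
  <tr><td>2000</td><td><a href="#">Player One</a></td><td>Alpha U</td><td>WR</td></tr>
  <tr><td>2000</td><td><a href="#">Player Two</a></td><td>Beta U</td><td>DE</td></tr>
  <tr><td colspan="4">footer</td></tr>
</table>
"""


def extract_rows(html, wanted=(1, 3)):
    """Return the text of the wanted cell indexes for every complete data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    records = []
    # [1:] skips the header row; no upper bound is needed.
    for row in table.find_all("tr")[1:]:
        cells = row.find_all("td")
        # Skip rows that are too short to index safely (assumed to be filler).
        if len(cells) <= max(wanted):
            continue
        records.append([cells[i].get_text(strip=True) for i in wanted])
    return records


def write_csv(records, header, handle):
    """Write a header line followed by the records to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(records)


records = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for open("2000scrape.csv", "w", newline="")
write_csv(records, ["Name", "Position"], buffer)
print(buffer.getvalue())
```

Writing to a real file would go through a with open(...) block so the file gets closed, which the csv.writer(open(...)) form above does not guarantee.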


2014-02-28 12:50:10 ISuckAtLife

I can't run the script because of firewall permissions, but I believe the problem is on this line:

col = row.findAll('tr')

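row is a tr tag, and you're asking BeautifulSoup to find all the tr tags inside that tr tag. There are none, so col is an empty list and col[1] raises the IndexError. You probably meant to do: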

col = row.findAll('td')
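
A small sketch of the difference, using a made-up one-row table rather than the real page (the HTML string and the player name are hypothetical, and find_all is the newer spelling of findAll):

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table; the real page has many more columns.
html = "<table><tr><td>1999</td><td>Player One</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

row = soup.find("tr")
nested_rows = row.find_all("tr")  # searches inside the row: no nested <tr> exists
cells = row.find_all("td")        # the cells that belong to this row

print(len(nested_rows))  # 0, so nested_rows[1] would raise IndexError
print(len(cells))        # 2
```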

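Also, since the actual text isn't directly inside the tds but is tucked away inside nested divs and a tags, it may be useful to use the getText method instead of .string: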

name = col[1].getText() 
position = col[3].getText()
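
To illustrate why, here is a minimal sketch with invented markup (I haven't seen the page's actual nesting, so the structure and names below are assumptions). .string only gives a value when a tag has a single chain of children leading to one piece of text; as soon as a cell has more than one child it returns None, while get_text (the same method as getText) collects all the text underneath:

```python
from bs4 import BeautifulSoup

# Hypothetical cells: one with a single nested chain, one with two children.
html = (
    "<table><tr>"
    "<td><div><a href='#'>Player One</a></div></td>"
    "<td><div>Alpha U</div><div>WR</div></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

single = cells[0].string                      # one child at each level
multi = cells[1].string                       # two children, so no single string
joined = cells[1].get_text(" ", strip=True)   # all nested text, space separated

print(single)  # Player One
print(multi)   # None
print(joined)  # Alpha U WR
```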


2014-02-27 19:52:04 Kevin

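Ah, that makes sense. Thanks! OK, I made the changes you suggested, and it's definitely progress: it now prints most of the results on the page. It starts at Adrian Dingle, who isn't the first name in the column, and prints the full list of names and positions separated by |. Then it returns this error: File "nfltest.py", line 14, in name = col[1].getText() IndexError: list index out of range. Again, I tried playing with the indexes and can't seem to get rid of the error. Is it just me, or is this table oddly formatted? – ISuckAtLife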
