How to extract web page data from a table using Python, BeautifulSoup and mechanize

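I want to extract the data from the table on this site: http://www.pgatour.com/r/stats/info/xm.html?101 then save it as .csv and bring it into an iWork Numbers sheet. I have been trying with Python, BeautifulSoup and mechanize. I have been working from other examples without really understanding them, and without success. I have gotten this far: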

from BeautifulSoup import BeautifulSoup, SoupStrainer 
from mechanize import Browser 
import re 
br = Browser() 
response = br.open("http://www.pgatour.com/r/stats/info/xm.html?101").read()

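Then I looked at the page source with Firebug, and I guess the data I need to parse is what sits between <tbody> and </tbody>. But I don't know how to do that. Any help is much appreciated.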

2011-08-15 Mikael

If your question has been answered, please click accept. – smci

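On the main page, the tour stats appear to be populated by JavaScript (<div class="tourViewData"> ... populateDDs();). BS does not parse JavaScript; see the many other questions on this. As a workaround, select that HTML and save it as a local HTML file.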
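
The local-file workaround could be sketched roughly as follows. This is only an illustration under assumptions: it uses Python 3's standard-library html.parser in place of BeautifulSoup so that it runs without extra installs, the sample markup and player names are made up, and RowCollector and rows_from_html are hypothetical helper names. It collects the cell text of every <tr> and skips rows whose class contains tourStatTournHead.

```python
# Python 3, standard library only (html.parser swapped in for BeautifulSoup).
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell text of every <tr>, skipping header rows."""

    def __init__(self, header_class="tourStatTournHead"):
        super().__init__()
        self.header_class = header_class
        self.rows = []      # finished data rows
        self._row = None    # cells of the row being read (None: header or no row)
        self._cell = None   # text pieces of the cell being read (None: no cell)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            css = dict(attrs).get("class") or ""
            # Header rows are marked with None so their cells are ignored
            self._row = None if self.header_class in css else []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested inside child tags (e.g. a link) lands in the same cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None


def rows_from_html(html_text):
    parser = RowCollector()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Made-up sample standing in for the HTML saved from the browser
SAMPLE = """
<table class="tourStatTournTbl"><tbody>
<tr class="tourStatTournHead"><td>RANK</td><td>PLAYER</td><td>AVG</td></tr>
<tr class="tourStatTournCellAlt"><td>1</td><td><a href="#">Player A</a></td><td>318.4</td></tr>
<tr class=""><td>T2</td><td>Player B</td><td>315.3</td></tr>
</tbody></table>
"""

rows = rows_from_html(SAMPLE)
```

For a real saved page, the text passed to rows_from_html would come from something like open("saved_stats.html", encoding="utf-8").read(), where the file name is again only a placeholder.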

First, set s to a BeautifulSoup object for that URL (I use twill rather than raw mechanize; put your mechanize equivalent here):

from BeautifulSoup import BeautifulSoup, SoupStrainer 
#from mechanize import Browser 
from twill.commands import * 
import re 

go("http://www.pgatour.com/r/stats/info/xm.html?101") 
s = BeautifulSoup(get_browser().get_html())

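Anyway, the stats table you are looking for is the one marked with <tbody><tr class="tourStatTournHead">. Just to make things a little quirky, the class attribute on its rows alternates between <tr class="tourStatTournCellAlt" and <tr class=""... . We should search for the first <tr class="tourStatTournCellAlt", then process every <tr> in the table except the header row (<tr class="tourStatTournHead">).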

To iterate through the rows (the player name field may or may not be hierarchical, depending on whether it embeds a Titleist brand logo):

tbl = s.find('table', {'class':'tourStatTournTbl'}) 

def extract_text(ix, tg):
    # The player name cell (index 2) may wrap the name in a child tag
    if ix == 2:
        children = tg.findChildren()
        tg = children[0] if children else tg
    return tg.text.encode()

for rec in tbl.findAll('tr'):
    # Skip header rows (also copes with rows that have no attributes at all)
    if 'tourStatTournHead' in (rec.get('class') or ''):
        continue
    # Extract all fields
    (rank_tw, rank_lw, player, rounds, avg, tot_dist, tot_drives) = \
        [extract_text(i, t) for (i, t) in enumerate(rec.findChildren(recursive=False))]
    # ... do stuff

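We add a helper function for the player name. You may also want to convert most fields to int(), except player (string) and avg (float); if so, remember to strip the optional 'T' (for tied) from the rank fields, and remove the comma from tot_dist.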
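
The conversion described above, followed by the CSV step the question asks for, might look roughly like this. It is a sketch in Python 3 using only the standard library; to_int and convert_row are hypothetical helper names, the rows are made-up values, and it assumes every rank and count field holds a number once the 'T' and commas are removed.

```python
import csv
import io


def to_int(text):
    """Turn 'T12' into 12 and '36,930' into 36930."""
    cleaned = text.strip().replace(",", "")
    if cleaned.upper().startswith("T"):   # optional tie marker on rank fields
        cleaned = cleaned[1:]
    return int(cleaned)


def convert_row(fields):
    """Convert one row of seven strings; player stays str, avg becomes float."""
    rank_tw, rank_lw, player, rounds, avg, tot_dist, tot_drives = fields
    return [to_int(rank_tw), to_int(rank_lw), player.strip(), to_int(rounds),
            float(avg), to_int(tot_dist), to_int(tot_drives)]


# Made-up rows in the column order used by the loop above
raw_rows = [
    ["1", "T2", "Player A", "60", "318.4", "36,930", "116"],
    ["T2", "1", "Player B", "58", "315.3", "35,314", "112"],
]

# Written to an in-memory buffer here; a real run would write to a file
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["rank_tw", "rank_lw", "player", "rounds",
                 "avg", "tot_dist", "tot_drives"])
writer.writerows(convert_row(r) for r in raw_rows)
csv_text = buffer.getvalue()
```

To produce a file that Numbers can open, the buffer can be replaced with open("stats.csv", "w", newline=""), where the file name is only an example.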

2011-08-16 01:33:00 smci

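Thanks for your effort and time! Trying to enter your code I get this error: >>> tbl = s.find('table', {'class':'tourStatTournTbl'}) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 's' is not defined. I guess your code is fine and I am the one doing something wrong. I don't want to waste any more of your time. I think I need to learn Python properly before attempting this. I will copy your code, study some more Python and try again later. Thank you so much!!! – Mikael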

s should be the BeautifulSoup object for that URL. (I use twill, not mechanize.) – smci
