2013-10-18 173 views
2

Parsing an HTML file with Python

I have a very long HTML file that looks exactly like this: html file. I want to parse the file so that I get the information as tuples.

Example:

<tr> 
     <td>Cech</td> 
     <td>Chelsea</td> 
     <td>30</td> 
     <td>£6.4</td> 
</tr> 

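The information above should end up as ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example I gave sits under an <h2>Goalkeepers</h2> tag. I need that tag too, so the resulting tuple would basically look like ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). Further down the file, the remaining players are grouped under <h2> tags for midfielders, defenders and forwards.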
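The target shape described above can be sketched as a small conversion helper. This is only an illustration: the function name `to_player_tuple` is hypothetical, and it assumes every row has exactly four cells in the order name, club, points, price, with the price prefixed by `£`.

```python
def to_player_tuple(cells, position):
    """Turn four cell strings plus a position heading into a typed tuple.

    Assumes cells are (name, club, points, price) and price looks like '£6.4'.
    """
    name, club, points, price = (c.strip() for c in cells)
    return (name, club, int(points), float(price.lstrip("£")), position)


row = to_player_tuple(["Cech", "Chelsea", "30", "£6.4"], "Goalkeepers")
print(row)  # ('Cech', 'Chelsea', 30, 6.4, 'Goalkeepers')
```

A row with a different number of cells raises a ValueError during unpacking, which is probably preferable to silently storing a malformed tuple.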

I tried the BeautifulSoup and nltk libraries and got lost. So now I have the following code:

import nltk 
from urllib import urlopen 

url = "http://fantasy.premierleague.com/player-list/" 
html = urlopen(url).read() 
raw = nltk.clean_html(html) 
print raw 

This simply strips all the markup from the HTML file and gives something like this:

  Cech 
      Chelsea 
      30 
      £6.4 

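I could write a clumsy piece of code that reads each line and assigns it to a tuple, but I can't come up with any solution that also includes the player's position (the string inside the <h2> tag). Any solutions or suggestions would be greatly appreciated.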

The reason I prefer tuples is that I can use unpacking, and I plan to populate a MySQL table with the unpacked values.
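As a rough sketch of the unpacking idea, the standard-library `sqlite3` module is used here as a stand-in for MySQL so that the example runs without a server. The table name `players` and its columns are made up for illustration. With a MySQL driver the flow is similar, but the placeholder is typically `%s` rather than `?`.

```python
import sqlite3

# Example tuples in the shape the question asks for.
players = [
    ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"),
    ("Baines", "Everton", 43, 7.7, "Defenders"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute(
    "CREATE TABLE players "
    "(name TEXT, club TEXT, points INTEGER, price REAL, position TEXT)"
)

# executemany binds each tuple to the five placeholders in order.
conn.executemany("INSERT INTO players VALUES (?, ?, ?, ?, ?)", players)
conn.commit()

# Reading back: each row unpacks into five variables.
rows = conn.execute("SELECT * FROM players ORDER BY points DESC").fetchall()
for name, club, points, price, position in rows:
    print(name, club, points, price, position)
```

Passing the tuples as parameters, rather than formatting them into the SQL string, leaves quoting and escaping to the database driver.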

+0

I think you can now see, in light of the answers, that nltk is the wrong tool for the job. – msw

+0

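I ended up playing with nltk because I was having a hard time. It looked easy, but it gave me a recursion error. It took a while to understand what the problem was. –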

Answer

2
from bs4 import BeautifulSoup
from pprint import pprint

# `html` is the page source downloaded in the question: urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
h2s = soup.select("h2")  # all h2 elements (one per position)
tables = soup.select("table")  # all tables

players = []
for i, table in enumerate(tables):
    # The page has 8 tables and 4 h2 elements: every h2 is followed by
    # 2 tables, so table i belongs to heading i // 2
    title = h2s[i // 2].text
    for tr in table.select("tr"):
        player = (title,)  # start each tuple with the position
        for td in tr.select("td"):
            player = player + (td.text,)  # append each cell's text
        if len(player) > 1:
            # skip rows without td cells, which would give only ("Goalkeepers",)
            players.append(player)
pprint(players)

Output

[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'), 
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'), 
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'), 
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'), 
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'), 
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'), 
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'), 
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'), 
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'), 
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'), 
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'), 
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'), 
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'), 
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'), 
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'), 
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'), 
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'), 
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'), 
('Defenders', 'Baines', 'Everton', '43', '£7.7'), 
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'), 
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'), 
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'), 
('Defenders', 'Davies', 'Hull City', '28', '£4.5'), 
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'), 
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'), 
('Defenders', 'Potts', 'West Ham', '0', '£3.9'), 
('Defenders', 'Spence', 'West Ham', '0', '£3.9'), 
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'), 
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'), 
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'), 
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'), 
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'), 
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'), 
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'), 
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'), 
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'), 
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'), 
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'), 
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')] 
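An alternative sketch that does not depend on there being exactly two tables per heading: walk the document in order and tag each row with the most recent `<h2>`. This version uses only the standard library's `html.parser` rather than BeautifulSoup. The class name `PlayerRows` is made up, and the sample markup is a simplified guess at the page's structure, not the real page.

```python
from html.parser import HTMLParser


class PlayerRows(HTMLParser):
    """Collect (cell, ..., heading) tuples, tagging rows with the latest <h2>."""

    def __init__(self):
        super().__init__()
        self.heading = ""    # text of the most recent <h2>
        self.cells = None    # cells of the <tr> being read, or None outside a row
        self.capture = None  # "h2" or "td" while inside that tag
        self.players = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells = []
        elif tag == "h2":
            self.capture, self.heading = "h2", ""
        elif tag == "td" and self.cells is not None:
            self.capture = "td"
            self.cells.append("")

    def handle_data(self, data):
        if self.capture == "h2":
            self.heading += data
        elif self.capture == "td":
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("h2", "td"):
            self.capture = None
        elif tag == "tr" and self.cells is not None:
            if self.cells:  # rows without <td> cells (e.g. headers) are skipped
                row = tuple(c.strip() for c in self.cells)
                self.players.append(row + (self.heading.strip(),))
            self.cells = None


# Simplified, made-up markup using two rows from the output above.
sample = """
<h2>Goalkeepers</h2>
<table><tr><th>Player</th></tr>
<tr><td>Cech</td><td>Chelsea</td><td>30</td><td>£6.4</td></tr></table>
<h2>Defenders</h2>
<table><tr><td>Baines</td><td>Everton</td><td>43</td><td>£7.7</td></tr></table>
"""
parser = PlayerRows()
parser.feed(sample)
print(parser.players)
```

Here the heading is placed last in each tuple, matching the order requested in the question; the cell values are left as strings.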
+0

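I'm not sure what the code above does. The code I posted in the question using the nltk module does exactly the same thing as your code. In fact, your code even removes the Defenders, Midfielders and Forwards labels that I actually need in my output. –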

+1

@begin.py Sorry, I misunderstood. I've fixed it and will update in a minute. –

+0

I think this is what you're looking for? If it's confusing, let me know and I'll add comments. –