2016-12-13 52 views

Parsing HTML tr returns an empty list

I am following this question and this other one to parse tables from Wikipedia.

Specifically, I just want to get all the rows and, within each row, dump the contents of every column.

My code uses the xml library under MacOS X, but all I get is an empty list of rows:

import xml.etree.ElementTree 

s = open("wikiactors20century.txt", "r").read() 

# tree = xml.etree.ElementTree.fromstring(s) 
# rows = tree.findall() 
# headrow = rows[0] 
# datarows = rows[1:] 
# 
# for num, h in enumerate(headrow): 
#  data = ", ".join([row[num].text for row in datarows]) 
#  print "{0:<16}: {1}".format(h.text, data) 

table = xml.etree.ElementTree.XML(s) 
rows = iter(table) 
headers = [col.text for col in next(rows)] 
for row in rows: 
    values = [col.text for col in row] 
    print(dict(zip(headers, values)))

The input file has been pasted here in PasteBin. Neither the xml.etree.ElementTree.fromstring version nor the xml.etree.ElementTree.XML version retrieves the list of rows. However, if I make a dummy table like this one:

s = "<table> <tr><td>a</td><td>1</td></tr> <tr><td>b</td><td>2</td></tr> <tr><td>c</td><td>3</td></tr> </table>" 

then the parsing works fine.

What am I doing wrong? Do I need to do some cleanup before parsing the file?

Answer


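Your dummy table does not have the same structure as the Wikipedia example. In the Wikipedia table the direct children of the table element are thead, tbody and tfoot, so iterating over it yields those sections rather than rows: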

>>> list(table) 
[<Element 'thead' at 0x7ff0fdb73f50>, <Element 'tbody' at 0x7ff0fdb78590>, <Element 'tfoot' at 0x7ff0fb995a90>] 
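The difference can be reproduced without the Wikipedia file. The sketch below uses a small made-up table (the sample string is an assumption, not the actual input) to show that iterating over the table yields the section elements, while iter("tr") walks the whole subtree and finds the rows at any depth.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the thead/tbody layout (not the real Wikipedia data).
sample = (
    "<table>"
    "<thead><tr><th>Actor</th><th>Born</th></tr></thead>"
    "<tbody><tr><td>A</td><td>1990</td></tr>"
    "<tr><td>B</td><td>1977</td></tr></tbody>"
    "</table>"
)

table = ET.fromstring(sample)

# Direct children are the sections, not the rows.
child_tags = [child.tag for child in table]

# iter() searches the whole subtree in document order, so nesting does not matter.
rows = list(table.iter("tr"))
row_texts = [[cell.text for cell in row] for row in rows]

print(child_tags)
print(row_texts)
```

This is only one way to reach the rows; indexing the sections explicitly keeps the header and body rows separate, which iter("tr") does not.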

You can get the column names from the header row:

>>> columns = list(k.text for k in table[0][0]) 

and then build the data table by iterating over each row of the body:

>>> import json
>>> data_table = list(dict(zip(columns, list(v.text for v in row))) for row in table[1])
>>> print(json.dumps(data_table, indent=2))
[ 
    { 
    "L,S": "L", 
    "Cause of death": "~", 
    "null": "F", 
    "Noms": "1", 
    "Wins": "0", 
    "Age": "26", 
    "Actor": null, 
    "Born": "1990", 
    "Film": null, 
    "Last": "~", 
    "WoF": "~", 
    "Died": "~", 
    "First": "2001" 
    }, 
    { 
    "L,S": "1L,1S", 
    "Cause of death": "~", 
    "null": "M", 
    "Noms": "2", 
    "Wins": "0", 
    "Age": "39", 
    "Actor": null, 
    "Born": "1977", 
    "Film": null, 
    "Last": "~", 
    "WoF": "~", 
    "Died": "~", 
    "First": "2001" 
    }, 

[...] 

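Note: there are some parsing problems with cells that contain links or other inner tags. Their .text is None, which is why Actor and Film show up as null above. This can be solved with itertext() or with deeper parsing.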
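A minimal sketch of the itertext() approach, assuming a made-up row whose first cell wraps its text in a link. The sample string and the cell_text helper are hypothetical and not part of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical row: the first cell wraps its text in a link, as Wikipedia cells often do.
sample = '<tr><td><a href="/wiki/X">Some Actor</a></td><td>1990</td></tr>'
row = ET.fromstring(sample)


def cell_text(cell):
    """Return all text inside a cell, including the text of nested tags."""
    return "".join(cell.itertext()).strip()


# .text only covers the text before the first child element, so it is None here.
plain = [cell.text for cell in row]

# itertext() yields every text fragment in the subtree, in document order.
full = [cell_text(cell) for cell in row]

print(plain)
print(full)
```

Replacing v.text with a helper like this in the data_table expression should fill in the cells that currently come out as null, provided the cell actually contains text somewhere in its subtree.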