2014-02-05 67 views

Parsing a table in Python with Beautiful Soup

So I have a table:

And I'm just trying to return a JSON string of the table pairs, like so:

[["Pig A", "Straw"], ["Pig B", "Stick"], ["Pig C", "Brick"]] 

However, with my code, I can't seem to get rid of the HTML tags:

stable = soup.find('table') 

cells = [ ] 
rows = stable.findAll('tr') 
for tr in rows[1:4]: 
    # Process the body of the table 
    row = [] 
    td = tr.findAll('td') 
    #td = [el.text for el in soup.tr.finall('td')] 
    row.append(td[0]) 
    row.append(td[1]) 
    cells.append(row) 


return cells 

# Eventually, I want to do this:
# h = json.dumps(cells)
# return h

My output looks like this:

[[<td>Pig A</td>, <td>Straw</td>], [<td>Pig B</td>, <td>Stick</td>], [<td>Pig C</td>, <td>Brick</td>]]

Answers

2

Use the text attribute to get the element's inner text:

row.append(td[0].text) 
row.append(td[1].text) 
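
Putting the fix together with the `json.dumps` step the question mentions, a minimal sketch might look like the following. The sample markup, the `table_pairs` function name, and the choice of the built-in `html.parser` backend are assumptions made for illustration; the question's real table is not shown. `get_text(strip=True)` is used here in place of `.text` so that stray whitespace around the cell text is dropped as well.

```python
import json

from bs4 import BeautifulSoup

# Sample markup assumed for illustration; the question's real table is not shown.
HTML = """
<table>
  <tr><td>Pig</td><td>House Type</td></tr>
  <tr><td>Pig A</td><td>Straw</td></tr>
  <tr><td>Pig B</td><td>Stick</td></tr>
  <tr><td>Pig C</td><td>Brick</td></tr>
</table>
"""


def table_pairs(html):
    """Return the first two cells of each body row as plain strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    cells = []
    # Skip the first row, which holds the column headings.
    for tr in table.find_all("tr")[1:]:
        tds = tr.find_all("td")
        # get_text() returns the text inside the tag, without the <td> markup.
        cells.append([td.get_text(strip=True) for td in tds[:2]])
    return cells


print(json.dumps(table_pairs(HTML)))
# prints [["Pig A", "Straw"], ["Pig B", "Stick"], ["Pig C", "Brick"]]
```

Because the list now holds plain strings rather than `Tag` objects, `json.dumps` can serialize it directly.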

Answer updated. Give it a try... – cvsguimaraes


Awesome, that worked for me, thanks! – kegewe


Please let me know which one worked for you. Give it a try :) – cvsguimaraes

0

You can try using the lxml library.

import lxml.html as PARSER

#data = open('example.html').read() # You can read it from a html file. 
#OR 
data = """ 
<table border="1" style="width: 100%"> 
    <caption></caption> 
    <col> 
    <col> 
    <tbody> 
<tr> 
    <td>Pig</td> 
    <td>House Type</td> 
</tr> 
<tr> 
    <td>Pig A</td> 
    <td>Straw</td> 
</tr> 
<tr> 
    <td>Pig B</td> 
    <td>Stick</td> 
</tr> 
<tr> 
    <td>Pig C</td> 
    <td>Brick</td> 
</tr> 
""" 
root = PARSER.fromstring(data)
main_list = []

# Collect the stripped text of every cell, one list per table row.
for tr in root.iter('tr'):
    main_list.append([td.text_content().strip() for td in tr.iter('td')])

print(main_list)

Output: [['Pig', 'House Type'], ['Pig A', 'Straw'], ['Pig B', 'Stick'], ['Pig C', 'Brick']]
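
To get from this list to the JSON string the question asks for, the header row can be dropped before serializing. The sketch below is one possible way to do that with lxml. The `rows_as_json` name, the `skip_header` parameter, and the shortened sample markup are assumptions for illustration, not part of the answer above.

```python
import json

import lxml.html

# Shortened sample markup, assumed for illustration.
HTML = """
<table>
  <tr><td>Pig</td><td>House Type</td></tr>
  <tr><td>Pig A</td><td>Straw</td></tr>
  <tr><td>Pig B</td><td>Stick</td></tr>
  <tr><td>Pig C</td><td>Brick</td></tr>
</table>
"""


def rows_as_json(html, skip_header=True):
    """Serialize the table rows to a JSON string, optionally dropping the first row."""
    root = lxml.html.fromstring(html)
    rows = [
        # text_content() returns the text of the cell without any markup.
        [td.text_content().strip() for td in tr.iter("td")]
        for tr in root.iter("tr")
    ]
    if skip_header:
        rows = rows[1:]
    return json.dumps(rows)


print(rows_as_json(HTML))
# prints [["Pig A", "Straw"], ["Pig B", "Stick"], ["Pig C", "Brick"]]
```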