2011-05-31 31 views
2

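I would like to select the table containing the text "Accounts Payable", but I'm not getting anywhere with what I've tried, and I'm pretty much guessing with findAll. Can someone show me how to do this? Using Python and BeautifulSoup, how do I select the desired table in a div?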

For example, this is what I start with:

<div> 
<tr> 
<td class="lft lm">Accounts Payable 
</td> 
<td class="r">222.82</td> 
<td class="r">92.54</td> 
<td class="r">100.34</td> 
<td class="r rm">99.95</td> 
</tr> 
<tr> 
<td class="lft lm">Accrued Expenses 
</td> 
<td class="r">36.49</td> 
<td class="r">33.39</td> 
<td class="r">31.39</td> 
<td class="r rm">36.47</td> 
</tr> 
</div> 

And this is the result I would like to get:

<tr> 
<td class="lft lm">Accounts Payable 
</td> 
<td class="r">222.82</td> 
<td class="r">92.54</td> 
<td class="r">100.34</td> 
<td class="r rm">99.95</td> 
</tr> 

Let me know if you have any specific questions about my solution to your problem. – RedBlueThing 2011-06-02 07:51:33

Answers

8

You can select the td elements with the class lft lm and then check element.string to determine whether you have the "Accounts Payable" td:

# BeautifulSoup 3 import; with the newer bs4 package use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

# so_soup.txt holds the HTML from the question
with open("so_soup.txt", "r") as f:
    data = f.read()

soup = BeautifulSoup(data)

# Select every <td> whose class attribute is exactly "lft lm"
cells = soup.findAll('td', {"class": "lft lm"})
for cell in cells:
    # Compare cell.string.strip() against "Accounts Payable" to find the cell you want
    print(cell.string)
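
As a minimal sketch of the same idea with the newer bs4 package (an assumption; the answer itself uses BeautifulSoup 3), the snippet below finds the label cell and then climbs to its enclosing tr, which is the element the question asks for. The helper name find_row and the inline HTML constant are hypothetical, and the built-in "html.parser" backend is assumed because it leaves the bare tr elements inside the div as they are.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
HTML = """<div>
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
<tr>
<td class="lft lm">Accrued Expenses
</td>
<td class="r">36.49</td>
<td class="r">33.39</td>
<td class="r">31.39</td>
<td class="r rm">36.47</td>
</tr>
</div>"""


def find_row(html, label):
    """Return the <tr> whose label cell matches `label`, or None (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="lft lm" matches the exact string value of the class attribute
    for cell in soup.find_all("td", class_="lft lm"):
        # get_text(strip=True) drops the trailing newline inside the cell
        if cell.get_text(strip=True) == label:
            return cell.find_parent("tr")
    return None


row = find_row(HTML, "Accounts Payable")
# class_="r" matches any <td> that has "r" among its classes, including "r rm"
values = [td.get_text(strip=True) for td in row.find_all("td", class_="r")]
print(values)
```

Returning the row rather than the cell keeps the label and its figures together, so the caller can either read the values or serialize the whole tr.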

If you want to look at the siblings that follow the Accounts Payable cell, for example, you can use the following:

for cell in cells:
    if cell.string.strip() == "Accounts Payable":
        # Walk the remaining cells of the same row, one sibling at a time
        sibling = cell.findNextSibling()
        while sibling:
            print("\t" + sibling.string)
            sibling = sibling.findNextSibling()
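
The sibling walk above could also be written with bs4's find_next_siblings, which returns all following siblings in one call. This is only a sketch under the assumption that bs4 is available; the helper name sibling_values and the shortened sample markup are hypothetical, and the conversion to float is an addition for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the question's markup
SAMPLE = """<div>
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r rm">99.95</td>
</tr>
<tr>
<td class="lft lm">Accrued Expenses
</td>
<td class="r">36.49</td>
<td class="r rm">36.47</td>
</tr>
</div>"""


def sibling_values(html, label):
    """Return the numbers in the cells that follow the `label` cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td"):
        if cell.get_text(strip=True) == label:
            # Passing "td" skips the whitespace text nodes between cells
            return [float(s.get_text(strip=True))
                    for s in cell.find_next_siblings("td")]
    return []


payable = sibling_values(SAMPLE, "Accounts Payable")
accrued = sibling_values(SAMPLE, "Accrued Expenses")
print(payable, accrued)
```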

Update

If you want to print out the original HTML, only for the Accounts Payable element and the siblings that follow it, this is the code:

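lines = ["<tr>"]
for cell in cells:
    if cell.string.strip() == "Accounts Payable":
        # Keep the label cell itself, then every cell that follows it in the row.
        # .decode('ascii') applies to BeautifulSoup 3 on Python 2, where prettify()
        # returns a byte string; with bs4 on Python 3 it already returns str.
        lines.append(cell.prettify().decode('ascii'))
        sibling = cell.findNextSibling()
        while sibling:
            lines.append(sibling.prettify().decode('ascii'))
            sibling = sibling.findNextSibling()
lines.append("</tr>")

# Write the reconstructed row to a file
with open("so_soup_out.txt", "wt") as f:
    f.writelines(lines)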
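
With bs4 (again an assumption, not what the answer uses), the row can be serialized directly instead of being rebuilt cell by cell, because calling str() on a tag returns its markup. The sketch below uses a hypothetical helper row_html and a shortened sample; writing the result to a file would work the same way as in the code above.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the question's markup
DOC = """<div>
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r rm">99.95</td>
</tr>
<tr>
<td class="lft lm">Accrued Expenses
</td>
<td class="r">36.49</td>
<td class="r rm">36.47</td>
</tr>
</div>"""


def row_html(html, label):
    """Return the markup of the <tr> containing the `label` cell, or "" (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td"):
        if cell.get_text(strip=True) == label:
            # str() on a tag serializes the tag together with all of its children
            return str(cell.find_parent("tr"))
    return ""


out = row_html(DOC, "Accounts Payable")
print(out)
```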