2016-02-12 36 views

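Find the text of a sibling element when the original element matches a specific string

I want to extract some price data from a bunch of HTML tables. The tables contain various prices, and of course the table data tags don't carry anything useful.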

<div id="item-price-data"> 
    <table> 
    <tbody> 
     <tr> 
     <td class="some-class">Normal Price:</td> 
     <td class="another-class">$100.00</td> 
     </tr> 
     <tr> 
     <td class="some-class">Member Price:</td> 
     <td class="another-class">$90.00</td> 
     </tr> 
     <tr> 
     <td class="some-class">Sale Price:</td> 
     <td class="another-class">$80.00</td> 
     </tr> 
     <tr> 
     <td class="some-class">You save:</td> 
     <td class="another-class">$20.00</td> 
     </tr> 
    </tbody> 
    </table> 
</div> 

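The only price I care about is the one paired with the element whose text is "Normal Price:".

What I want to do is scan the table's descendants, find the <td> tag containing that text, and then pull the text from its sibling.

The problem I'm running into is that the BeautifulSoup descendants attribute returns a list of NavigableString objects, not Tag objects.

So, if I do this: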

from bs4 import BeautifulSoup 
from urllib import request 

html = request.urlopen(url) 
soup = BeautifulSoup(html, 'lxml') 

div = soup.find('div', {'id': 'item-price-data'}) 
table_data = div.find_all('td') 

for element in table_data:
    if element.get_text() == 'Normal Price:':
        price = element.next_sibling

print(price) 

I get nothing. Is there a simple way to get the string value?


I just ran this and got '$100.00'; am I missing something?


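Yes. There was something I wasn't getting either. I found that the Tag is there, but it isn't the next sibling. The next sibling is a carriage return.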
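
The observation in the comment above can be checked directly. In indented markup, the node right after the label cell is a whitespace-only text node, so next_sibling returns that newline rather than the price cell. The following is a minimal sketch, assuming markup shaped like the table in the question and using the built-in 'html.parser' instead of 'lxml'. find_next_sibling('td') skips over text nodes and returns the next <td> tag:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table; the line breaks between cells matter here.
html = """<table>
<tr>
<td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td>
</tr>
</table>"""

# 'html.parser' is an assumption made to avoid the lxml dependency.
soup = BeautifulSoup(html, 'html.parser')
label = soup.find('td', string='Normal Price:')

# next_sibling is the text node between the two cells (just a newline).
whitespace_node = label.next_sibling

# find_next_sibling('td') ignores text nodes and returns the next <td> Tag.
price_cell = label.find_next_sibling('td')
price_text = price_cell.get_text(strip=True)

print(repr(whitespace_node))
print(price_text)
```

This would also explain why the commenter who ran the code on markup without whitespace between the cells could see the price: with no text node in between, next_sibling is the price cell itself.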

Answer


You can use the find_next() method, possibly together with a small regular expression:

Demo:

>>> import re 
>>> from bs4 import BeautifulSoup 
>>> html = """<div id="item-price-data"> 
... <table> 
...  <tbody> 
...  <tr> 
...   <td class="some-class">Normal Price:</td> 
...   <td class="another-class">$100.00</td> 
...  </tr> 
...  <tr> 
...   <td class="some-class">Member Price:</td> 
...   <td class="another-class">$90.00</td> 
...  </tr> 
...  <tr> 
...   <td class="some-class">Sale Price:</td> 
...   <td class="another-class">$80.00</td> 
...  </tr> 
...  <tr> 
...   <td class="some-class">You save:</td> 
...   <td class="another-class">$20.00</td> 
...  </tr> 
...  </tbody> 
... </table> 
... </div>""" 
>>> soup = BeautifulSoup(html, 'lxml') 
>>> div = soup.find('div', {'id': 'item-price-data'}) 
>>> for element in div.find_all('td', text=re.compile('Normal Price')): 
...  price = element.find_next('td') 
...  print(price) 
... 
<td class="another-class">$100.00</td> 

If you don't want to use a regular expression for this, the following will work for you.

>>> table_data = div.find_all('td') 
>>> for element in table_data: 
...  if 'Normal Price' in element.get_text(): 
...   price = element.find_next('td') 
...   print(price) 
... 
<td class="another-class">$100.00</td> 
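
Both loops above print the whole <td> tag, while the question asks for the string value. One option is to call get_text() on the matched cell. The sketch below makes a few assumptions: find_price is a hypothetical helper name, the table is shortened, and the built-in 'html.parser' is used instead of 'lxml'.

```python
from bs4 import BeautifulSoup

# Shortened version of the table from the question.
html = """<div id="item-price-data"><table><tbody>
<tr><td class="some-class">Normal Price:</td><td class="another-class">$100.00</td></tr>
<tr><td class="some-class">Member Price:</td><td class="another-class">$90.00</td></tr>
</tbody></table></div>"""


def find_price(soup, label):
    """Return the text of the <td> following the cell that contains `label`, or None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    div = soup.find('div', {'id': 'item-price-data'})
    if div is None:
        return None
    for cell in div.find_all('td'):
        if label in cell.get_text():
            # Skip whitespace text nodes and take the next <td> in the same row.
            value_cell = cell.find_next_sibling('td')
            if value_cell is not None:
                return value_cell.get_text(strip=True)
    return None


soup = BeautifulSoup(html, 'html.parser')
normal_price = find_price(soup, 'Normal Price')
missing = find_price(soup, 'Clearance Price')  # label not present in the table
print(normal_price)
```

Returning None when the label is absent avoids the unbound price variable that the loop in the question would hit if no cell matched.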