2016-07-17 14 views

Parsing inner td tags with BeautifulSoup4 in Python

I am working on a small project and am having trouble parsing the lines I need out of some HTML with bs4.

HTML:

<div id="results_box"> 
<table class="genTbl closedTbl historicalTbl" id="curr_table"> 
    <thead> 
     <tr> 
      <th class="first left noWrap">Date</th> 
      <th class="noWrap">Price</th> 
      <th class="noWrap">Open</th> 
      <th class="noWrap">High</th> 
      <th class="noWrap">Low</th> 
      <th class="noWrap">Vol.</th>   <th class="noWrap">Change %</th> 
     </tr> 
    </thead> 
    <tbody> 
      <tr> 
      <td class="first left bold noWrap">Jul 15, 2016</td> 
      <td class="redFont">98.78</td> 
      <td>99.02</td> 
      <td>99.30</td> 
      <td>98.51</td> 
      <td>30.14M</td>   <td class="bold redFont">-0.01%</td> 
     </tr> 
       <tr> 
      <td class="first left bold noWrap">Jul 14, 2016</td> 
      <td class="greenFont">98.79</td> 
      <td>97.39</td> 
      <td>98.99</td> 
      <td>97.32</td> 
      <td>38.92M</td>   <td class="bold greenFont">1.98%</td> 
     </tr> 

I need to extract -0.01% and 1.98% from the following two lines:

<td class="bold redFont">-0.01%</td> 
<td class="bold greenFont">1.98%</td> 

I used:

import re

L = []  # collects the text of every matching cell
txt = parsed_html.find("table", {"id": "curr_table"}).find_all("td", {"class": re.compile('bold .*Font')})
for row in txt:
    L.append(row.text)
print(L)

but I get an empty list instead of -0.01% and 1.98%. Any solutions or other suggestions?

Answers


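The reason your current approach does not work is that class is a special multi-valued attribute in BeautifulSoup: the regular expression is not applied to the complete attribute value, but to each individual class instead. A separate thread covers this in detail.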
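To make the multi-valued behaviour concrete, here is a minimal sketch on a trimmed copy of the table from the question. It shows that the class attribute comes back as a list of separate names, and then matches the wanted cells with a filter function that inspects that list. The helper name `is_change_cell` is hypothetical, and the `html.parser` backend is an assumption (the question does not say which parser was used).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table (only the cells that matter here).
html = """
<table id="curr_table">
  <tr>
    <td class="first left bold noWrap">Jul 15, 2016</td>
    <td class="redFont">98.78</td>
    <td class="bold redFont">-0.01%</td>
  </tr>
  <tr>
    <td class="first left bold noWrap">Jul 14, 2016</td>
    <td class="greenFont">98.79</td>
    <td class="bold greenFont">1.98%</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption

# The class attribute is exposed as a list of class names,
# not as the single string "bold redFont".
last_cell = soup.find_all("td")[2]
print(last_cell["class"])


def is_change_cell(tag):
    """Hypothetical helper: a <td> with class "bold" plus a class ending in "Font"."""
    classes = tag.get("class") or []
    return (
        tag.name == "td"
        and "bold" in classes
        and any(name.endswith("Font") for name in classes)
    )


table = soup.find("table", {"id": "curr_table"})
changes = [td.get_text() for td in table.find_all(is_change_cell)]
print(changes)
```

Checking the list of class names directly avoids relying on how a regular expression is matched against the attribute.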

In fact, you can avoid checking the class values altogether and simply grab the td elements whose text ends with %:

table = parsed_html.find("table", {"id": "curr_table"})
# `string=` is the current name of the older `text=` argument (Beautiful Soup 4.4.0+)
for td in table.find_all("td", string=lambda text: text and text.endswith('%')):
    print(td.get_text())

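I would actually use pandas to parse this well-formed table into a dataframe, which is quite convenient to work with. pandas provides extensive documentation to help you learn how to work with dataframes: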

import pandas as pd 

data = """ 
<table class="genTbl closedTbl historicalTbl" id="curr_table"> 
    <thead> 
     <tr> 
      <th class="first left noWrap">Date</th> 
      <th class="noWrap">Price</th> 
      <th class="noWrap">Open</th> 
      <th class="noWrap">High</th> 
      <th class="noWrap">Low</th> 
      <th class="noWrap">Vol.</th>   <th class="noWrap">Change %</th> 
     </tr> 
    </thead> 
    <tbody> 
      <tr> 
      <td class="first left bold noWrap">Jul 15, 2016</td> 
      <td class="redFont">98.78</td> 
      <td>99.02</td> 
      <td>99.30</td> 
      <td>98.51</td> 
      <td>30.14M</td>   <td class="bold redFont">-0.01%</td> 
     </tr> 
       <tr> 
      <td class="first left bold noWrap">Jul 14, 2016</td> 
      <td class="greenFont">98.79</td> 
      <td>97.39</td> 
      <td>98.99</td> 
      <td>97.32</td> 
      <td>38.92M</td>   <td class="bold greenFont">1.98%</td> 
     </tr> 
    </tbody> 
</table> 
""" 

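from io import StringIO  # newer pandas releases expect a file-like object, not a literal HTML string

df = pd.read_html(StringIO(data))[0]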
print(df) 

print("----") 
print(df['Change %'].tolist()) 

Prints:

           Date  Price   Open   High    Low    Vol.  Change %
0  Jul 15, 2016  98.78  99.02  99.30  98.51  30.14M    -0.01%
1  Jul 14, 2016  98.79  97.39  98.99  97.32  38.92M     1.98%
---- 
['-0.01%', '1.98%']
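If the percentages are needed as numbers rather than strings, one possible follow-up is to strip the percent sign and cast the column to float. This is only a sketch: the new column name `Change` is made up for illustration, and the small frame below is rebuilt by hand from the two values printed above so the snippet runs without an HTML parser.

```python
import pandas as pd

# Same values as the 'Change %' column printed above.
df = pd.DataFrame({"Change %": ["-0.01%", "1.98%"]})

# Strip the trailing percent sign, then convert the remaining text to floats.
# "Change" is a hypothetical column name chosen for this example.
df["Change"] = df["Change %"].str.rstrip("%").astype(float)

print(df["Change"].tolist())
```

The same two lines should work on the frame returned by read_html, as long as every cell in the column ends with a percent sign.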