2016-08-12

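How to identify the correct td tag with bs4 in Python when the HTML is inconsistent

I am using BeautifulSoup4 in Python to parse some HTML. I have managed to drill down to the correct table and identify the td tags, but the style attribute on those tags is applied inconsistently, which makes picking out the correct td tag a real challenge.

The data I am trying to pull is a date field, but at any time there are several td tags hidden with CSS (which one is visible depends on other option values selected in the HTML).

Actual examples: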

<td style="display: none;">01/03/2016</td> 
<td style="display: table-cell;">27/10/2015</td> <-- this is the tag I want 

<td style="display:none">23/02/2016</td> 
<td style="">09/05/2011</td> <-- this is the tag I want 
<td style="display: none;">29/03/2011</td> 
<td style="display:none">19/10/2010</td> 

<td>27/10/2015</td> <-- this is the tag I want 
<td style="display: none">01/03/2016</td> 
<td style="display: none">22/03/2016</td> 

<td style="display:none">11/04/2015</td> 
<td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want 
<td style="display: none">18/10/2013</td> 

How would I exclude/remove the incorrect items (those styled display:none or display: none) so that I am left with the one I actually want?

Answers

1

Filter the tds with a list comprehension, keeping a td only if its style attribute is not in the set {"display:none", "display: none;", "display: none"}:

In [7]: from bs4 import BeautifulSoup

In [8]: h1 = """<td style="display: none;">01/03/2016</td>
    ...: <td style="display: table-cell;">27/10/2015</td>"""

In [9]: h2 = """<td style="display:none">23/02/2016</td>
    ...: <td style="">09/05/2011</td> <-- this is the tag I want
    ...: <td style="display: none;">29/03/2011</td>
    ...: <td style="display:none">19/10/2010</td>"""

In [10]: h3 = """<td>27/10/2015</td> <-- this is the tag I want
    ....: <td style="display: none">01/03/2016</td>
    ....: <td style="display: none">22/03/2016</td>"""

In [11]: h4 = """<td style="display:none">11/04/2015</td>
    ....: <td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
    ....: <td style="display: none">18/10/2013</td>"""

In [12]: ignore = {"display:none", "display: none;", "display: none"}

In [13]: for html in [h1, h2, h3, h4]:
    ....:   soup = BeautifulSoup(html, "html.parser")
    ....:   print([td for td in soup.find_all("td") if td.get("style") not in ignore])
    ....:  
[<td style="display: table-cell;">27/10/2015</td>] 
[<td style="">09/05/2011</td>] 
[<td>27/10/2015</td>] 
[<td style="display: table-cell;">02/02/2016</td>] 
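
A fixed set of strings only matches the exact spellings listed, so a variant such as display:none; or DISPLAY: none would slip through. If the markup varies further, a more tolerant check could parse the style value instead. The following is a minimal sketch, not part of the session above: is_hidden is a hypothetical helper name, and it assumes that only the inline style attribute decides visibility (no stylesheets, no !important).

```python
def is_hidden(style):
    """Return True if an inline style string sets display to none.

    Hypothetical helper: it only looks at the inline style value.
    """
    if not style:  # attribute missing (None) or empty ("")
        return False
    hidden = False
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip().lower() == "display":
            # A later display declaration overrides an earlier one.
            hidden = value.strip().lower() == "none"
    return hidden


# Style values taken from the td tags in the question.
styles = ["display: none;", "display: table-cell;", "display:none", "", None]
print([s for s in styles if not is_hidden(s)])
# prints ['display: table-cell;', '', None]
```

With such a helper, the condition in the comprehension above would become: if not is_hidden(td.get("style")).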

Thank you very much, this works perfectly – Matt