2014-12-24
<p> 
    <a name="533660373"></a> 
    <strong>Title: Point of Sale Threats Proliferate</strong><br /> 
    <strong>Severity: Normal Severity</strong><br /> 
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br /> 
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br /> 
    <em>Analysis: Emboldened by past success and media attention, threat actors ..</em> 
    <br /> 
</p> 

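This is the paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags with the .children and .string methods, but I cannot get the text "Several new Point of Sale malware fa...", which sits directly in the paragraph without any tag around it. I tried soup.p.text, .get_text() and so on, but none of them gave me just that text.

Extracting text from an HTML paragraph with BeautifulSoup in Python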

Answer

Use find_all() with text=True to find all text nodes, and recursive=False to search only the direct children of the parent p tag:

from bs4 import BeautifulSoup 

data = """ 
<p> 
    <a name="533660373"></a> 
    <strong>Title: Point of Sale Threats Proliferate</strong><br /> 
    <strong>Severity: Normal Severity</strong><br /> 
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br /> 
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br /> 
    <em>Analysis: Emboldened by past success and media attention, threat actors ..</em> 
    <br /> 
</p> 
""" 

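# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
# Keep only the text nodes that are direct children of <p>; newer bs4 releases also accept string=True
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))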

Prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,.. 
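
Since the question mentions .children, the same result can probably also be reached by walking the direct children of the p tag and keeping only the NavigableString nodes. The following is a minimal sketch under that assumption, not part of the original answer; the helper name direct_text and the shortened HTML sample are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened version of the paragraph from the question (illustrative sample)
html = """
<p>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors ..</em>
</p>
"""


def direct_text(tag):
    """Return the text sitting directly inside `tag`, skipping nested tags.

    Hypothetical helper: it is not a BeautifulSoup API, just a wrapper
    around iterating over tag.children.
    """
    pieces = (
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)  # text nodes only, no Tag objects
    )
    # Drop the whitespace-only nodes that sit between the tags
    return " ".join(piece for piece in pieces if piece)


soup = BeautifulSoup(html, "html.parser")
bare_text = direct_text(soup.p)
print(bare_text)
```

Joining with a space rather than an empty string keeps words apart if the paragraph happens to contain more than one untagged text run.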