Extracting text from an HTML paragraph with BeautifulSoup in Python

<p> 
    <a name="533660373"></a> 
    <strong>Title: Point of Sale Threats Proliferate</strong><br /> 
    <strong>Severity: Normal Severity</strong><br /> 
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br /> 
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br /> 
    <em>Analysis: Emboldened by past success and media attention, threat actors ..</em> 
    <br /> 
</p>

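This is the paragraph I want to extract from an HTML page using BeautifulSoup in Python. I can get the values inside the tags with the .children and .string attributes, but I cannot get the text "Several new Point of Sale malware fa...", which sits inside the paragraph without any tag of its own. I tried soup.p.text, .get_text() and so on, but none of them gave me just that text.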

Source

2014-12-24 remis haroon

Use find_all() with text=True to find all text nodes, and recursive=False to restrict the search to the direct children of the parent p tag:

from bs4 import BeautifulSoup 

data = """ 
<p> 
    <a name="533660373"></a> 
    <strong>Title: Point of Sale Threats Proliferate</strong><br /> 
    <strong>Severity: Normal Severity</strong><br /> 
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br /> 
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br /> 
    <em>Analysis: Emboldened by past success and media attention, threat actors ..</em> 
    <br /> 
</p> 
""" 

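# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
# Join the stripped direct text nodes of <p>; whitespace-only nodes become empty strings.
# Newer bs4 releases also accept string=True in place of text=True.
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))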

Prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
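
For comparison, here is a minimal sketch of the same idea written as an explicit loop over .children, alongside get_text(), which collects the text of nested tags too. The helper name direct_text and the shortened sample markup are assumptions made for illustration; they are not part of the original question or answer.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened, hypothetical sample markup with the same shape as the question's paragraph
html = """
<p>
    <strong>Title: Example</strong><br />
    Loose text directly inside the paragraph<br />
    <em>Analysis: nested text</em>
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Return the text of tag's direct text-node children, skipping nested tags."""
    parts = []
    for child in tag.children:
        # Text nodes are NavigableString instances; <strong>, <br>, <em> are Tag objects
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:  # drop whitespace-only nodes between tags
                parts.append(stripped)
    return " ".join(parts)


loose = direct_text(soup.p)
everything = soup.p.get_text(" ", strip=True)

print(loose)       # Loose text directly inside the paragraph
print(everything)  # also contains the <strong> and <em> text
```

get_text() walks all descendants, so it returns the untagged sentence mixed in with the text of the nested tags; limiting the search to direct children is what isolates it.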

Source

2014-12-24 05:38:42 alecxe
