Python BeautifulSoup: remove all tags/content that have a specific tag with specific text following it

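I am using BeautifulSoup in Python and would like to remove, from a string, everything enclosed in a certain tag that contains a specific unclosed tag followed by specific text. In this example, I want to remove every document whose type tag contains the text DOCA.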

Say I have something like this:

<body> 
    <document> 
     <type>DOCA 
      <sequence>1 
      <filename>DOCA.htm 
      <description>FORM DOCA 
      <text> 
       <title>Form DOCA</title> 
       <h5 align="left"><a href="#toc">Table of Contents</a></h5> 
    </document> 
    <document> 
     <type>DOCB 
     <sequence>1 
     <filename>DOCB.htm 
     <description>FORM DOCB 
     <text> 
      <title>Form DOCB</title> 
      <h5 align="left"><a href="#toc">Table of Contents</a></h5> 
    </document> 
</body>

What I want to do is remove every <document> that has a <type> of DOCA. I tried the following, but it does not work:

>>> print(soup.find('document').find('type', text = re.compile('DOCA.*')))
None

Source

2017-07-07 cullan

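You can find all the documents, then, within each document, find all the type tags, check whether DOCA appears in any of them, and if it does, remove the whole enclosing document.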

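from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'html.parser')  # pass the markup string in place of ...

# find_all() returns a list, so detaching items while looping over it is safe.
for doc in soup.find_all('document'):
    for type_tag in doc.find_all('type'):
        if 'DOCA' in type_tag.text:
            doc.extract()  # detach the whole <document> from the tree
            break

print(soup)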

Output:

<body> 

<document> 
<type>DOCB 
     <sequence>1 
     <filename>DOCB.htm 
     <description>FORM DOCB 
     <text> 
<title>Form DOCB</title> 
<h5 align="left"><a href="#toc">Table of Contents</a></h5> 
</text></description></filename></sequence></type></document> 
</body>
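
The closing tags in that output suggest why the attempt in the question printed None: html.parser nests each unclosed tag inside the previous one, so <type> ends up with child tags and its .string is None, which a text filter cannot match. The following is a minimal sketch of that behaviour, assuming a trimmed-down copy of the question's markup; it uses string=, the current spelling of the older text= argument.

```python
import re

from bs4 import BeautifulSoup

# Trimmed-down copy of the sample markup from the question (an assumption
# made for brevity; the unclosed <type>/<sequence>/<text> tags are kept).
markup = """<body>
<document>
<type>DOCA
<sequence>1
<text>
<title>Form DOCA</title>
</document>
</body>"""

soup = BeautifulSoup(markup, "html.parser")
type_tag = soup.find("type")

# <type> is never closed, so <sequence> is parsed as its descendant.
nested_sequence = type_tag.find("sequence")

# A tag with more than one child has no single string, so .string is None
# and a string=/text= filter on <type> finds nothing.
type_string = type_tag.string
by_string = soup.find("type", string=re.compile("DOCA.*"))

# .text / get_text() joins every descendant string, so a substring test works.
has_doca = "DOCA" in type_tag.get_text()

print(nested_sequence is not None, type_string, by_string, has_doca)
```

This is also why the answers here test `'DOCA' in tag.text` instead of filtering on the tag's string.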

Source

2017-07-07 15:14:34

You can pass a lambda to the find method to select an element, for example:

soup.find('document').find(lambda tag : tag.name == 'type' and 'DOCA' in tag.text)

Then you can use extract or decompose to remove the element.
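
As a rough illustration of the difference between the two: extract() detaches the tag and returns it, while decompose() detaches and destroys it. The sketch below assumes a simplified stand-in for the question's markup, and `find_doca_document` is a hypothetical helper (not part of BeautifulSoup) that climbs from the matched <type> to its enclosing <document> with find_parent.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's markup (unclosed <type>/<sequence>).
MARKUP = """<body>
<document><type>DOCA<sequence>1</document>
<document><type>DOCB<sequence>1</document>
</body>"""


def find_doca_document(soup):
    # Hypothetical helper: locate the matching <type> with the lambda,
    # then climb to the <document> that contains it.
    type_tag = soup.find("document").find(
        lambda tag: tag.name == "type" and "DOCA" in tag.text
    )
    return type_tag.find_parent("document") if type_tag else None


# extract() removes the tag from the tree and returns it for later use.
soup_a = BeautifulSoup(MARKUP, "html.parser")
removed = find_doca_document(soup_a).extract()
print(removed.name, next(removed.stripped_strings))

# decompose() removes the tag and destroys it; it returns nothing.
soup_b = BeautifulSoup(MARKUP, "html.parser")
find_doca_document(soup_b).decompose()
print(len(soup_b.find_all("document")))
```

If the removed document is never needed again, decompose() is probably the simpler choice; extract() is useful when the detached part should be kept or inspected.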

Edit: use this statement to select all elements:

soup.find_all(lambda tag:tag.name == 'document' 
    and tag.find(lambda t:t.name == 'type' and 'DOCA' in t.text))
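
One thing to watch with the substring test: because the unclosed <type> swallows the rest of the document, its .text covers the whole document body, so a DOCB document that merely mentions DOCA would match too. The sketch below contrasts the lambda above with a stricter variant. The markup is made up for illustration, and `is_type` is a hypothetical helper that compares only the first piece of text inside <type>.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second document is DOCB but mentions DOCA in its title.
markup = """<body>
<document><type>DOCA<sequence>1<text><title>Form DOCA</title></document>
<document><type>DOCB<sequence>1<text><title>Replaces DOCA</title></document>
</body>"""


def is_type(tag, value):
    # Hypothetical helper: look only at the first non-blank string in <type>.
    return tag.name == "type" and next(tag.stripped_strings, "") == value


soup = BeautifulSoup(markup, "html.parser")

# Substring test from the answer: matches both documents here.
loose = soup.find_all(
    lambda tag: tag.name == "document"
    and tag.find(lambda t: t.name == "type" and "DOCA" in t.text)
)

# Stricter test: matches only the document whose type value is DOCA.
strict = soup.find_all(
    lambda tag: tag.name == "document"
    and tag.find(lambda t: is_type(t, "DOCA"))
)

# Remove the strictly matched documents from the tree.
for doc in strict:
    doc.decompose()

remaining_types = [next(t.stripped_strings, "") for t in soup.find_all("type")]
print(len(loose), len(strict), remaining_types)
```

Whether the stricter comparison is needed depends on the data; if no other document ever mentions DOCA, the substring test behaves the same.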

Source

2017-07-07 15:31:37

Which method would be faster for large files? Using this lambda, or looping as in @COLDSPEED's answer? – cullan

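I'm not sure. @COLDSPEED uses 2 loops, which should be a bit slower. On the other hand, his code removes the elements immediately and produces a clean soup object, whereas my code produces a list of the unwanted items –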

@cullan 1.23 ms (mine) vs 1.33 ms + (other overhead to remove things) (adam's) –
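
The figures above come from the commenter's own run. To compare the two approaches on your own data, something like the following sketch could be used. It assumes synthetic markup (50 DOCA and 50 DOCB documents), and `remove_with_loops`, `remove_with_lambda` and `best_time` are hypothetical names. Parsing is kept outside the timed region so that only the search and removal are measured.

```python
import time

from bs4 import BeautifulSoup

# Synthetic test markup (an assumption): alternating DOCA/DOCB documents.
MARKUP = "<body>" + "".join(
    f"<document><type>{name}<sequence>1<text><title>Form {name}</title></document>"
    for name in ("DOCA", "DOCB") * 50
) + "</body>"


def remove_with_loops(soup):
    # Nested-loop approach from the first answer.
    for doc in soup.find_all("document"):
        for type_tag in doc.find_all("type"):
            if "DOCA" in type_tag.text:
                doc.extract()
                break


def remove_with_lambda(soup):
    # Lambda-based selection from the second answer, followed by removal.
    matches = soup.find_all(
        lambda tag: tag.name == "document"
        and tag.find(lambda t: t.name == "type" and "DOCA" in t.text)
    )
    for doc in matches:
        doc.extract()


def best_time(remove, repeats=10):
    # Parse a fresh soup for every run; time only the removal step.
    best = float("inf")
    for _ in range(repeats):
        soup = BeautifulSoup(MARKUP, "html.parser")
        start = time.perf_counter()
        remove(soup)
        best = min(best, time.perf_counter() - start)
    return best


print(f"loops:  {best_time(remove_with_loops) * 1000:.2f} ms")
print(f"lambda: {best_time(remove_with_lambda) * 1000:.2f} ms")
```

The results will vary with the machine, the parser and the shape of the documents, so treat any single number as indicative only.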
