2014-04-01 56 views
2

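Remove all child tags beyond a certain depth

Take this basic HTML as an example. How can I strip out every child node below a given depth, say depth 2, so that the tree is truncated at that level?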

<html> 
<head> 
    <title></title> 
    <meta /> 
    <meta /> 
    <link /> 
</head> 
<body> 
    <div> 
     <div> 
      <a></a> 
      <a></a> 
      <a></a> 
     </div> 
     <span> 
      <h1> 
       <li></li> 
       <li></li> 
      </h1> 
     </span> 
    </div> 
</body> 
</html> 

It would become something like:

<html> 
<head> 
    <title></title> 
    <meta /> 
    <meta /> 
    <link /> 
</head> 
<body> 
    <div> 
     <div></div> 
     <span></span> 
    </div> 
</body> 
</html> 

Answers

1

The idea is to iterate over all the elements and count the parents of each one:

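from bs4 import BeautifulSoup 


data = """your html goes here""" 

# Number of ancestors a tag must have to be removed. The count includes
# the BeautifulSoup object itself (html > body > div > div > a gives 5).
depth = 5 
soup = BeautifulSoup(data, "html.parser") 
# find_all() returns a list, so extracting tags while looping is safe.
for tag in soup.find_all(): 
    if len(list(tag.parents)) == depth: 
        tag.extract() 

print(soup.prettify()) 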

This prints:

<html> 
<head> 
    <title> 
    </title> 
    <meta/> 
    <meta/> 
    <link/> 
</head> 
<body> 
    <div> 
    <div></div> 
    <span></span> 
    </div> 
</body> 
</html> 
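
The hard-coded depth of 5 counts every ancestor up to the document object, so it changes whenever the target sits somewhere else in the page. Below is a minimal sketch of the same parent-counting idea with the depth measured relative to a chosen root instead. The name `truncate_below` and its parameters are hypothetical and not part of BeautifulSoup; the built-in `html.parser` backend is an assumption.

```python
from bs4 import BeautifulSoup


def truncate_below(root, max_depth):
    """Remove every tag nested more than max_depth levels below root.

    Direct children of root are at depth 1. Hypothetical helper, not a bs4 API.
    """
    # find_all() returns a list, so extracting tags while looping is safe.
    for tag in root.find_all(True):
        depth = 0
        node = tag
        # Walk up until we reach root, or run out of parents because an
        # ancestor was already extracted.
        while node is not None and node is not root:
            node = node.parent
            depth += 1
        if node is root and depth == max_depth + 1:
            tag.extract()  # detaches the tag together with its subtree


html = "<body><div><div><a></a><a></a></div><span><h1><li></li></h1></span></div></body>"
soup = BeautifulSoup(html, "html.parser")
truncate_below(soup.body, 2)
result = str(soup)
print(result)
```

Only tags exactly one level past the limit need to be extracted, since everything deeper goes away with them.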
0

Maybe something like this:

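# Assumes soup is the parsed BeautifulSoup document.
# find_all(recursive=False) yields only tag children, skipping text nodes.
for child in soup.body.find_all(recursive=False): 
    for element in child.find_all(recursive=False): 
        element.clear()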
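
The two nested loops fix the cut-off at two levels below `body`. A recursive sketch of the same approach that takes the number of levels as an argument might look like the following. `clear_below` is a hypothetical name, and the `html.parser` backend is an assumption. Note that `clear()` also removes any text inside the emptied tags.

```python
from bs4 import BeautifulSoup


def clear_below(tag, levels):
    """Empty every tag that sits `levels` levels below `tag`."""
    # recursive=False restricts the search to direct tag children.
    for child in tag.find_all(recursive=False):
        if levels == 1:
            child.clear()  # drop all contents but keep the tag itself
        else:
            clear_below(child, levels - 1)


html = "<body><div><div><a></a></div><span><h1><li></li></h1></span></div></body>"
soup = BeautifulSoup(html, "html.parser")
clear_below(soup.body, 2)
result = str(soup)
print(result)
```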