BeautifulSoup: removing nested tags

I am writing a generic scraper with BeautifulSoup, and I am trying to detect the tags that have text available directly under them.

Consider this example:

<body> 
<div class="c1"> 
    <div class="c2"> 
     <div class="c3"> 
      <div class="c4"> 
       <div class="c5"> 
        <h1> A heading for section </h1> 
       </div> 
       <div class="c5"> 
        <p> Some para </p> 
       </div> 
       <div class="c5"> 
        <h2> Sub heading </h2> 
        <p> <span> Blah Blah </span> </p> 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</body>

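My goal here is to extract the div with class c4, because it holds all the text content. Everything before it, c1 through c3, is just wrappers to me.

One possible approach I came up with to identify the node is: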

if node.find(re.compile("^h[1-6]"), recursive=False) is not None: 
    return node.parent.parent

But it is too specific to this case.

Is there an optimized way to find text at a single level of recursion? That is, if I do something like

node.find(text=True, recursion_level=1)

then it should return text considering only the direct children.
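As far as I know, find has no recursion_level argument; the closest built-in option is recursive=False, which limits the search to direct children. Below is a minimal sketch of a direct-text check. The helper name direct_text is hypothetical, and it assumes that whitespace-only strings (the indentation between tags) and HTML comments should not count as text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def direct_text(node):
    """Return the non-blank strings that are direct children of node."""
    return [
        str(child).strip()
        for child in node.children
        if isinstance(child, NavigableString)
        and not isinstance(child, Comment)   # skip <!-- ... --> nodes
        and str(child).strip()               # skip whitespace-only strings
    ]


html = '<div class="c4"> <div class="c5"> <h1> A heading </h1> </div> </div>'
soup = BeautifulSoup(html, "html.parser")

print(direct_text(soup.find("div", class_="c4")))  # [] (only whitespace directly inside)
print(direct_text(soup.h1))                        # ['A heading']
```

Filtering out blank strings matters for pretty-printed pages, where every wrapper div has whitespace text nodes as direct children.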

This is my solution so far; I am not sure whether it works in all cases.

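def check_for_text(node):
    # Only direct children are searched, not deeper descendants
    return node.find(text=True, recursive=False)

def check_1_level_depth(node):
    if check_for_text(node):
        return check_for_text(node)

    # Otherwise look one level down (a list, since map is lazy in Python 3)
    return [check_for_text(child) for child in node.children]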

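In the code above, node is the soup element currently being checked, i.e. a div, span, etc. Also, assume that I am handling all exceptions raised in check_for_text() (AttributeError: 'NavigableString').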


2013-10-11 dpatro

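Have you tried using CSS selectors? 'soup.select(".c4")'

I can't use CSS selectors, because different websites will use different naming. – dpatro

Please note that [tag:data-mining] refers to complex statistical analysis of large amounts of data. You probably mean [tag:web-scraping], i.e. extracting text from web pages.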

It turns out I had to write a recursive function to eliminate the tags that have a single child. Here is the code:

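import bs4

# Pass soup.body in the following
def process_node(node):
    if type(node) == bs4.element.NavigableString:
        # Text node: return it as a plain string
        return str(node)
    else:
        if len(node.contents) == 1:
            # Single child: skip this wrapper and descend
            return process_node(node.contents[0])
        elif len(node.contents) > 1:
            # Several children: process each one
            return [process_node(child) for child in node.children]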

So far it works well, and it is fast.
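One caveat with pretty-printed markup like the example in the question: the whitespace between tags is parsed as text nodes, so a test on len(node.contents) == 1 may never succeed. Here is a hedged sketch of the same idea that ignores whitespace-only strings and returns the innermost wrapper rather than its text. The names significant_children and unwrap are hypothetical, not part of bs4, and the markup is a shortened version of the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def significant_children(node):
    """Children of node, skipping whitespace-only text nodes."""
    return [
        child for child in node.contents
        if not (isinstance(child, NavigableString) and not child.strip())
    ]


def unwrap(node):
    """Descend through tags whose only significant child is another tag."""
    children = significant_children(node)
    while len(children) == 1 and not isinstance(children[0], NavigableString):
        node = children[0]
        children = significant_children(node)
    return node


html = """
<body>
<div class="c1">
  <div class="c2">
    <div class="c4">
      <div class="c5"> <h1> A heading for section </h1> </div>
      <div class="c5"> <p> Some para </p> </div>
    </div>
  </div>
</div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
print(unwrap(soup.body)["class"])  # ['c4']
```

The loop stops at the first node with several significant children, or at a node whose only child is text, so no class names need to be known in advance.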


2013-10-16 01:42:24 dpatro

Use 'if isinstance(node, bs4.element.NavigableString):' instead of 'if type(node) == bs4.element.NavigableString:' – dm295
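The difference matters because bs4 defines NavigableString subclasses such as Comment. An exact type comparison misses them, so they would fall through to the branch that reads node.contents. A small illustration on a made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p><!-- note -->text</p>", "html.parser")
comment = soup.p.contents[0]  # the comment node comes first inside <p>

print(type(comment) is Comment)              # True
print(type(comment) == NavigableString)      # False: the exact type differs
print(isinstance(comment, NavigableString))  # True: subclasses match too
```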

I think what you need is something like this:

from bs4 import BeautifulSoup

# html is assumed to hold the markup from the question
bs = BeautifulSoup(html, "html.parser")
elements = bs.find_all()

previous_elements = []
found_element = None

for i in elements:
    if not i.string:
        previous_elements.append(i)
    else:
        found_element = i
        break

print("previous:")
for i in previous_elements:
    print(i.attrs)

print("found:")
print(found_element)

Output:

previous: 
{} 
{'class': ['c1']} 
{'class': ['c2']} 
{'class': ['c3']} 
{'class': ['c4']} 
found: 
<h1> A heading for section </h1>


2013-10-11 18:40:15

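Unlike the example shown, there may be multiple divs. Also, after finding the element that has a string, I should return that element's parent, right? – dpatro

Then you can iterate over all the divs

Instead of breaking at c5, save it and continue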
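A rough sketch of that suggestion, assuming the goal is to collect every element whose .string is set rather than stopping at the first one. The variable names and the shortened markup are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div class="c4">
  <div class="c5"> <h1> A heading for section </h1> </div>
  <div class="c5"> <p> Some para </p> </div>
  <div class="c5"> <h2> Sub heading </h2> <p> <span> Blah Blah </span> </p> </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep every element that wraps a single string, instead of breaking on the first.
found_elements = [tag for tag in soup.find_all() if tag.string]

for tag in found_elements:
    # tag.parent is the enclosing element the asker wanted to return
    print(tag.name, "->", tag.parent.get("class"))
```

Note that the last p is skipped because it holds whitespace plus a span, so its .string is None; the span inside it is picked up instead.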
