2017-05-26

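Searching for HTML elements based on a specific word in the element's string

I'm trying to write a program that uses the Beautiful Soup module to find and replace tags on certain specified elements. However, I can't figure out how to "find" those elements by "searching" for a specific word in the element's string. Assuming I can get my code to "find" the elements by that word, I would then "unwrap" the elements' "p" tags and "wrap" them in new "h1" tags.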

Here is some example HTML as input:

<p> ExampleStringWord#1 needs to 「find」 this entire element based on the "finding" of the first word </p> 
<p> Example#2 this element ignored </p> 
<p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m 「searching」 for, even though the wording after the first word in the string is different </p>

Here is the code I have so far (searching by "ExampleStringWord#1"):

for h1_tag in soup.find_all(string="ExampleStringWord#1"): 
      soup.p.wrap(soup.h1_tag("h1")) 

Given the example HTML above as input, I would like the code to produce this:

<h1> ExampleStringWord#1 needs to 「find」 this entire element based on the "finding" of the first word </h1> 
<p> Example#2 this element ignored </p> 
<h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m 「searching」 for, even though the wording after the first word in the string is different </h1>

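However, my code only finds elements whose string is exactly "ExampleStringWord#1", and it excludes any element that contains text after that word. I believe I need to use a regular expression somehow to find elements that start with my specified word, regardless of the text that follows. I'm not very familiar with regular expressions, though, so I'm not sure how to approach this with the BeautifulSoup module.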

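I have also looked through the Beautiful Soup documentation on passing a regular expression as a filter (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression), but I couldn't get it to work in my case. I also reviewed several other posts about passing regular expressions to Beautiful Soup, but I didn't find anything that adequately addressed my problem. Any help is appreciated!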

Answer

2

You can find the p elements whose string contains the specified text (note the re.compile() part) and then change each element's name to h1:

import re 

from bs4 import BeautifulSoup 

data = """ 
<body> 
    <p> ExampleStringWord#1 needs to 「find」 this entire element based on the "finding" of the first word </p> 
    <p> Example#2 this element ignored </p> 
    <p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m 「searching」 for, even though the wording after the first word in the string is different </p> 
</body> 
""" 

soup = BeautifulSoup(data, "html.parser") 
for p in soup.find_all("p", string=re.compile("ExampleStringWord#1")): 
    p.name = 'h1' 
print(soup) 

Prints:

<body> 
    <h1> ExampleStringWord#1 needs to 「find」 this entire element based on the "finding" of the first word </h1> 
    <p> Example#2 this element ignored </p> 
    <h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m 「searching」 for, even though the wording after the first word in the string is different </h1> 
</body>
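
Note that re.compile("ExampleStringWord#1") matches the word anywhere in the string, not only as the first word. As far as I know, the string= filter also only considers a tag whose .string is set, so a p with nested markup such as <b> would be skipped. If either matters, a function filter is one possible variation. The following is only a sketch: the helper name starts_with_word and the sample HTML are made up for illustration and are not part of Beautiful Soup or of the original question.

```python
import re

from bs4 import BeautifulSoup

# Illustrative input only (not the asker's data): the third <p> has nested
# markup and the fourth contains the word, but not as its first word.
html = """
<body>
<p> ExampleStringWord#1 plain text after the first word </p>
<p> Example#2 this element ignored </p>
<p> ExampleStringWord#1 text with <b>nested</b> markup </p>
<p> Prefix ExampleStringWord#1 is not the first word </p>
</body>
"""


def starts_with_word(word):
    """Hypothetical helper: build a find_all() filter for <p> tags whose first word is `word`."""
    # Allow leading whitespace, then the literal word, then whitespace or end of text.
    pattern = re.compile(r"\s*" + re.escape(word) + r"(?:\s|$)")

    def matcher(tag):
        # get_text() joins all descendant strings, so nested tags do not matter.
        return tag.name == "p" and pattern.match(tag.get_text()) is not None

    return matcher


soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all(starts_with_word("ExampleStringWord#1")):
    p.name = "h1"  # rename in place; contents and attributes are kept

print(soup)
```

With this input, the first and third elements should become h1 and the other two should stay as p. re.escape() is used so that the search word is always treated as literal text, whatever characters it contains.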