How to remove any HTML tags inside a specific pattern with BeautifulSoup

<p>
    A
    <span>die</span>
    is thrown \(x = {-b \pm
    <span>\sqrt</span>
    {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
</p>

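In the HTML above, I need to remove tags only inside the "\( ... \)" pattern, i.e. within \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\). I have only just started with BeautifulSoup. Is there any way to achieve this with BeautifulSoup?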

Source

2017-02-04 waranlogesh

I figured out a solution to my problem. I hope it helps others. Feel free to suggest improvements to the code.

from bs4 import BeautifulSoup
from bs4.element import Tag
import re

# Raw string so that \(, \pm, \sqrt and \over keep their backslashes.
html = r"""<p>
    A
    <span>die</span>
     is thrown \(x = {-b \pm
     <span>\sqrt</span>
     {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tags in soup.find_all('p'):
    match = 0  # Flag: set to 1 when '\(' is found, back to 0 when '\)' is found.
    # Iterate over a copy of the children because replace_with() modifies the tree.
    for p_child in list(p_tags.children):
        # A Tag exposes its text via get_text(); a NavigableString is already a string.
        text = p_child.get_text() if isinstance(p_child, Tag) else str(p_child)
        if mathml_start_regex.search(text):
            match = 1
        # Replace a Tag with its text; a NavigableString needs no change.
        if match == 1 and isinstance(p_child, Tag):
            p_child.replace_with(text)
        if mathml_end_regex.search(text):
            match = 0

print(soup)

Output:

<p> 
    A 
    <span>die</span> 
     is thrown \(x = {-b \pm 
     \sqrt 
     {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from 
    both the throws? 
    </p> <p> Test </p>

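In the code above I search for all 'p' tags, which returns a bs4.element.ResultSet. In the first for loop I iterate over the result set to get each individual 'p' tag, and in the second loop I use the .children generator to walk through that tag's children (which are a mix of navigable strings and tags). Each child is searched for '\('. If it is found, match is set to 1, and while match is 1 any child that is a tag is replaced by its text using replace_with. Finally, when '\)' is found, match is set back to zero.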
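The same flag-based walk can be wrapped in a function that returns the modified markup as a string. This is only a minimal sketch of the approach described above: the function name strip_tags_in_math, the constant names, and the shortened sample input are made up for illustration and are not part of the original answer.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

MATH_START = re.compile(r'\\\(')
MATH_END = re.compile(r'\\\)')

def strip_tags_in_math(html):
    # Hypothetical helper: unwrap tags that sit between \( and \) in each <p>.
    soup = BeautifulSoup(html, 'html.parser')
    for p_tag in soup.find_all('p'):
        inside = False
        # Copy the children first, because the loop modifies the tree.
        for child in list(p_tag.children):
            text = child.get_text() if isinstance(child, Tag) else str(child)
            if MATH_START.search(text):
                inside = True
            if inside and isinstance(child, Tag):
                child.replace_with(text)
            if MATH_END.search(text):
                inside = False
    return str(soup)

# Shortened sample input, assumed for this example only.
sample = r'<p>A <span>die</span> is thrown \(x = <span>\sqrt</span>{b}\) twice.</p>'
result = strip_tags_in_math(sample)
print(result)
```

Only the span inside the formula is unwrapped; the span around "die" lies outside the delimiters and is left alone. Like the loop above, this only looks at direct children of each 'p' tag, so tags nested more deeply are replaced together with their parent.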

Source

2017-02-08 05:22:40 waranlogesh

Beautiful Soup alone cannot extract a substring. You can use a regular expression.

from bs4 import BeautifulSoup
import re

# Raw string so that \(, \pm, \sqrt and \over keep their backslashes.
html = r"""<p>
    A
    <span>die</span>
     is thrown \(x = {-b \pm
     <span>\sqrt</span>
     {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>"""

soup = BeautifulSoup(html, 'html.parser')

# soup.text drops the tags; re.DOTALL lets '.' match the newlines inside the formula.
print(re.findall(r'\\\(.*?\)', soup.text, re.DOTALL))

Output:

['\\(x = {-b \\pm\n     \\sqrt\n     {b^2-4ac} \\over 2a}\\)']

Regular expression:

\\\(.*?\) - matches lazily from \( up to the first ) that follows.
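Because the pattern stops at the first ")", a formula that itself contains parentheses would be cut short. The sketch below compares it with a stricter variant that requires the closing "\)". The sample string is invented for illustration, and the stricter pattern is a suggestion rather than part of the original answer.

```python
import re

# Made-up sample text whose formula contains its own parentheses.
text = r'roll \(f(x) = \sqrt{x}\) twice'

# Pattern from the answer: ends at the first ')' after '\('.
loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)

# Stricter variant: ends only at a literal '\)'.
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)

print(loose)
print(strict)
```

For the formula in the question both patterns return the same match, since its only ")" is the one in the closing "\)".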

If you want to remove the newlines and extra spaces, you can do it like this:

res = re.findall(r'\\\(.*?\)', soup.text, re.DOTALL)[0]
# split() with no arguments breaks on any run of whitespace, including newlines.
print(' '.join(res.split()))

Output:

\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)

Wrapping the string in HTML:

# The 'lxml' parser (which must be installed) adds the <html><body><p> wrapper.
print(BeautifulSoup(' '.join(res.split()), 'lxml'))

Output:

<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html>

Source

2017-02-04 12:21:30 MYGz

Hi, I expected the output to be [u'\\(x = {-b \\pm \n \\sqrt \n {b^2-4ac} \\over 2a}\\)']. Can you suggest a change to the regular expression? – waranlogesh

@waranlogesh Sure. Add a backslash before the '(' and ')' as well. I have modified the solution. – MYGz

Is there any way to save the printed changes back to the HTML? – waranlogesh
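The last comment is not answered in the thread. One possible way to get the change back into the markup, sketched here under assumptions and not taken from either answer, is to run the substitution on the raw HTML string and strip tags only inside each "\( ... \)" span. The names MATH_SPAN, ANY_TAG and strip_tags_inside_math, and the shortened sample input, are hypothetical.

```python
import re

# A span from '\(' to the next '\)'; DOTALL lets it cross line breaks.
MATH_SPAN = re.compile(r'\\\(.*?\\\)', re.DOTALL)
# A rough tag matcher: '<' or '</' followed directly by a letter, up to '>'.
ANY_TAG = re.compile(r'</?[A-Za-z][^>]*>')

def strip_tags_inside_math(html):
    # Remove tags only inside each \( ... \) span; everything else is untouched.
    return MATH_SPAN.sub(lambda m: ANY_TAG.sub('', m.group(0)), html)

# Shortened sample input, assumed for this example only.
html = r'<p>A <span>die</span> \(x = <span>\sqrt</span>{b} \over 2\) twice.</p>'
cleaned = strip_tags_inside_math(html)
print(cleaned)
```

The resulting string can then be written to a file with an ordinary open(..., 'w') call. With the BeautifulSoup approach from the first answer, str(soup) gives the modified markup in the same way. Note that the rough tag matcher would also remove formula text such as "a<b>c", so it is only suitable when the formulas do not contain that kind of sequence.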
