2017-02-04 94 views
2

How to remove any HTML tags inside a specific pattern with BeautifulSoup

<p>
A
<span>die</span>
    is thrown \(x = {-b \pm
    <span>\sqrt</span>
    {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
both the throws?
</p>

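In the HTML above, I need to remove the tags only between the "\(" and "\)" delimiters, i.e. inside \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\). I have just started with BeautifulSoup. Is there any way to achieve this with BeautifulSoup?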

Answers

2

I figured out a solution to my problem. I hope it helps others. Feel free to suggest improvements to the code.

from bs4 import BeautifulSoup, Tag
import re

# Raw string, so backslashes such as \( and \s are kept literally.
html = r"""<p>
    A
    <span>die</span>
     is thrown \(x = {-b \pm
     <span>\sqrt</span>
     {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tag in soup.find_all('p'):
    match = 0  # Flag: set to 1 when '\(' is found, reset to 0 when '\)' is found.
    # Iterate over a copy of the children, because replace_with() edits the tree.
    for p_child in list(p_tag.children):
        is_tag = isinstance(p_child, Tag)
        # Text of the child, whether it is a Tag or a NavigableString.
        text = p_child.get_text() if is_tag else str(p_child)
        if mathml_start_regex.search(text):
            match = 1
        # Replace a Tag with its text; a NavigableString has no tags to remove.
        if match == 1 and is_tag:
            p_child.replace_with(text)
        if mathml_end_regex.search(text):
            match = 0

print(soup)

Output:

<p> 
    A 
    <span>die</span> 
     is thrown \(x = {-b \pm 
     \sqrt 
     {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from 
    both the throws? 
    </p> <p> Test </p>

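In the code above I search for all 'p' tags, which returns a bs4.element.ResultSet. The first for loop iterates over the result set to get each individual 'p' tag, and the second loop uses the .children generator to walk through that tag's children (a mix of NavigableString and Tag objects). Each child is searched for '\('. If it is found, match is set to 1. While match is 1, any child that is a Tag is replaced by its text using replace_with. Finally, when '\)' is found, match is set back to zero.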
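For anyone who wants the same flag idea without depending on BeautifulSoup, here is a rough sketch built on the standard library's html.parser instead. This is a swapped-in technique, not the code from the answer: the class name MathTagStripper and the function strip_tags_with_flag are made up for illustration, and the sketch assumes that comments and doctype declarations can be dropped.

```python
from html.parser import HTMLParser


class MathTagStripper(HTMLParser):
    # Re-emits the HTML it is fed, but skips every tag seen while the
    # flag is set, i.e. between the "\(" and "\)" delimiters.

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.match = 0  # 1 while inside a math span, 0 otherwise

    def handle_starttag(self, tag, attrs):
        if not self.match:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.match:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.match:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # The last delimiter in this text chunk decides the flag.
        start = data.rfind("\\(")
        end = data.rfind("\\)")
        if start != -1 or end != -1:
            self.match = 1 if start > end else 0
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_tags_with_flag(html):
    parser = MathTagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (r"<p>A <span>die</span> is thrown \(x = {-b \pm "
          r"<span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>")
print(strip_tags_with_flag(sample))
```

Unlike the child-by-child loop, this version also drops tags that are nested several levels deep inside the math span, because the flag is checked for every tag the parser meets.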

0

BeautifulSoup alone cannot extract a substring. You can use a regular expression for that.

from bs4 import BeautifulSoup 
import re 

html = r"""<p>
    A 
    <span>die</span> 
     is thrown \(x = {-b \pm 
     <span>\sqrt</span> 
     {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from 
    both the throws? 
    </p>""" 

soup = BeautifulSoup(html, 'html.parser') 

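print(re.findall(r'\\\(.*?\\\)', soup.text, re.DOTALL))

Output:

['\\(x = {-b \\pm \n     \\sqrt\n     {b^2-4ac} \\over 2a}\\)']

Regular expression:

\\\(.*?\\\) - matches the shortest substring that starts with \( and ends with \). The re.DOTALL flag lets . match newlines, so the formula may span several lines.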
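Whether the closing delimiter in the pattern is written as \) or \\\) starts to matter once the formula itself contains parentheses. A small comparison, using a made-up sample string (not taken from the question):

```python
import re

# Hypothetical sample: the formula contains its own parentheses.
text = r'Evaluate \(f(x) = (x + 1)^2\) at x = 3.'

# Closing part \) matches any ")", so the match ends at the first one.
loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)

# Closing part \\\) requires a backslash before ")", i.e. the real delimiter.
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)

print(loose)
print(strict)
```

The loose pattern cuts the formula off after f(x), while the strict one returns the whole span up to the closing delimiter.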

If you want to remove the newlines and extra spaces, you can do it like this:

res = re.findall(r'\\\(.*?\\\)', soup.text, re.DOTALL)[0]
# split() with no arguments splits on any run of whitespace, including newlines.
print(' '.join(res.split()))

Output:

\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\) 

Wrapping the string in HTML:

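# The lxml parser (a separate install) adds the <html><body><p> wrapper; html.parser would not.
print(BeautifulSoup(' '.join(res.split()), 'lxml'))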

Output:

<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html> 
+0

Hi, I expected the output to be [u'\\(x = {-b \\pm \n \\sqrt \n {b^2-4ac} \\over 2a}\\)']. Can you suggest a change to the regex? – waranlogesh

+0

@waranlogesh Sure. Add a backslash before the ')' as well. I have modified the solution. – MYGz

+0

Is there a way to save the printed changes back to the HTML? – waranlogesh
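On saving: when the soup object itself has been modified, str(soup) gives the updated markup, and that string can be written to a file like any other. If only the string is needed, a regex-only route is also possible. The following is a sketch under assumptions, not the method from either answer: the names MATH, TAG and strip_tags_in_math and the output file name cleaned.html are placeholders, and the simple TAG pattern would also remove math text that merely looks like a tag (for example a<b>c).

```python
import re
import tempfile
from pathlib import Path

MATH = re.compile(r'\\\(.*?\\\)', re.DOTALL)  # one whole \( ... \) span
TAG = re.compile(r'</?[A-Za-z][^>]*>')         # a simple opening or closing tag


def strip_tags_in_math(html):
    """Remove tags inside every math span and leave the rest untouched."""
    return MATH.sub(lambda m: TAG.sub('', m.group(0)), html)


html = (r'<p>A <span>die</span> is thrown \(x = {-b \pm '
        r'<span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>')
cleaned = strip_tags_in_math(html)
print(cleaned)

# Save the result; the path below is only a placeholder.
out_path = Path(tempfile.gettempdir()) / 'cleaned.html'
out_path.write_text(cleaned, encoding='utf-8')
```

Because the substitution works on the raw string, everything outside the math spans, including whitespace and the other span tag, is written back exactly as it was.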