Splitting an element with BeautifulSoup

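I have some HTML that I parse with BeautifulSoup. One of the requirements is that certain tags (images, in the example below) must not be nested inside paragraphs or other text tags.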

For example, if I have code like this:

<p> 
    first text 
    <a href="..."> 
     <img .../> 
    </a> 
    second text 
</p>

I need to transform it into this:

<p>first text</p> 
<img .../> 
<p>second text</p>

I wrote something that extracts the images and adds them after the paragraph, like this:

# soup is the BeautifulSoup object for the document being cleaned.
for match in soup.body.find_all(True, recursive=False):
    # find_all() returns a list, so the tree can be modified inside the loop;
    # changing the tree while iterating over .descendants is unreliable.
    for desc in match.find_all('img'):
        # has_attr() checks HTML attributes; hasattr() on a Tag is true
        # for almost any name, so it cannot be used for this test.
        if desc.has_attr('src'):
            # add image as an independent tag
            tag = soup.new_tag("img")
            tag['src'] = desc['src']
            tag['alt'] = desc.get('alt', '')
            match.insert_after(tag)

        # remove image from its container
        desc.extract()

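I wrote another piece of code that removes empty elements (such as tags left empty after their image is removed), but I don't know how to split an element into two separate elements.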
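The cleanup code mentioned here is not shown in the question. As a rough sketch of what such a pass could look like, assuming "empty" means a tag with no text and no child tags, and that void tags such as img should be kept (the `remove_empty` helper and the `VOID_TAGS` set are made up for illustration):

```python
from bs4 import BeautifulSoup

# Assumption: these tags are allowed to be empty and must survive the cleanup.
VOID_TAGS = {"img", "br", "hr"}


def remove_empty(soup):
    """Delete tags that contain neither text nor child tags."""
    # Reversed document order visits children before their parents, so a
    # wrapper that becomes empty after its child is deleted is removed too.
    for tag in reversed(soup.find_all(True)):
        if tag.name in VOID_TAGS:
            continue
        if tag.find(True) is None and not tag.get_text(strip=True):
            tag.decompose()


html = '<div><p>keep me</p><p><a href="#"></a></p><p>  </p><img src="a.png"/></div>'
soup = BeautifulSoup(html, "html.parser")
remove_empty(soup)
print(soup)
```

The paragraph that only held an empty link and the whitespace-only paragraph are dropped, while the image and the paragraph with text stay.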

Source

2012-09-27 alex.ac

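# split() is a built-in str method, so no import is needed.
# General form: the_string.split(the_separator[, the_limit])
the_string = "first text\nsecond text"
parts = the_string.split("\n")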

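This produces a list, so you can go through it with a for loop or pick out the elements manually.

the_limit is optional.

In your case I think the_separator should be "\n", but that depends on the situation. Parsing is very interesting, but it can sometimes be tricky.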
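As a small illustration of the `split` call described above, with and without the limit. The sample strings are invented; the last line shows what can happen when markup is split this way:

```python
text = "first text\nsecond text\nthird text"

# No limit: split at every separator.
parts = text.split("\n")

# Limit of 1: split only at the first separator.
head, rest = text.split("\n", 1)

# split() knows nothing about tags, so splitting markup at a newline
# can leave each piece with unbalanced tags.
fragments = "<p>first <b>bold\ntext</b></p>".split("\n")

print(parts)
print(head, "|", rest)
print(fragments)
```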

Source

2012-09-27 08:23:20 Develoger

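I'm trying to stay away from string parsing because I might end up with unclosed tags. I was hoping BeautifulSoup would know how to repair the HTML and keep it valid. Either way, I'll give it a try and see what happens :) –

Beautiful Soup has a prettify option, so call soup.prettify() to test it; it returns well-formatted HTML. – Develoger

@DušanRadojević Beautiful Soup always washes the HTML (: – Rubens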
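A quick way to check the point made in these comments, using a deliberately broken fragment (the input string is invented, and "html.parser" is just one parser choice; other parsers may repair the markup differently):

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in the input.
broken = "<p>first text<b>bold"
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree closes every tag that was left open.
repaired = str(soup)
print(repaired)

# prettify() returns the same tree as an indented, multi-line string.
print(soup.prettify())
```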

-1

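from bs4 import BeautifulSoup as bs
from bs4 import NavigableString
import re

html = """
<div>
<p> <i>begin </i><b>foo1</b><i>bar1</i>SEPATATOR<b>foo2</b>some text<i>bar2 </i><b>end </b> </p>
</div>
"""

def insert_tags(parent, tag_list):
    # Copy each top-level node into parent. Only a tag's name and its direct
    # string are copied, so attributes and nested tags are lost.
    for tag in tag_list:
        if isinstance(tag, NavigableString):
            insert_tag = s.new_string(tag.string)
        else:
            insert_tag = s.new_tag(tag.name)
            insert_tag.string = tag.string
        parent.append(insert_tag)

# Name the parser explicitly so fragments are not wrapped in <html><body>.
s = bs(html, "html.parser")
p = s.find('p')
print(s.div)

# Split the serialized paragraph at the separator.
m = re.match(r"^<p>(.*?)(SEPATATOR.*)</p>$", str(p))
part1 = m.group(1).strip()
part2 = m.group(2).strip()

# Build one new paragraph per part.
part1_p = s.new_tag("p")
insert_tags(part1_p, bs(part1, "html.parser").contents)

part2_p = s.new_tag("p")
insert_tags(part2_p, bs(part2, "html.parser").contents)

# Swap the original paragraph for the second part, then put the first part before it.
s.div.p.replace_with(part2_p)
s.div.p.insert_before(part1_p)
print(s.div)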

Since I don't use nested HTML for this purpose, this works for me. Admittedly, it still looks awkward. For my example it produces:

<div> 
<p><i>begin </i><b>foo1</b><i>bar1</i></p> 
<p>SEPATATOR<b>foo2</b>some text<i>bar2 </i><b>end </b></p> 
</div>
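For comparison, the split can also be done on the parse tree, without going back through a string and a regex. The following is only a sketch for the image case from the question. It assumes that any inline wrapper around an image (such as the `<a>`) holds nothing else worth keeping. `split_around_images` is a hypothetical helper name, not part of BeautifulSoup, and the `href` and `src` values are placeholders.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def split_around_images(soup, block):
    """Replace `block` with a flat run of paragraphs and bare <img> tags."""
    pieces = []     # new top-level nodes, in document order
    current = None  # paragraph currently being filled

    for child in list(block.contents):  # copy, because nodes are moved below
        images = []
        if isinstance(child, Tag):
            images = [child] if child.name == "img" else child.find_all("img")

        if images:
            # An image ends the current paragraph; its wrapper is discarded.
            current = None
            pieces.extend(img.extract() for img in images)
        elif isinstance(child, Tag) or child.strip():
            # Text or an image-free inline tag goes into the open paragraph.
            if current is None:
                current = soup.new_tag(block.name)
                pieces.append(current)
            current.append(child.extract())

    for piece in pieces:
        block.insert_before(piece)
    block.decompose()  # drops the old element and any leftover wrappers


html = """
<p>
    first text
    <a href="#"><img src="a.png"/></a>
    second text
</p>
"""
soup = BeautifulSoup(html, "html.parser")
split_around_images(soup, soup.p)
print(soup)
```

Whitespace inside the new paragraphs is kept exactly as it was in the source, so the output may need a strip or prettify pass if the exact layout matters.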

Source

2013-02-01 17:48:21 user1491229
