2011-09-22 47 views

Beautiful Soup - how to fix broken tags

I want to know how to fix broken HTML tags before parsing them with Beautiful Soup.

In the script below, td> needs to be replaced with <td>.

How can I do the replacement so that Beautiful Soup sees the tag?

from BeautifulSoup import BeautifulSoup 

s = """ 
<tr> 
td>LABEL1</td><td>INPUT1</td> 
</tr> 
<tr> 
<td>LABEL2</td><td>INPUT2</td> 
</tr>""" 

a = BeautifulSoup(s) 

left = [] 
right = [] 

for tr in a.findAll('tr'): 
    l, r = tr.findAll('td') 
    left.extend(l.findAll(text=True)) 
    right.extend(r.findAll(text=True)) 

print left + right 

Answers


Edit (working):

I grabbed the complete (at least it should be complete) list of HTML tags from W3 to match against. Give it a try:

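import re

# Adjacent raw-string literals are joined without stray whitespace, unlike
# backslash continuations inside one string, which keep the indentation.
tags = (r"!--|!DOCTYPE|"
        r"a|abbr|acronym|address|applet|area|"
        r"b|base|basefont|bdo|big|blockquote|body|br|button|"
        r"caption|center|cite|code|col|colgroup|"
        r"dd|del|dfn|dir|div|dl|dt|"
        r"em|"
        r"fieldset|font|form|frame|frameset|"
        r"head|h1|h2|h3|h4|h5|h6|hr|html|"
        r"i|iframe|img|input|ins|"
        r"kbd|"
        r"label|legend|li|link|"
        r"map|menu|meta|"
        r"noframes|noscript|"
        r"object|ol|optgroup|option|"
        r"p|param|pre|"
        r"q|"
        r"s|samp|script|select|small|span|strike|strong|style|sub|sup|"
        r"table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|"
        r"u|ul|"
        r"var")

# A tag name that directly follows another tag's ">" gets its "<" restored.
fixedString = re.sub(r">\s*(" + tags + r")>", r"><\g<1>>", s)
bs = BeautifulSoup(fixedString)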

Produces:

>>> print s 

<tr> 
td>LABEL1</td><td>INPUT1</td> 
</tr> 
<tr> 
<td>LABEL2</td><td>INPUT2</td> 
</tr> 

>>> print fixedString

<tr><td>LABEL1</td><td>INPUT1</td> 
</tr> 
<tr> 
<td>LABEL2</td><td>INPUT2</td> 
</tr> 

This one should also work with broken closing tags (</endtag>):

import re

# Same tag list; group 1 captures the optional "/" of a closing tag.
tags = (r"!--|!DOCTYPE|"
        r"a|abbr|acronym|address|applet|area|"
        r"b|base|basefont|bdo|big|blockquote|body|br|button|"
        r"caption|center|cite|code|col|colgroup|"
        r"dd|del|dfn|dir|div|dl|dt|"
        r"em|"
        r"fieldset|font|form|frame|frameset|"
        r"head|h1|h2|h3|h4|h5|h6|hr|html|"
        r"i|iframe|img|input|ins|"
        r"kbd|"
        r"label|legend|li|link|"
        r"map|menu|meta|"
        r"noframes|noscript|"
        r"object|ol|optgroup|option|"
        r"p|param|pre|"
        r"q|"
        r"s|samp|script|select|small|span|strike|strong|style|sub|sup|"
        r"table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|"
        r"u|ul|"
        r"var")

fixedString = re.sub(r">\s*(/?)(" + tags + r")>", r"><\g<1>\g<2>>", s)
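For anyone who wants to try the idea without typing out the whole list, here is a minimal sketch that builds the pattern from a Python list. The shortened `TAGS` list and the helper name `fix_missing_lt` are made up for illustration and are not part of the answer above. Note the limits: it only repairs a tag name that directly follows another tag's `>` (optionally separated by whitespace), and it does not handle tags with attributes.

```python
import re

# Hypothetical, shortened tag list for illustration; extend it as needed.
TAGS = ["table", "tbody", "td", "th", "thead", "tr"]

# ">" + optional whitespace + optional "/" + a known tag name + ">"
BROKEN_TAG = re.compile(r">\s*(/?)(" + "|".join(TAGS) + r")>")


def fix_missing_lt(html):
    """Restore the missing "<" of a tag that directly follows another tag."""
    return BROKEN_TAG.sub(r"><\g<1>\g<2>>", html)


s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""

fixed = fix_missing_lt(s)
print(fixed)
```

The repaired string can then be handed to BeautifulSoup as in the question.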
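No luck. Output: <><><><<><><><><><><<><><> – howtodothis

@terra I fixed the regex and edited my answer; try the re.sub that is there now. I tested it and it should work fine. It checks against the complete list of HTML tags (pulled from W3). – chown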


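If your only concern is td> -> <td>, try:

myString = re.sub(r'(?<![</])td>', '<td>', myString) 

before passing myString to BeautifulSoup. The negative lookbehind skips the td> inside intact <td> and </td> tags, which a plain 'td>' pattern would also rewrite. If there are other broken tags, give us some examples and we'll keep working on it :)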
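To see how a plain `td>` pattern differs from one guarded by a negative lookbehind, here is a small self-contained sketch. The sample string and the names `my_string`, `naive`, and `careful` are made up for illustration. It assumes the only damage is an opening `<td>` that lost its `<`; a broken closing tag such as `/td>` is left untouched.

```python
import re

my_string = "<tr>\ntd>LABEL1</td><td>INPUT1</td>\n</tr>"

# Plain replacement: also rewrites the "td>" inside intact <td> and </td>.
naive = re.sub("td>", "<td>", my_string)

# Negative lookbehind: match "td>" only when it is not preceded by "<" or "/".
careful = re.sub(r"(?<![</])td>", "<td>", my_string)

print(naive)
print(careful)
```

Only the second result is something you would want to pass to BeautifulSoup; the first one doubles the angle brackets of every tag that was already fine.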