2015-06-09 65 views
1

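Extracting CSS and HTML comments from an HTML document (Python)

I am using Python to pass HTML code to BeautifulSoup, and my output is mangled by HTML comments. I have this Python script to remove HTML comments, but it fails to remove an HTML comment that has CSS comments embedded in it.

My code: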

    from bs4 import BeautifulSoup, Comment

    # Read the whole HTML document into one string
    with open('output.txt') as f:
        input_text = f.read()

    soup = BeautifulSoup(input_text)

    # Collect every HTML comment node and remove it from the tree
    comments = soup.find_all(string=lambda node: isinstance(node, Comment))
    for comment in comments:
        comment.extract()

    print(soup)

For example, it removes every HTML comment from my test input except this one:

<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Verdana; panose-1:2 11 6 4 3 5 4 4 2 4;} @font-face {font-family:inherit;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} --> 

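Below is a block of input that includes two comments that are removed successfully when I run my script, as well as the comment that is not removed: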

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="Generator" content="Microsoft Word 15 (filtered medium)"> <!--[if !mso]><style>v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} </style><![endif]--><style><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Verdana; panose-1:2 11 6 4 3 5 4 4 2 4;} @font-face {font-family:inherit;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} --></style><!--[if gte mso 9]><xml> <o:shapedefaults v:ext="edit" spidmax="1026" /> </xml><![endif]--><!--[if gte mso 9]><xml> <o:shapelayout v:ext="edit"> <o:idmap v:ext="edit" data="1" /> </o:shapelayout></xml><![endif]--> 

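I'm not sure of the best way to strip the CSS comments first. I don't need to remove the contents of the CSS comments, only the /* */ delimiters, because the rest should be stripped along with the HTML comment it is embedded in.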
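
A likely explanation for why that one comment survives: it sits inside a <style> element, and HTML parsers generally treat the contents of <style> as raw text, so a <!-- ... --> there is not reported as a comment node at all. The following is only a minimal sketch of that behavior using the standard library's html.parser rather than BeautifulSoup itself; the class name CommentProbe and the sample markup are made up for illustration, and other parser backends may differ in detail.

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Hypothetical helper: records which callback receives each piece of input."""

    def __init__(self):
        super().__init__()
        self.comments = []  # text delivered as HTML comments
        self.data = []      # text delivered as plain character data

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.data.append(data)


probe = CommentProbe()
probe.feed('<style><!-- /* Fonts */ p {margin:0in;} --></style><!-- plain comment -->')
probe.close()

print(probe.comments)        # only the comment outside <style> shows up here
print(''.join(probe.data))   # the "comment" inside <style> arrives as plain text
```

If the parser in use behaves this way, a filter that looks only for comment nodes never sees the text inside <style>, whether or not it contains /* */.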

Cannot reproduce the problem. Where in the input HTML is the comment located? – alecxe

Sorry, I have edited my original post to include more code. –

Answers

3

I solved my problem. I used regular expressions to strip them out; for anyone who is curious, here is my new code:

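    from bs4 import BeautifulSoup, Comment
    import re

    # Read the whole HTML document into one string
    with open('output.txt') as f:
        input_text = f.read()

    # Strip the CSS comment delimiters /* and */ (their contents are kept).
    # The asterisk must be escaped; an unescaped '/*' pattern would match
    # runs of slashes instead of the literal opening delimiter.
    text = re.sub(r'/\*', '', input_text)
    text = re.sub(r'\*/', '', text)

    soup = BeautifulSoup(text)

    # Remove all HTML comments
    comments = soup.find_all(string=lambda node: isinstance(node, Comment))
    for comment in comments:
        comment.extract()

    print(soup)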
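
As an alternative to stripping only the CSS delimiters, the HTML comments themselves can be removed from the raw text before it is parsed, which also catches a comment sitting inside <style>. This is just a rough sketch using the standard re module; the helper name strip_html_comments and the sample string are invented for illustration, and a regular expression is not a real HTML parser, so unusual markup (for example a literal --> inside a script string) may not be handled correctly.

```python
import re

# Non-greedy match of one HTML comment; DOTALL lets it span several lines.
HTML_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)


def strip_html_comments(html):
    """Hypothetical helper: drop every <!-- ... --> span, wherever it appears."""
    return HTML_COMMENT.sub('', html)


sample = (
    '<head><style><!-- /* Font Definitions */ p {margin:0in;} --></style>'
    '<!--[if gte mso 9]><xml></xml><![endif]--></head><body><p>Hi</p></body>'
)

cleaned = strip_html_comments(sample)
print(cleaned)
```

The cleaned string can then be handed to BeautifulSoup as usual. Note that this removes the whole comment, CSS rules included, which matches the goal stated in the question but would not suit a document whose styles need to be kept.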