2012-02-02

What is going on with this html5lib script?

I'm trying to process a very simple HTML5 document with html5lib:

import html5lib 

html = '''<!DOCTYPE html> 
<html lang="en"> 
    <head> 
     <title>Hi</title> 
    </head> 
    <body> 
     <script src="a.js"></script> 
     <script src="b.js"></script> 
    </body> 
</html> 
''' 

parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"))
walker = html5lib.treewalkers.getTreeWalker("lxml")
serializer = html5lib.serializer.HTMLSerializer()

document = parser.parse(html) 
stream = walker(document) 
theHTML = serializer.render(stream) 

print(theHTML)

The output looks like this:

<!DOCTYPE html><html lang=en><head> 
     <title>Hi</title> 
    </head> 
    <body> 
     <script src=a.js></script> 
     <script src=b.js></script> 

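Yeah. It just cuts off partway through. Changing the tree builder from lxml to dom makes no difference. Tweaking the HTML changes the output, but it still comes out badly mangled.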

Answers


So the key seems to be omit_optional_tags=False. Somehow, without it, the end of the output gets eaten.

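import html5lib
from html5lib import serializer, treebuilders, treewalkers

# `html` is the document string from the question.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
document = parser.parse(html)
walker = treewalkers.getTreeWalker("lxml")
stream = walker(document)
# Keep optional tags such as </body> and </html> in the output.
s = serializer.HTMLSerializer(omit_optional_tags=False)
output_generator = s.serialize(stream)
# Each item is one serialized token, so every token prints on its own line.
for item in output_generator:
    print(item)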


<!DOCTYPE html> 
<html lang=en> 
<head> 


<title> 
Hi 
</title> 


</head> 


<body> 


<script src=a.js> 
</script> 


<script src=b.js> 
</script> 




</body> 
</html> 
>>> 
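
A likely explanation, offered as a hedged sketch rather than a confirmed diagnosis: the HTML syntax lets some tags be left out, and html5lib's serializer drops them by default, so the missing `</body>` and `</html>` may be intentional rather than truncation. The snippet below compares both settings and then parses the shorter output again to see whether the two scripts survive. It assumes a recent html5lib 1.x and swaps in the standard-library `etree` tree builder so that lxml is not needed. `render` and `script_sources` are hypothetical helper names, not part of html5lib.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

HTML = """<!DOCTYPE html>
<html lang="en">
    <head>
     <title>Hi</title>
    </head>
    <body>
     <script src="a.js"></script>
     <script src="b.js"></script>
    </body>
</html>
"""


def render(source, omit_optional_tags):
    """Parse `source` and serialize the resulting tree back to a string."""
    # Assumption: the etree builder is used here instead of lxml.
    document = html5lib.parse(source, treebuilder="etree")
    walker = html5lib.getTreeWalker("etree")
    serializer = HTMLSerializer(omit_optional_tags=omit_optional_tags)
    return serializer.render(walker(document))


def script_sources(source):
    """Return the src attribute of every <script> found after parsing."""
    root = html5lib.parse(
        source, treebuilder="etree", namespaceHTMLElements=False
    )
    return [script.get("src") for script in root.iter("script")]


short = render(HTML, omit_optional_tags=True)   # same as the default
full = render(HTML, omit_optional_tags=False)

# Compare how each variant ends.
print(repr(short[-40:]))
print(repr(full[-40:]))

# Parse the shorter output again and list the scripts it still contains.
print(script_sources(short))
```

If the scripts are still listed after the second parse, the shorter output is a compact but equivalent document. Passing `omit_optional_tags=False` is then mostly a readability choice.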
@schwa: Please edit my answer and add a proper explanation. – RanRag 2012-02-02 05:55:53

I can't reproduce this with your code. 's' isn't even defined in it. Could you edit your answer so the code runs without errors? – schwa 2012-02-02 06:05:59

@schwa See my edited code. – RanRag 2012-02-02 06:21:35