2014-03-01 86 views

Parsing HTML and JS in Python using lxml

I am having trouble parsing JS with lxml in Python. When I run the code below, my output is:

<Element div at 0x10cec4e10>

import urllib2

import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # strip JavaScript

text = urllib2.urlopen("URL").read().decode("utf-8")
test = lxml.html.fromstring(cleaner.clean_html(text))
print test

What I want to get is the parsed text without the JS. Can someone shed some light on this? Thanks.

Answer

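import urllib2

# "import lxml" alone does not load these submodules, so import them explicitly
import lxml.etree
import lxml.html
import lxml.html.clean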

URL = "http://www.google.com/" 
ENCODING = "latin1" 

args = {
    "javascript": True,       # strip JavaScript
    "page_structure": False,  # leave the page structure alone
    "style": True,            # remove CSS styling
}
cleaner = lxml.html.clean.Cleaner(**args) 

# get the page source 
html = urllib2.urlopen(URL).read().decode(ENCODING) 
# clean it up 
clean = cleaner.clean_html(html) 

# print unformatted html dump 
print(clean) 

# print properly indented html 
tree = lxml.html.fromstring(clean) 
print(lxml.etree.tostring(tree, pretty_print=True)) 
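
If the goal is the plain text of the page rather than cleaned HTML, keep in mind that lxml.html.fromstring() returns an Element object, so printing it only shows its repr (the <Element div at 0x...> line from the question). Below is a minimal sketch of one way to get the text. It swaps Cleaner for lxml.etree.strip_elements() plus text_content(), which is a different technique from the code above. The SAMPLE page and the visible_text helper are made up for illustration.

```python
import lxml.etree
import lxml.html

# Hypothetical sample page, inlined so the sketch runs without network access.
SAMPLE = """
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <div>
      <p>Hello, <b>world</b>!</p>
      <script>alert("hidden");</script>
    </div>
  </body>
</html>
"""


def visible_text(html):
    """Return the text of the page with <script> and <style> contents removed."""
    tree = lxml.html.fromstring(html)
    # Drop the elements together with their contents; with_tail=False keeps
    # any text that follows a removed element.
    lxml.etree.strip_elements(tree, "script", "style", with_tail=False)
    # text_content() concatenates every remaining text node in the tree.
    text = tree.text_content()
    # Collapse the whitespace left behind by the markup.
    return " ".join(text.split())


print(visible_text(SAMPLE))
```

The same helper should work on the decoded page source from urlopen() in place of SAMPLE.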

Note that pretty printing works properly with lxml.etree.tostring() but not with lxml.html.tostring(), which adds line breaks but does not indent. Go figure.
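
To check the difference yourself, a small comparison along these lines should do. The SNIPPET string is a made-up input, and the exact output of lxml.html.tostring() may vary with the lxml and libxml2 versions installed, so treat this as a sketch rather than a guaranteed result.

```python
import lxml.etree
import lxml.html

# Hypothetical input, small enough to compare the two serializers by eye.
SNIPPET = "<div><p>one</p><p>two</p></div>"
tree = lxml.html.fromstring(SNIPPET)

# Both calls return a byte string by default, so decode before printing.
via_etree = lxml.etree.tostring(tree, pretty_print=True).decode("ascii")
via_html = lxml.html.tostring(tree, pretty_print=True).decode("ascii")

print("lxml.etree.tostring:")
print(via_etree)
print("lxml.html.tostring:")
print(via_html)
```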