Here is a solution using js2xml:
>>> import js2xml
>>> import pprint
>>> jscode = r"""
... var prefix = 'mailto:';
... var suffix = '';
... var attribs = '';
... var path = 'hr' + 'ef' + '=';
... var addy59933 = 'HR-Cologne' + '@';
... addy59933 = addy59933 + 'scor' + '.' + 'com';
... var addy_text59933 = 'Submit your application';
... document.write('<a ' + path + '\'' + prefix + addy59933 + suffix + '\'' + attribs + '>');
... document.write(addy_text59933);
... document.write('<\/a>');
... """
>>> js = js2xml.parse(jscode)
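If you want to see what the parsed tree looks like before querying it, you can pretty-print it with lxml (the exact element layout depends on the js2xml version, so the output is omitted here):
>>> # optional: dump the XML tree js2xml built, to see the node names
>>> # used by the XPath expressions below (output omitted)
>>> import lxml.etree
>>> print lxml.etree.tostring(js, pretty_print=True)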
Variable declarations are represented by var_decl elements; their names are identifier nodes, and here their values are string literals concatenated with the + operator, so we can build a dict out of them using "".join() on the string/text() elements:
>>> # variables
... variables = dict([(var.xpath('string(./identifier)'), u"".join(var.xpath('.//string/text()')))
... for var in js.xpath('.//var_decl')])
>>> pprint.pprint(variables)
{'addy59933': u'HR-Cologne@',
'addy_text59933': u'Submit your application',
'attribs': u'',
'path': u'href=',
'prefix': u'mailto:',
'suffix': u''}
Then assignments change the values of some variables, mixing string literals and other variables. For each assignment, concatenate a %(identifiername)s placeholder for variable identifiers and the literal value for strings:
>>> # identifiers are assigned other string values
... assigns = {}
>>> for assign in js.xpath('.//assign'):
... value = u"".join(['%%(%s)s' % el.text if el.tag=='identifier' else el.text
... for el in assign.xpath('./right//*[self::string or self::identifier]')])
... key = assign.xpath('string(left/identifier)')
... assigns[key] = value
...
>>> pprint.pprint(assigns)
{'addy59933': u'%(addy59933)sscor.com'}
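The %(addy59933)s placeholder above can later be resolved with Python's %-formatting against a dict of values; a minimal illustration using the value from the variables dict:
>>> # how a '%(name)s' placeholder gets filled in from a dict
>>> u'%(addy59933)sscor.com' % {'addy59933': u'HR-Cologne@'}
u'HR-Cologne@scor.com'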
Update the variables dict by "applying" these assignments:
>>> # update variables dict with new values
... for key, val in assigns.items():
... variables[key] = val % variables
...
>>> pprint.pprint(variables)
{'addy59933': u'HR-Cologne@scor.com',
'addy_text59933': u'Submit your application',
'attribs': u'',
'path': u'href=',
'prefix': u'mailto:',
'suffix': u''}
>>>
The arguments of function calls are found under arguments nodes (XPath .//arguments/*):
>>> # interpret arguments of document.write()
... arguments = [u"".join(['%%(%s)s' % el.text if el.tag=='identifier' else el.text
... for el in arg.xpath('./descendant-or-self::*[self::string or self::identifier]')])
... for arg in js.xpath('.//arguments/*')]
>>>
>>> pprint.pprint(arguments)
[u"<a %(path)s'%(prefix)s%(addy59933)s%(suffix)s'%(attribs)s>",
u'%(addy_text59933)s',
u'</a>']
>>>
If you substitute the identifiers in there, you get:
>>> # apply string formatting replacing identifiers
... arguments = [arg % variables for arg in arguments]
>>>
>>> pprint.pprint(arguments)
[u"<a href='mailto:HR-Cologne@scor.com'>",
u'Submit your application',
u'</a>']
>>>
Now it looks like it's up to us to run the result through lxml.html to get rid of any funny numeric character references:
>>> import lxml.html
>>> import lxml.etree
>>>
>>> doc = lxml.html.fromstring("".join(arguments))
>>> print lxml.etree.tostring(doc)
<a href="mailto:[email protected]">Submit your application</a>
>>>
Or, using Scrapy's Selector:
>>> from scrapy.selector import Selector
>>> selector = Selector(text="".join(arguments), type="html")
>>> selector.xpath('.//a/@href').extract()
[u'mailto:HR-Cologne@scor.com']
>>>
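Putting it together inside a Scrapy spider is then mostly a matter of feeding the <script> text from the response through the same steps. Here is a rough, untested sketch; the spider name, start URL and the //script XPath are placeholders, not taken from the original question:

import js2xml
import scrapy
from scrapy.selector import Selector


class MailtoSpider(scrapy.Spider):
    # placeholder spider name and start URL; adapt to the real careers page
    name = "mailto_demo"
    start_urls = ["http://example.com/jobs"]

    def parse(self, response):
        # grab the obfuscation script; this XPath is a placeholder too
        jscode = response.xpath('//script[contains(., "addy")]/text()').extract()[0]
        js = js2xml.parse(jscode)

        # same steps as the console session above, collapsed a little
        variables = dict([(var.xpath('string(./identifier)'),
                           u"".join(var.xpath('.//string/text()')))
                          for var in js.xpath('.//var_decl')])
        for assign in js.xpath('.//assign'):
            value = u"".join(['%%(%s)s' % el.text if el.tag == 'identifier' else el.text
                              for el in assign.xpath('./right//*[self::string or self::identifier]')])
            variables[assign.xpath('string(left/identifier)')] = value % variables
        arguments = [u"".join(['%%(%s)s' % el.text if el.tag == 'identifier' else el.text
                               for el in arg.xpath('./descendant-or-self::*[self::string or self::identifier]')]) % variables
                     for arg in js.xpath('.//arguments/*')]

        # rebuild the generated HTML and pull out the mailto: link
        selector = Selector(text=u"".join(arguments), type="html")
        yield {"mailto": selector.xpath('.//a/@href').extract()[0]}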
Scrapy can't handle this case on its own. You need to run this JS code to generate the link. One option is to use a tool based on an actual browser, such as [selenium](http://selenium-python.readthedocs.org/en/latest/). You could launch it from the spider, grab the link, and then quit the browser. I'm pretty sure, though, that this would slow things down. – alecxe
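A rough sketch of that selenium-based alternative (the URL below is a placeholder, not the page from the question; any webdriver such as Firefox, Chrome or PhantomJS will do):

from selenium import webdriver

url = "http://example.com/jobs"  # placeholder for the real careers page

driver = webdriver.Firefox()
try:
    driver.get(url)
    # the browser has already executed the obfuscation script,
    # so the generated <a> element is present in the DOM
    link = driver.find_element_by_xpath('//a[starts-with(@href, "mailto:")]')
    print link.get_attribute("href")  # e.g. mailto:HR-Cologne@scor.com
finally:
    driver.quit()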
Possible duplicate of [Tried Python BeautifulSoup and Phantom JS: STILL can't scrape websites](http://stackoverflow.com/questions/22028775/tried-python-beautifulsoup-and-phantom-js-still-cant-scrape-websites) –