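Parsing links in JavaScript-generated HTML?
In the past, parsing web pages with BeautifulSoup and lxml was easy because links looked like this: <a href="www.website.com">Website</a>
. However, I have run into some pages where links show up in the browser but not in the page source.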
For example, on this Edmunds.com page, the Past Long-Term Road Tests
section looks like this:
1991 Acura NSX
2011 Acura TSX Sport Wagon
...
However, the source code for the page's Past Long-Term Road Tests
section looks like this:
<script type="text/javascript">
PAGESETUP.addControl(function() {
function linksObj(){
var elink = "|acura|nsx|1991|long-term-road-test|"; //generates edmunds.com/acura/nsx/1991/long-term-road-test/
this.link0 = {anchor:elink,label:"1991 Acura NSX"};
var elink = "|acura|tsx-sport-wagon|2011|long-term-road-test|"; //generates edmunds.com/acura/tsx-sport-wagon/2011/long-term-road-test/
this.link1 = {anchor:elink,label:"2011 Acura TSX Sport Wagon"};
...
}
var links_obj = new linksObj();
var links_container = document.getElementById('links_list_offpage2');
var more_link = "";
var more_link_text = "";
var elinks = new EDMUNDS.linksList(links_obj, links_container,more_link, more_link_text);
}, 'low');
</script>
The JavaScript line var elink = "|acura|nsx|1991|long-term-road-test|";
is expanded in the browser to edmunds.com/acura/nsx/1991/long-term-road-test
.
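As a rough illustration of that expansion, here is a minimal Python sketch. It assumes the rule is simply "replace each `|` with `/` and prepend the host", which is inferred from the comments in the page source rather than from the actual EDMUNDS code. The function name `anchor_to_url` and the `host` parameter are hypothetical.

```python
def anchor_to_url(anchor, host="edmunds.com"):
    """Convert a pipe-delimited anchor into a URL-like path.

    Assumption: the site only swaps '|' for '/' and prepends the host.
    The real EDMUNDS.linksList implementation may do more than this.
    """
    return host + anchor.replace("|", "/")


print(anchor_to_url("|acura|nsx|1991|long-term-road-test|"))
```

The result keeps the trailing slash, which matches the form shown in the source comments.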
Tools like BeautifulSoup and lxml do not find links that are generated in JavaScript. How can I parse these links?
Replicate the 'EDMUNDS.linkList' function, I guess – 2013-02-15 05:56:07
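One possible way to follow that suggestion without running any JavaScript is to pull the anchor/label pairs straight out of the script text with a regular expression. The sketch below assumes every entry follows the exact `var elink = "...";` then `label:"..."` pattern shown in the question. `SCRIPT`, `PAIR_RE`, and `extract_links` are hypothetical names, and on a real page the script text would first have to be taken from the relevant `<script>` tag (for example via BeautifulSoup's `find_all("script")`).

```python
import re

# Trimmed copy of the inline script from the question (two entries only).
SCRIPT = '''
var elink = "|acura|nsx|1991|long-term-road-test|"; //generates ...
this.link0 = {anchor:elink,label:"1991 Acura NSX"};
var elink = "|acura|tsx-sport-wagon|2011|long-term-road-test|"; //generates ...
this.link1 = {anchor:elink,label:"2011 Acura TSX Sport Wagon"};
'''

# Capture each elink value together with the first label that follows it.
# Assumption: anchors and labels never contain a double quote.
PAIR_RE = re.compile(
    r'var\s+elink\s*=\s*"([^"]*)";.*?label\s*:\s*"([^"]*)"',
    re.DOTALL,
)


def extract_links(script_text):
    """Return (label, anchor) pairs in the order they appear in the script."""
    return [(label, anchor) for anchor, label in PAIR_RE.findall(script_text)]


for label, anchor in extract_links(SCRIPT):
    print(label, anchor)
```

Each anchor can then be turned into a path by replacing `|` with `/`, as the comments in the page source indicate. This approach is brittle: if the site changes the script layout, the pattern stops matching, so a headless browser that actually executes the JavaScript may be the more robust choice.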