我想用一些Python網絡爬蟲從網站上下載大約3000個PDF文件。但是,這些PDF的URL由JavaScript函數生成。所以,我想知道是否有任何教程如何實現這一目標?用於JavaScript的Python網絡爬蟲生成的URL
例如,鏈接到Alberto European Hairspray (Aerosol) - All Variants
的URL將在單擊onclick="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','DocumentCenter.aspx?did={0}$0'
後生成。 所以問題是如何讓網絡爬蟲獲得計算出的URL。
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
<tbody>
<tr>
<td>
<input type="image" src="App_Graphics/PDFDocument.gif" alt="MSDS" onclick="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','DocumentCenter.aspx?did={0}$0');return false;" />
</td>
<td><a href="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','MSDSDetail.aspx?did={0}$0')">Alberto European Hairspray (Aerosol) - All Variants</a>
</td>
<td>Unilever PLC</td>
<td>8131-01</td>
</tr>
<tr class="row-alternate">
<td>
<input type="image" src="App_Graphics/PDFDocument.gif" alt="MSDS" onclick="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','DocumentCenter.aspx?did={0}$1');return false;" />
</td>
<td><a href="javascript:__doPostBack('ctl00$placeBody$gridView$gridView','MSDSDetail.aspx?did={0}$1')">Alberto European Mousse (Aerosol) - All Variants</a>
</td>
<td>Unilever PLC</td>
<td>8132-01</td>
</tr>
</tbody>
感謝您指點我這些來源! –
@ tao.hong不客氣。我有同樣的問題,我知道這有點令人失望:P – cdonts