2016-07-24

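Scraping Excel files from a website with Python when the URL is hidden behind __doPostBack links

For the past few days I have been trying to scrape the following website (link pasted below), which has several Excel and PDF files in a table. I was able to do this successfully for the home page. There are 59 pages in total from which these Excel/PDF files have to be scraped. On most of the websites I have seen so far, the site URL contains a query parameter that changes as you move from one page to another. In this case there is a __doPostBack function, which is probably why the URL stays the same on every page. I looked at several solutions and posts that suggested inspecting the parameters of the POST call and reusing them, but I cannot make sense of the parameters sent in the POST call (this is the first time I am scraping a website).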

Can someone please suggest some resources that would help me write code to move from one page to the next using Python? The details are as follows:

Website link - http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx

My current code, which extracts the CAP Excel sheets from the home page (this works perfectly and is provided for reference only):

import re
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup

Base = "http://accord.fairfactories.org/ffcweb/Web"
html = urlopen("http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx")
# Name the parser explicitly so the result does not depend on what is installed.
bs = BeautifulSoup(html, "html.parser")
i = 1
# Match anchors whose id contains "CAP" not followed by a word character.
for link in bs.findAll("a", {"id": re.compile(r"CAP(?!\w)")}):
    if 'href' in link.attrs:
        name = str(i) + ".xlsx"
        a = link.attrs['href']
        # Note: strip("..") removes every leading and trailing "." character.
        b = a.strip("..")
        c = Base + b
        urlretrieve(c, name)
        i = i + 1

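Please let me know if I have missed anything in the information provided, and please don't downvote me, otherwise I won't be able to ask any further questions.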

Answer

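For ASPX sites, you need to look for fields such as __EVENTTARGET, __EVENTVALIDATION, etc., and post those parameters with each request. The following gets all the pages using requests and bs4: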

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


# Every key except __EVENTTARGET is refreshed from each response;
# __EVENTTARGET stays the same.
data = {
    "__EVENTTARGET": "gvFlex",
    "__VIEWSTATE": "",
    "__VIEWSTATEGENERATOR": "",
    "__VIEWSTATEENCRYPTED": "",
    "__EVENTVALIDATION": ""}


def validate(soup, data):
    for k in data:
        # Update the post values in data from the page's hidden inputs.
        # __EVENTARGUMENT is skipped because it is set explicitly per page.
        if k not in ("__EVENTTARGET", "__EVENTARGUMENT"):
            data[k] = soup.select_one("#{}".format(k))["value"]


def get_all_excel():
    base = "http://accord.fairfactories.org/ffcweb/Web"
    url = "http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx"
    with requests.Session() as s:
        # Add a user agent for each subsequent request.
        s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
        r = s.get(url)
        bs = BeautifulSoup(r.content, "lxml")
        # Get links from the initial page.
        for xcl in bs.select('a[id*="CAP"]'):
            yield urljoin(base, xcl["href"])
        # The post data in our dict must be re-validated for each request.
        validate(bs, data)
        # The attribute value is quoted because "$" is not valid in a bare CSS identifier.
        last = bs.select_one('a[href*="Page$Last"]')
        i = 2
        # Keep going until the "last page" button is no longer visible.
        while last:
            # Set the postback argument to the next page number.
            data["__EVENTARGUMENT"] = "Page${}".format(i)
            r = s.post(url, data=data)
            bs = BeautifulSoup(r.content, "lxml")
            for xcl in bs.select('a[id*="CAP"]'):
                yield urljoin(base, xcl["href"])
            last = bs.select_one('a[href*="Page$Last"]')
            # Re-validate again for the next request.
            validate(bs, data)
            i += 1


for x in get_all_excel():
    print(x)
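
To make the postback parameters less mysterious, here is a minimal offline sketch of where they come from. It uses only the standard library and a made-up HTML fragment; `SAMPLE_HTML`, `PostbackParser`, and `build_payload` are hypothetical names for illustration, and the assumption that the pager links look like `javascript:__doPostBack('gvFlex','Page$2')` is inferred from the values used above, not checked against the live page.

```python
import re
from html.parser import HTMLParser

# Made-up fragment in the shape ASP.NET WebForms pages usually have (assumption).
SAMPLE_HTML = """
<form method="post" action="./InspectionReportsEnglish.aspx">
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dummy-viewstate" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="dummy-validation" />
  <a href="javascript:__doPostBack('gvFlex','Page$2')">2</a>
  <a href="javascript:__doPostBack('gvFlex','Page$Last')">Last</a>
</form>
"""

# Captures the (target, argument) pair inside a __doPostBack(...) call.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


class PostbackParser(HTMLParser):
    """Collect hidden <input> values and __doPostBack(target, argument) pairs."""

    def __init__(self):
        super().__init__()
        self.hidden = {}
        self.postbacks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""
        elif tag == "a":
            match = POSTBACK_RE.search(attrs.get("href") or "")
            if match:
                self.postbacks.append(match.groups())


def build_payload(html, target, argument):
    """Return the form data a postback for (target, argument) would send."""
    parser = PostbackParser()
    parser.feed(html)
    payload = dict(parser.hidden)
    payload["__EVENTTARGET"] = target
    payload["__EVENTARGUMENT"] = argument
    return payload, parser.postbacks


payload, postbacks = build_payload(SAMPLE_HTML, "gvFlex", "Page$2")
print(postbacks)
print(payload["__EVENTARGUMENT"])
```

Clicking a pager link in the browser fills `__EVENTTARGET` and `__EVENTARGUMENT` and submits the form together with the hidden state fields, which is why posting the same fields by hand moves to the next page while the URL stays unchanged.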

Running get_all_excel() over the first three pages shows that we get the data you want:

http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9965 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9552 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10650 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11969 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10086 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10905 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10840 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9229 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11310 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9178 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9614 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9734 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10063 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10871 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9468 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9799 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9278 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12252 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9342 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9966 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11595 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9652 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10271 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10365 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10087 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9967 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11740 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12375 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11643 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10952 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12013 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9810 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10953 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10038 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9664 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12256 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9262 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9210 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9968 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9811 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11610 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9455 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11899 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10273 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9766 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9969 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10088 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10366 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9393 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9813 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11795 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9814 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11273 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12187 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10954 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9556 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11709 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9676 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10251 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10602 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10089 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9908 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10358 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9469 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11333 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9238 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9816 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9817 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10736 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10622 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9394 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9818 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10592 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9395 
http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11271 
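
Since the goal in the question is to save the files rather than only list their URLs, the following is one possible sketch of that last step. The helper names (`filename_for`, `fetch_bytes`, `save_all`), the `CAP_<id>.xlsx` naming scheme, and the `.xlsx` extension are assumptions for illustration; the site may serve other file types.

```python
import os
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen


def filename_for(url, default_ext=".xlsx"):
    """Derive a local file name from the `id` query parameter (assumed scheme)."""
    query = parse_qs(urlparse(url).query)
    file_id = query.get("id", ["unknown"])[0]
    return "CAP_{}{}".format(file_id, default_ext)


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def save_all(urls, out_dir, fetch=fetch_bytes):
    """Save every URL into out_dir and return the list of written paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for url in urls:
        path = os.path.join(out_dir, filename_for(url))
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        written.append(path)
    return written
```

With the generator above, a call such as `save_all(get_all_excel(), "reports")` would write one file per link into a `reports` directory. The `fetch` parameter is only there so the download step can be swapped out, for example in tests.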
Thanks a ton, Padraic. You are a star :)

@ujjwaldalmia, no worries, you're welcome.

Dear Padraic, when I try to execute the code, I get the following error.