2016-08-18

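Python: navigating with mechanize when the page uses the __doPostBack function

How can I use mechanize to page through a table on a web page when the table's pager uses the __doPostBack function?

My code is: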

import mechanize 
br = mechanize.Browser() 
br.set_handle_robots(False) 
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
br.open("http://www.gfsc.gg/The-Commission/Pages/Regulated-Entities.aspx?auto_click=1") 

page_num = 2 
for link in br.links():
    if link.text == str(page_num):
        br.open(link)  # I suspect this is not correct
        break

for link in br.links():
    print('%s %s' % (link.text, link.url))

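Listing all the controls on the form (for example, the drop-down menus) does not show the page buttons, but listing all the links on the page does. A page button does not contain a URL, so it is not a typical link. I get "TypeError: expected string or buffer".

I feel this is something that can be done with mechanize.

Thanks for reading.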

Answer

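mechanize can be used to navigate a table that uses __doPostBack. The idea is to fill in the hidden __EVENTTARGET and __EVENTARGUMENT form fields that __doPostBack would normally set, then submit the form. I used BeautifulSoup to parse the HTML for the required parameters, and followed the helpful guidance with the regex. My code is below.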

import mechanize 
import re # write a regex to get the parameters expected by __doPostBack 
from bs4 import BeautifulSoup 
from time import sleep 

br = mechanize.Browser() 
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
response = br.open("http://www.gfsc.gg/The-Commission/Pages/Regulated-Entities.aspx?auto_click=1") 
# satisfy the __doPostBack function to navigate to different pages 
for pg in range(2, 5):
    br.select_form(nr=0)  # the only form on the page
    br.set_all_readonly(False)  # allow the hidden __doPostBack fields to be set

    # BeautifulSoup for parsing
    soup = BeautifulSoup(response, 'lxml')
    table = soup.find('table', {'class': 'RegulatedEntities'})
    records = table.find_all('tr', {'style': ["background-color:#E4E3E3;border-style:None;", "border-style:None;"]})

    # print the first company on the current page
    for rec in records[:1]:
        print('Company name: %s' % rec.a.string)

    # disable 'Search' and 'Clear filters' so they are not sent with the form
    for control in br.form.controls[:]:
        if control.type in ['submit', 'image', 'checkbox']:
            control.disabled = True

    # get parameters for the __doPostBack function from the pager link for page `pg`
    for link in soup("a"):
        if link.string == str(pg):
            match = re.search(r"""<a href="javascript:__doPostBack\('(.*?)','(.*?)'\)">""", str(link))
            br["__EVENTTARGET"] = match.group(1)
            br["__EVENTARGUMENT"] = match.group(2)
    sleep(1)  # pause between requests
    response = br.submit()
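
To try the parameter extraction on its own, without hitting the site, something like the sketch below may help. It only uses `re`, and it matches on the href value rather than on the whole `<a>` tag. The helper name `parse_postback` and the control name in `sample_href` are made up for illustration; they are not taken from the real page. It also assumes the href has already been unescaped by the HTML parser.

```python
import re

# Matches href values of the form javascript:__doPostBack('target','argument').
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback link."""
    match = POSTBACK_RE.search(href or "")
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical pager link; the real control name on the page will differ.
sample_href = "javascript:__doPostBack('ctl00$Main$GridView1','Page$2')"

print(parse_postback(sample_href))
print(parse_postback("/The-Commission/Pages/Home.aspx"))
```

The two values returned for a pager link are what would go into br["__EVENTTARGET"] and br["__EVENTARGUMENT"] before calling br.submit(). Returning None for ordinary links makes it easy to skip anything that is not a postback link.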