Python Scrapy FormRequest callback not happening

I'm writing a Python script with Scrapy to scrape a website that has a login page. I'm trying to fill in the form with Scrapy's FormRequest.from_response, but without success, and I don't know why; it looks like the callback declared in from_response is never called.

My spider's code is as follows:

from scrapy.http import Request, FormRequest
from scrapy.spiders import CrawlSpider


class user_scrape(CrawlSpider):
    name = "spyder"
    allowed_domains = ["domain.tld"]
    start_urls = [
        "http://domain.tld/page1",
        "http://domain.tld/page2",
    ]

    login_user = "username"
    login_pass = "secret"
    login_page = "http://domain.tld/login.php"

    def start_requests(self):
        # Fetch the login page first instead of the start_urls.
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        print "----- LOGIN -----"
        # Fill in and submit the login form found in the response.
        return FormRequest.from_response(
            response,
            formname='form_login',
            formdata={
                'username': self.login_user,
                'password': self.login_pass,
                'cookietime': 'on',
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        print response.url
        print response.body

        # After logging in, crawl the real start URLs.
        return [Request(url=url) for url in self.start_urls]

    def parse(self, response):
        print response.url

When I run it, it prints the "LOGIN" line and then seems to stop without ever entering check_login_response, from where the spider should continue.

The spider's log is as follows:

2016-01-21 16:34:23 [scrapy] INFO: Scrapy 1.0.4 started (bot: UsersScrape) 
2016-01-21 16:34:23 [scrapy] INFO: Optional features available: ssl, http11 
2016-01-21 16:34:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'UsersScrape.spiders', 'SPIDER_MODULES': ['UsersScrape.spiders'], 'RETRY_TIMES': 5, 'BOT_NAME': 'UsersScrape', 'RETRY_HTTP_CODES': [400, 408, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530], 'DOWNLOAD_DELAY': 1, 'USER_AGENT': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'} 
2016-01-21 16:34:24 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-01-21 16:34:24 [scrapy] INFO: Enabled downloader middlewares: RetryMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-01-21 16:34:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-01-21 16:34:24 [scrapy] INFO: Enabled item pipelines: 
2016-01-21 16:34:24 [scrapy] INFO: Spider opened 
2016-01-21 16:34:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-01-21 16:34:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-01-21 16:34:24 [scrapy] DEBUG: Crawled (200) <GET http://domain.tld/login.php?> (referer: None) 
----- LOGIN ----- 
2016-01-21 16:34:25 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld.com/> from <POST http://domain.tld/takelogin.php> 
2016-01-21 16:34:27 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld/> from <GET http://domain.tld/> 
2016-01-21 16:34:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://domain.tld/> 
2016-01-21 16:34:27 [scrapy] INFO: Closing spider (finished) 
2016-01-21 16:34:27 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1261, 
'downloader/request_count': 3, 
'downloader/request_method_count/GET': 2, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 3877, 
'downloader/response_count': 3, 
'downloader/response_status_count/200': 1, 
'downloader/response_status_count/302': 2, 
'dupefilter/filtered': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 1, 21, 15, 34, 27, 101000), 
'log_count/DEBUG': 5, 
'log_count/INFO': 7, 
'request_depth_max': 1, 
'response_received_count': 1, 
'scheduler/dequeued': 3, 
'scheduler/dequeued/memory': 3, 
'scheduler/enqueued': 3, 
'scheduler/enqueued/memory': 3, 
'start_time': datetime.datetime(2016, 1, 21, 15, 34, 24, 238000)} 
2016-01-21 16:34:27 [scrapy] INFO: Spider closed (finished) 

The HTML code of the form:

<form method="post" name="login_form" action="takelogin.php" onsubmit="return startLoginVerify();"> 
    <table id="login_form" border="0" cellpadding=5> 
    <tr> 
    <td colspan="2" align="right"> 
     <img style="cursor:pointer;" onClick="close_login_box();" src="pic/close.gif" align="right"> 
    </td> 
    </tr> 
    <tr> 
    <td class=rowhead style="padding-left:25px;">User:</td> 
    <td align=left style="padding-right:25px;"> 
     <input type="text" size=30 name="username" id="navbar_login_menu_input_to_focus_on" /> 
    </td> 
    </tr> 
    <tr> 
    <td class=rowhead>Password:</td> 
    <td align=left><input type="password" size=30 name="password" /></td> 
    </tr> 
    .... 
    </table> 
</form> 

I've checked the FormRequest documentation and I can't see any difference that would explain why mine doesn't work.
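
As a side note, a quick way to sanity-check the formname argument, assuming the login page can be fetched while logged out, is to open it in the Scrapy shell and inspect the forms there; the URL and credentials below are just the placeholders used above.

# Run from a terminal (placeholder URL from the question):
#   scrapy shell "http://domain.tld/login.php"
#
# Inside the shell, `response` holds the fetched login page.
from scrapy.http import FormRequest

# The name attribute of every form on the page, to verify the
# value passed as formname:
print response.xpath('//form/@name').extract()

# Build the request that from_response would send and inspect it;
# if formname matches no form, Scrapy falls back to selecting by
# formnumber (the first form by default) instead of raising an error.
req = FormRequest.from_response(
    response,
    formname='form_login',
    formdata={'username': 'username', 'password': 'secret'},
)
print req.url, req.method
print req.body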

Thank you for your time and help!

Answer

The log shows that the request is being filtered because you are hitting the same URL twice (making exactly the same request, to be precise).

Try setting dont_filter=True on the login request (the redirects that follow it inherit the flag, so the final GET is no longer dropped):

FormRequest.from_response(
    response,
    formname='form_login',
    formdata={
        'username': self.login_user,
        'password': self.login_pass,
        'cookietime': 'on',
    },
    callback=self.check_login_response,
    dont_filter=True,
)
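
For illustration, here is a minimal sketch of why the duplicate is dropped, assuming the default dupefilter (RFPDupeFilter) and the placeholder domain from the question: requests with the same method, URL and body share a fingerprint, and the scheduler discards the later one unless it is marked dont_filter=True.

# Minimal sketch (placeholder URL): the default dupefilter keys requests
# by fingerprint, so an identical second request is dropped unless it is
# marked dont_filter=True.
from scrapy import Request
from scrapy.utils.request import request_fingerprint

first = Request("http://domain.tld/")
second = Request("http://domain.tld/")  # same URL, method and body

print request_fingerprint(first) == request_fingerprint(second)  # True -> second one is filtered

never_filtered = Request("http://domain.tld/", dont_filter=True)
print never_filtered.dont_filter  # True -> the scheduler skips the dupefilter for it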

You're right, thank you!