Platform: Debian 8 + Python 3.4 + Scrapy 1.3.2. Here is my spider, which downloads some URLs from yahoo.com. Why aren't the error messages logged to the specified file?

import scrapy
import csv


class TestSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3']  # ... up to 'urls100' (about 100 URLs, elided)

    def parse(self, response):
        # save the response body as /tmp/<ticker>.csv, e.g. /tmp/GLU.csv
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)

When it runs, some error messages appear:

2017-02-19 21:28:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response 
<404 https://chart.yahoo.com/table.csv?s=GLU>: HTTP status code is not handled or not allowed 

https://chart.yahoo.com/table.csv?s=GLU is one of the start_urls.
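For context (this note is not part of the original post): that line comes from Scrapy's built-in HttpErrorMiddleware, which by default drops non-2xx responses before they reach parse(). One project-wide way to let such responses through is the HTTPERROR_ALLOWED_CODES setting; a minimal sketch:

# settings.py -- assumes the default HttpErrorMiddleware is enabled
HTTPERROR_ALLOWED_CODES = [404]   # pass 404 responses on to the spider callback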

Now I want to capture those error messages.

import scrapy
import csv

import logging
from scrapy.utils.log import configure_logging

# Keep Scrapy from installing its own root log handler, then route all log
# records to /tmp/log.txt through the standard logging module
configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='/tmp/log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)


class TestSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3']  # ... up to 'url100' (elided)

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)

Why isn't an error message such as

2017-02-19 21:28:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://chart.yahoo.com/table.csv?s=GLU>: HTTP status code is not handled or not allowed

logged to /home/log.txt?
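As a side note (this is not part of the original question): instead of wiring up the logging module by hand, Scrapy's own log output can be sent to a file with the built-in LOG_FILE setting, e.g. in the project's settings.py or via scrapy crawl quote -s LOG_FILE=/tmp/log.txt. A minimal settings.py sketch:

# settings.py -- built-in Scrapy settings; /tmp/log.txt is just an example path
LOG_FILE = '/tmp/log.txt'
LOG_LEVEL = 'INFO'   # the "Ignoring response" messages are logged at INFO level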

Following eLRuLL's suggestion, I added handle_httpstatus_list = [404]:

import scrapy
import csv

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='/home/log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)


class TestSpider(scrapy.Spider):
    handle_httpstatus_list = [404]  # let 404 responses reach the callback
    name = "quote"
    allowed_domains = ["yahoo.com"]
    start_urls = ['url1', 'url2', 'url3']  # ... up to 'url100' (elided)

    def parse(self, response):
        filename = response.url.split("=")[1]
        open('/tmp/' + filename + '.csv', 'wb').write(response.body)

The error messages still aren't logged to /home/log.txt. Why?

Answer

Use the handle_httpstatus_list attribute on your spider so that 404 responses are handled instead of being ignored:

class TestSpider(scrapy.Spider): 
    handle_httpstatus_list = [404] 
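With that attribute set, 404 responses are no longer filtered out by HttpErrorMiddleware and are passed to parse(), where they can be logged explicitly. A minimal sketch (not from the original answer) using the spider's built-in logger:

import scrapy


class TestSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["yahoo.com"]
    handle_httpstatus_list = [404]         # let 404 responses reach parse()
    start_urls = ['url1', 'url2', 'url3']  # trimmed; the question lists ~100 URLs

    def parse(self, response):
        if response.status == 404:
            # self.logger goes through the standard logging module, so the
            # message ends up wherever logging (or LOG_FILE) is configured to write
            self.logger.info("Got 404 for %s", response.url)
            return
        filename = response.url.split("=")[1]
        with open('/tmp/' + filename + '.csv', 'wb') as f:
            f.write(response.body)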