Scrapy crawling expired domains

I am using Scrapy to crawl different websites. What my script actually does is follow each site, add the domain names it finds to a database, and a separate PHP script then checks which of those domains have expired.

I hope someone can help me improve my script, because the current one is not optimized for what I need!

I don't know why, but the crawler jumps between different websites right away once it has the start URLs; it would be better if the script finished scanning the first website before moving on to the others.
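
One possible way to keep the crawl on the current site for longer is to give same-domain links a higher scheduler priority than off-site links. This is only a sketch, not from the original post: the spider below drops the CrawlSpider rule and schedules requests by hand, and the class name and priority values are arbitrary.

# Sketch only: bias Scrapy's scheduler so links on the current domain are
# fetched before off-site links. Higher priority values are dequeued first.
from urlparse import urlparse

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class PrioritySpider(scrapy.Spider):
    name = 'expired_prioritized'
    start_urls = ['http://domain.com']

    def parse(self, response):
        current_domain = urlparse(response.url).netloc
        for link in LxmlLinkExtractor(allow=()).extract_links(response):
            same_site = urlparse(link.url).netloc == current_domain
            # Same-domain links get priority 1, everything else 0, so the
            # current site tends to be finished before the crawl moves on.
            yield scrapy.Request(link.url, callback=self.parse,
                                 priority=1 if same_site else 0)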

Also, how can I check directly whether a domain has expired before adding it to the database?
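
One cheap pre-check would be to test whether a domain still resolves in DNS before handing it to insert_table: an expired or unregistered domain usually fails to resolve, so this is only an approximation, but it can run inside the spider. A minimal sketch (the helper name domain_resolves is made up here):

# Sketch only: quick DNS pre-check before deciding what to insert.
# An expired/unregistered domain usually raises socket.gaierror here;
# a registered one resolves normally.
import socket
from urlparse import urlparse

def domain_resolves(url):
    domain = urlparse(url).netloc
    try:
        socket.gethostbyname(domain)
        return True
    except socket.gaierror:
        return False

The crawler below could then call insert_table(url) only when domain_resolves(url) is False (to keep expired candidates) or only when it is True (to keep live domains), depending on what the PHP script expects.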

My crawler:

from scrapy.spiders import CrawlSpider, Rule
from dirbot.settings import *
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field
from urlparse import urlparse

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = 'expired'
    start_urls = ['http://domain.com']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        # Keep links on the interesting TLDs, skip well-known hosts, and store
        # each linked domain (scheme + netloc only) in the database.
        for link in LxmlLinkExtractor(allow='/.com|.fr|.net|.org|.info/i', deny='/.jp|facebook|amazon|wordpress|blogspot|free.|google|yahoo|bing|znet|stackexchange|twitter|wikipedia/i').extract_links(response):
            parsed_uri = urlparse(link.url)
            url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            insert_table(url)

Answer


In your code you can check the response status code like this:

class someSpider(CrawlSpider):
    name = 'expired'
    start_urls = ['http://domain.com']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        if response.status == 404:
            # Do if not available
            pass
        elif response.status == 200:
            # Do if OK: the requested domain answered, so record it and go on
            # extracting its outgoing links.
            parsed_uri = urlparse(response.url)
            url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            insert_table(url)
            for link in LxmlLinkExtractor(allow='/.com|.fr|.net|.org|.info/i', deny='/.jp|facebook|amazon|wordpress|blogspot|free.|google|yahoo|bing|znet|stackexchange|twitter|wikipedia/i').extract_links(response):
                parsed_uri = urlparse(link.url)
                url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        elif response.status == 500:
            # Do if server crash
            pass

I placed the link-parsing code so that it only runs when the initial request to the site gives you an HTTP 200 OK response code.
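
Note that by default Scrapy's HttpError middleware only passes 2xx responses on to spider callbacks, so the 404 and 500 branches above will normally never be reached; the spider has to opt in to those status codes, for example:

from scrapy.spiders import CrawlSpider

class someSpider(CrawlSpider):
    name = 'expired'
    # Let non-2xx responses (404, 500) reach parse_obj instead of being
    # filtered out by Scrapy's HttpError middleware.
    handle_httpstatus_list = [404, 500]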

I hope it helps.