Scrapy: avoiding circular re-crawling

I am building a scraper for attractions near hotels on TripAdvisor. The scraper will parse URLs like this: http://www.tripadvisor.com/AttractionsNear-g55711-d1218038-oa30-Dallas_Addison_Marriott_Quorum_By_the_Galleria-Dallas_Texas.html

I wrote two rules to get these URLs. The second one targets the next attractions page of the target URL:

Rule(SgmlLinkExtractor(allow=(".*AttractionsNear-g.*",),
                       restrict_xpaths=('.//div[@class="nearby_links wrap"]/a',),
                       unique=True),
     callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(allow=(".*AttractionsNear-g.*",),
                       restrict_xpaths=('.//div[@class="pgLinks"]/a[contains(@class, "pageNext")]',),
                       unique=True),
     callback='parse_item', follow=True),

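But the first rule also matches on my target URLs, so the scraper re-crawls URLs it has already parsed and starts the whole process over from the beginning.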
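One way to see why both rules can pick up the same pages is to test the shared allow pattern on its own. The sketch below uses only Python's re module and assumes the pattern is applied as a regular-expression search against the absolute URL. FIRST_PAGE_URL and OTHER_URL are hypothetical URLs made up for illustration; only PAGINATED_URL comes from the question.

```python
import re

# The allow pattern shared by both rules in the question.
ALLOW_PATTERN = re.compile(r".*AttractionsNear-g.*")

# URL taken from the question (page offset "oa30").
PAGINATED_URL = (
    "http://www.tripadvisor.com/AttractionsNear-g55711-d1218038-oa30-"
    "Dallas_Addison_Marriott_Quorum_By_the_Galleria-Dallas_Texas.html"
)
# Hypothetical: the same URL without the page offset.
FIRST_PAGE_URL = PAGINATED_URL.replace("-oa30", "")
# Hypothetical: a page that is not an AttractionsNear listing.
OTHER_URL = "http://www.tripadvisor.com/Hotel_Review-g55711.html"


def allowed(url):
    """Return True if the allow pattern is found anywhere in the URL."""
    return ALLOW_PATTERN.search(url) is not None


for url in (PAGINATED_URL, FIRST_PAGE_URL, OTHER_URL):
    print(allowed(url), url)
```

Since the pattern accepts every AttractionsNear URL, paginated or not, only restrict_xpaths separates the two rules. If the "nearby_links" block also appears on the target pages, the first rule will extract links there as well.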

I tried to avoid the circular crawling with a DownloaderMiddleware:

class LocationsDownloaderMiddleware(object):
    def process_request(self, request, spider):
        if request.url.encode('ascii', errors='ignore') in deny_domains:
            return IgnoreRequest()
        else:
            return None

and by maintaining the deny_domains list while parsing:

def parse_item(self, response): 
    deny_domains.append(response.url.encode('ascii', errors='ignore'))

But now this middleware blocks every URL whose response I want to parse.

How can I manage this? Thanks.

Source

2015-07-20 talmosko

SgmlLinkExtractor is deprecated; you should use scrapy.linkextractors.LinkExtractor instead.

Your rules should now look like this:

rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths=['xpath_to_category'],
            allow=('regex_for_links')
        ),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            restrict_xpaths=['xpath_to_items'],
            allow=('regex_to_links')
        ),
        callback='some_parse_method',
    ),
)

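When you specify follow=True, it means that you are not using a callback; you are only saying that those links should be "followed", with the rules still applied to the pages they lead to. You can check the Scrapy documentation on crawling rules for details.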

Also, it won't make duplicate requests, because Scrapy filters them out.
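To illustrate the idea behind that filtering, here is a minimal pure-Python sketch. It is a simplified stand-in, not Scrapy's actual implementation: the SeenUrlFilter class, the crawl function and the example.com link graph are all hypothetical. The key point is that a URL is recorded when it is scheduled, before it is downloaded, so a page that links back to an earlier page cannot restart the crawl.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


class SeenUrlFilter:
    """Remember every URL that has been scheduled and reject repeats."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def _normalize(url):
        # Lowercase scheme and host and drop the fragment, so trivially
        # different spellings of the same page compare equal.
        parts = urlsplit(url)
        return urlunsplit(
            (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
        )

    def should_crawl(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        key = self._normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def crawl(start_url, links_by_page):
    """Walk a hypothetical link graph breadth-first, visiting each URL once."""
    url_filter = SeenUrlFilter()
    queue = deque()
    if url_filter.should_crawl(start_url):
        queue.append(start_url)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in links_by_page.get(url, []):
            # The URL is recorded here, at scheduling time, not after parsing.
            if url_filter.should_crawl(link):
                queue.append(link)
    return visited


# Hypothetical link graph with a cycle: a -> b -> a.
graph = {
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/a", "http://example.com/b#top"],
}
print(crawl("http://example.com/a", graph))
```

With a filter of this kind in place, a hand-maintained deny list in a downloader middleware should not be needed for plain duplicate URLs.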

Source

2015-07-21 03:26:41 eLRuLL

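In my case I need callback and follow together, because there are pages that I parse and from which I also want to continue to the next page. If Scrapy is filtering my requests, then there is no problem. Thanks – talmosko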
