Scrapy does not crawl all pages

Here is my working code:

from scrapy.item import Item, Field 

class Test2Item(Item): 
    title = Field() 

from scrapy.http import Request 
from scrapy.conf import settings 
from scrapy.selector import HtmlXPathSelector 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.contrib.spiders import CrawlSpider, Rule 

class Khmer24Spider(CrawlSpider): 
    name = 'khmer24' 
    allowed_domains = ['www.khmer24.com'] 
    start_urls = ['http://www.khmer24.com/'] 
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1" 
    DOWNLOAD_DELAY = 2 

    rules = (
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = Test2Item()
        i['title'] = (hxs.select(('//div[@class="innerbox"]/h1/text()')).extract()[0]).strip(' \t\n\r')
        return i

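It scrapes only 10 or 15 records, and the number is different every time! I can't manage to get all the pages whose URLs look like http://www.khmer24.com/ad/any-words/67-anynumber.html

I really suspect that Scrapy finishes crawling early because of duplicate requests. People suggest using dont_filter = True; however, I have no idea where to put it in my code.

I'm new to Scrapy and really need help.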

2013-02-28 Vicheanak

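idk if this is related, but there are a lot of affiliate links there doing JavaScript redirects – dm03514 2013-03-04 17:20:48

1. "They suggest using dont_filter = True; however, I have no idea where to put it in my code."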

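This parameter is used in BaseSpider, which CrawlSpider inherits from (scrapy/spider.py), and it is set to True by default there, that is, for the requests built from start_urls.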
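To make the role of dont_filter concrete, here is a minimal sketch of a seen-URL duplicate filter. It is a toy model, not Scrapy's implementation: the `Request` class, the `schedule` function, and the ad URL are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Toy stand-in for a crawl request (NOT Scrapy's Request class)."""
    url: str
    dont_filter: bool = False


def schedule(requests):
    """Return the requests that survive a simple seen-URL duplicate filter."""
    seen = set()
    accepted = []
    for request in requests:
        # A request is dropped only if its URL was seen before
        # and it did not ask to bypass the filter.
        if not request.dont_filter and request.url in seen:
            continue
        seen.add(request.url)
        accepted.append(request)
    return accepted


incoming = [
    Request("http://www.khmer24.com/", dont_filter=True),  # start URL
    Request("http://www.khmer24.com/ad/a/67-1.html"),      # hypothetical ad URL
    Request("http://www.khmer24.com/ad/a/67-1.html"),      # duplicate, dropped
    Request("http://www.khmer24.com/", dont_filter=True),  # bypasses the filter
]
scheduled = schedule(incoming)
print([r.url for r in scheduled])
```

In this model, dropping a duplicate only avoids fetching the same page twice; it does not hide pages that were never linked from a visited page, which is why the filter is unlikely to explain the missing records.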

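2. "It can scrape only 10 or 15 records."

Reason: the start_urls are not a good starting point. In this question, the spider starts crawling at http://www.khmer24.com/; let's assume it finds 10 URLs there that satisfy the pattern. The spider then goes on to crawl those 10 URLs. But since those pages contain very few links matching the pattern, the spider gets only a few more URLs to follow (or none at all), and the crawl stops.

Possible solution: the reason I gave above merely restates icecrime's opinion, and so does the solution.

Suggestion: use the "all ads" page as start_urls. (You could also use the home page as start_urls and use the new rules.)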
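The effect described above can be reproduced with a small breadth-first crawl over a made-up link graph. Everything in `LINKS` is hypothetical (the real site's link structure is not known here); the sketch only shows how following nothing but pattern-matching links can end a crawl early.

```python
import re
from collections import deque

AD_PATTERN = re.compile(r"ad/.+/67-\d+\.html")

# Hypothetical link graph: page URL -> links found on that page.
LINKS = {
    "/": ["/ad/phone/67-1.html", "/ad/car/67-2.html", "/all-ads.html"],
    "/ad/phone/67-1.html": ["/", "/ad/car/67-2.html"],
    "/ad/car/67-2.html": ["/"],
    "/all-ads.html": ["/ad/phone/67-1.html", "/ad/bike/67-3.html",
                      "/ad/house/67-4.html"],
    "/ad/bike/67-3.html": [],
    "/ad/house/67-4.html": [],
}


def crawl(start, should_follow):
    """Breadth-first crawl; return the set of ad pages that were reached."""
    seen = {start}
    queue = deque([start])
    ads = set()
    while queue:
        page = queue.popleft()
        if AD_PATTERN.search(page):
            ads.add(page)
        for link in LINKS.get(page, []):
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append(link)
    return ads


# Follow only links that match the ad pattern (the question's single rule).
only_ads = crawl("/", lambda url: AD_PATTERN.search(url) is not None)
# Follow every link, collecting ad pages along the way.
everything = crawl("/", lambda url: True)
print(sorted(only_ads))
print(sorted(everything))
```

In this toy graph the pattern-only crawl never visits /all-ads.html, so it reaches two of the four ad pages, while following every link reaches all four.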

New rules:

rules = (
    # Extract all links and follow links from them 
    # (since no callback means follow=True by default) 
    # (If "allow" is not given, it will match all links.) 
    Rule(SgmlLinkExtractor()), 

    # Extract links matching the "ad/any-words/67-anynumber.html" pattern 
    # and parse them with the spider's method parse_item (NOT FOLLOW THEM) 
    Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item'), 
)
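Before running a full crawl, it can help to check which URLs the allow pattern actually accepts. The sketch below uses only Python's re module and assumes the link extractor applies the pattern as a regular-expression search on the absolute URL; the sample URLs and the number 12345 are made up for illustration.

```python
import re

AD_PATTERN = re.compile(r"ad/.+/67-\d+\.html")

# Hypothetical sample URLs, not taken from the real site.
candidates = [
    "http://www.khmer24.com/ad/any-words/67-12345.html",
    "http://www.khmer24.com/ad/any-words/67-12345-or-words.html",
    "http://www.khmer24.com/",
]

# Map each URL to whether the pattern is found anywhere in it.
matches = {url: bool(AD_PATTERN.search(url)) for url in candidates}
for url, ok in matches.items():
    print(ok, url)
```

Only the first URL matches. Because the pattern requires digits immediately before .html, a URL that has words after the number (the second sample) is rejected, so the pattern would need to be loosened if such ad pages exist.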

See also: SgmlLinkExtractor, CrawlSpider example

2013-03-05 05:46:39

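Hey there, when I run this code, it crawls every single page on the website. However, I want it to crawl based only on the rules I set. – Vicheanak 2013-03-07 03:44:26

Do you want to crawl the whole website and get all the URLs that match the pattern? – 2013-03-07 05:35:40

URLs that match the pattern http://www.khmer24.com/ad/any-words/67-anynumber-or-words.html – Vicheanak 2013-03-07 06:43:23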
