Scrapy does not crawl. Here is my working code:
from scrapy.item import Item, Field

class Test2Item(Item):
    title = Field()
from scrapy.http import Request
from scrapy.conf import settings
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class Khmer24Spider(CrawlSpider):
    name = 'khmer24'
    allowed_domains = ['www.khmer24.com']
    start_urls = ['http://www.khmer24.com/']
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
    DOWNLOAD_DELAY = 2

    # Follow every ad detail page that matches /ad/<words>/67-<number>.html
    rules = (
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = Test2Item()
        # Extract the ad title and strip surrounding whitespace
        i['title'] = hxs.select('//div[@class="innerbox"]/h1/text()').extract()[0].strip(' \t\n\r')
        return i
It only scrapes 10 or 15 records, and it is always a random number. I cannot manage to get all the pages whose URLs look like http://www.khmer24.com/ad/any-words/67-anynumber.html.
I really suspect that Scrapy finishes the crawl early because of duplicate-request filtering. The advice I found is to use dont_filter=True, but I don't know where to put it in my code.
I am new to Scrapy and really need help.
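From reading the docs, my guess is that dont_filter has to be set on each Request that the rule generates, for example through the Rule's process_request hook. Something like the sketch below is what I imagine (process_request_dont_filter is just a name I made up, and I have not tested this):

    def process_request_dont_filter(self, request):
        # Rebuild the request with dont_filter=True so the scheduler's
        # duplicate filter does not drop it
        return request.replace(dont_filter=True)

    rules = (
        Rule(SgmlLinkExtractor(allow=r'ad/.+/67-\d+\.html'),
             callback='parse_item', follow=True,
             process_request='process_request_dont_filter'),
    )

Is that the right place to put it, or does it belong somewhere else?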
idk if this is relevant, but there are a lot of affiliate links on that site doing JavaScript redirects – dm03514 2013-03-04 17:20:48