Confusion about Scrapy redirect behavior?
So I'm trying to scrape articles from a news site with an infinite-scroll layout, and here is what happens:
example.com has the first page of articles,
example.com/page/2/ has the second page,
example.com/page/3/ has the third page,
and so on; the URL changes as you scroll down. To account for this, I wanted to scrape the first x pages of articles and did the following:
start_urls = ['http://example.com/']
for x in range(1, x):
    new_url = 'http://www.example.com/page/' + str(x) + '/'
    start_urls.append(new_url)
This seemed to work fine for the first 9 pages, and I get output like the following:
Redirecting (301) to <GET http://example.com/page/4/> from <GET http://www.example.com/page/4/>
Redirecting (301) to <GET http://example.com/page/5/> from <GET http://www.example.com/page/5/>
Redirecting (301) to <GET http://example.com/page/6/> from <GET http://www.example.com/page/6/>
Redirecting (301) to <GET http://example.com/page/7/> from <GET http://www.example.com/page/7/>
2017-09-08 17:36:23 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
Redirecting (301) to <GET http://example.com/page/8/> from <GET http://www.example.com/page/8/>
Redirecting (301) to <GET http://example.com/page/9/> from <GET http://www.example.com/page/9/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/10/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/11/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/12/>
Redirecting (301) to <GET http://www.example.com/> from <GET http://www.example.com/page/13/>
Starting from page 10, it redirects from example.com/page/10/ to a page like example.com/ instead of the original link example.com/page/10/. What could cause this behavior?
I looked at a few options such as dont_redirect, but I just don't understand what is going on. What could be the reason for this redirect behavior, especially since no redirect happens when you enter a link like example.com/page/10 directly in the browser?
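For illustration, this is roughly how I understand dont_redirect would be used per request (the spider name, page bounds and callback below are placeholders, not my actual code): overriding start_requests and passing the dont_redirect / handle_httpstatus_list meta keys so the raw 301 responses reach a callback instead of being followed.

import scrapy


class RedirectDebugSpider(scrapy.Spider):
    # Placeholder spider, only meant to show where the meta keys go.
    name = 'redirect_debug'

    def start_requests(self):
        # Same paginated URLs as above; the bounds are placeholders.
        for page in range(1, 13):
            url = 'http://www.example.com/page/' + str(page) + '/'
            # dont_redirect keeps RedirectMiddleware from following the 301;
            # handle_httpstatus_list lets the 301 response reach the callback.
            yield scrapy.Request(
                url,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
                callback=self.parse_redirect,
            )

    def parse_redirect(self, response):
        # Log where each page would have been redirected to.
        self.logger.info('%s -> %s (status %d)',
                         response.url,
                         response.headers.get('Location'),
                         response.status)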
Any help would be greatly appreciated, thanks!
[EDIT]
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class spider(CrawlSpider):
    start_urls = ['http://example.com/']
    for x in range(startPage, endPage):
        new_url = 'http://www.example.com/page/' + str(x) + '/'
        start_urls.append(new_url)

    custom_settings = {'DEPTH_PRIORITY': 1, 'DEPTH_LIMIT': 1}

    rules = (
        Rule(LinkExtractor(allow=('some regex here',),
                           deny=(r'example\.com/page/.*', 'some other regex',)),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # some parsing work here
        yield item
Is it because I included example\.com/page/.* in the LinkExtractor deny? Shouldn't that only apply to links that are not start_urls?
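To check that, I suppose the extractor can be inspected directly in scrapy shell; a minimal sketch, using the same placeholder patterns as above:

from scrapy.linkextractors import LinkExtractor

# Run inside `scrapy shell http://example.com/` so that `response` exists.
extractor = LinkExtractor(allow=('some regex here',),
                          deny=(r'example\.com/page/.*', 'some other regex',))

# extract_links() returns the links the rule would actually follow;
# the deny patterns filter only these extracted links, not start_urls.
for link in extractor.extract_links(response):
    print(link.url)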
Are you getting redirected because the page doesn't exist? What site are you trying to scrape? – Bricky
Could you post a minimal example of your actual code? – Bricky
@Bricky I can't post the details, but I've updated the question to include everything relevant, thanks! – ocean800