Scrapy: Spider error processing

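I am new to programming in Python and to using Scrapy. I am scraping a web page and then saving the collection to MongoDB. I ran into an error while crawling. I have looked at similar help pages on this site and even followed a tutorial from start to finish, to no avail. Any help would be greatly appreciated.

This is the error I'm getting in the terminal: Spider error processing

Here is my code: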

from scrapy.item import Item, Field

#class 1
class StackItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pagetitle = Field()
    newsmain = Field()
    pass

from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem

#class 2
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["docs.python.org"]
    start_urls = ["https://docs.python.org/2/howto/curses.html",]

    def parse(self, response):
        information = Selector(response.body).xpath('//div[@class="section"]')

        for data in information:
            item = StackItem()
            item['pagetitle'] = data.information('//*[@id="curses-programming-with-python"]').extract()
            item['newsmain'] = data.information('//*[@id="what-is-curses"]').extract()

        yield item


2016-11-20 Daniel

Can you fix the indentation of your pasted code?

scrapy.selector.Selector.__init__() expects a Response object as its first argument.

If you want to build a selector from an HTTP response body, use the text= argument:

$ scrapy shell https://docs.python.org/2/howto/curses.html 
2016-11-21 11:05:34 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot) 
(...) 
2016-11-21 11:05:35 [scrapy] INFO: Spider opened 
2016-11-21 11:05:35 [scrapy] DEBUG: Crawled (200) <GET https://docs.python.org/2/howto/curses.html> (referer: None) 
(...) 
>>> 
>>> # 
>>> # passing response.body (bytes) instead of a Response object fails 
>>> # 
>>> scrapy.Selector(response.body) 
Traceback (most recent call last): 
    File "<console>", line 1, in <module> 
    File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/selector/unified.py", line 67, in __init__ 
    text = response.text 
AttributeError: 'str' object has no attribute 'text' 
>>> 
>>> # 
>>> # use text= argument to pass response body 
>>> # 
>>> scrapy.Selector(text=response.body) 
<Selector xpath=None data=u'<html xmlns="http://www.w3.org/1999/xhtm'> 
>>> 
>>> scrapy.Selector(text=response.body).xpath('//div[@class="section"]') 
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>] 
>>>

A simpler way is to pass the response object directly:

>>> scrapy.Selector(response).xpath('//div[@class="section"]') 
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]

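An even simpler way is to use the .xpath() method on the response instance (a convenience method that creates a selector for you), provided your response is an HtmlResponse or XmlResponse (which is usually the case for web scraping):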

>>> response.xpath('//div[@class="section"]') 
[<Selector xpath='//div[@class="section"]' data=u'<div class="section" id="curses-programm'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="what-is-curses"'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="the-python-curs'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="starting-and-en'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="windows-and-pad'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="displaying-text'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="attributes-and-'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="user-input">\n<h'>, <Selector xpath='//div[@class="section"]' data=u'<div class="section" id="for-more-inform'>]


2016-11-21 10:13:11

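@paultmbrth Thank you for your reply. I will try your suggestion now. I also didn't realise there was an indentation problem; I'll keep that in mind for next time. – Daniel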

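I just tried the suggestion and it doesn't work; I'm sure it's something I did. If you can, do you have any suggestions on which changes I would have to make to my original code? ... I see you are using the shell. I tried running the shell to help me, but I'm getting an import error for _curses. – Daniel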
