2015-05-01 58 views
0

I have the code below, which crawls all the available pages from a website. It "scrapes" the valid pages correctly — when I use the print function I can see the data from the `items` list — but when I try to dump the statistics to a `.csv` destination file, I see no output at all (I run this at the command prompt: `scrapy crawl craig -o test.csv -t csv`). Please help me output the data to a `csv` file. How do I output data scraped from multiple web pages to a csv file using Scrapy with Python?

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.exceptions import CloseSpider 
from scrapy.http import Request 
from test.items import CraigslistSampleItem 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 

URL = "http://example.com/subpage/%d" 


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["xyz.com"]

    #for u in URL:
    start_urls = [URL % 1]

    def __init__(self):
        self.page_number = 1

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='thumb']")
        if not titles:
            raise CloseSpider('No more pages')
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("a/@title").extract()
            item["url"] = titles.select("a/@href").extract()
            items.append(item)
        yield items

        self.page_number += 1
        yield Request(URL % self.page_number)
+1

`for titles in titles` looks like a typo –

+0

Also, I don't understand why you build a list of items. Why not just `yield item` at the end of the loop body? –

+0

Thanks Martin.. if I don't use the items list, I don't know how to get all the items scraped from the page; that's the only reason I'm using it. I tried `yield item` without appending, and by doing that it only yields one of the items scraped from the page (there are actually several).. Please correct me if I'm doing something wrong.. – user3128771
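The difference the commenters are describing can be seen with plain Python generators, no Scrapy required (the entry strings below are made up for illustration): yielding a list produces a single value containing everything, while yielding inside the loop produces one value per scraped entry.

```python
# Plain-Python sketch of the yield-a-list vs. yield-per-item difference.
# The entry strings are hypothetical, purely for illustration.
def parse_as_list(entries):
    items = []
    for e in entries:
        items.append({"title": e})
    yield items  # yields ONE value: the whole list


def parse_per_item(entries):
    for e in entries:
        yield {"title": e}  # yields one value per entry


print(len(list(parse_as_list(["a", "b", "c"]))))   # 1
print(len(list(parse_per_item(["a", "b", "c"]))))  # 3
```

Scrapy treats each yielded value as one item, which is why yielding the whole list confuses the CSV exporter while yielding inside the loop gives one row per entry.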

Answer

0
from scrapy.spider import BaseSpider
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from test.items import CraigslistSampleItem

URL = "http://example.com/subpage/%d"


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["xyz.com"]

    def start_requests(self):
        # Pages are numbered from 1, so start the range at 1
        for i in range(1, 11):
            yield Request(URL % i, callback=self.parse)

    def parse(self, response):
        titles = response.xpath("//div[@class='thumb']")
        if not titles:
            raise CloseSpider('No more pages')
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.xpath("./a/@title").extract()
            item["url"] = title.xpath("./a/@href").extract()
            # Yield each item individually so the exporter
            # writes one CSV row per scraped entry
            yield item
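To see why per-item yields matter for the export, here is a rough stand-in (plain Python with hypothetical data, not Scrapy's actual exporter code) for what `-o test.csv -t csv` does with the yielded items: each dict-like item becomes one CSV row.

```python
import csv
import io

# Hypothetical items, standing in for what the spider yields
items = [
    {"title": "First ad", "url": "http://example.com/1"},
    {"title": "Second ad", "url": "http://example.com/2"},
]

# Write one row per item, as Scrapy's CSV feed export does
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```

If the spider instead yields a single list, the exporter sees only one "item" and the per-row output is lost, which matches the empty/odd CSV the question describes.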
+0

Hi, thank you very much for your answer! Very helpful. – user3128771
