2013-05-26 80 views

I'm trying to scrape Craigslist using Scrapy and have successfully fetched the listing URLs, but now I want to extract data from the pages behind those URLs. Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem

class craigslist_spider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craiglist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
        "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    items = []
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        #item['desc'] = site.select('text()').extract()
        items.append(item)
    hxs = HtmlXPathSelector(response)
    #print title, link
    return items

I'm new to Scrapy and can't figure out how to actually follow each URL (href) and extract data from the pages behind them, and to do this for all of the start_urls.


Since you are crawling, use `CrawlSpider`. Read through a few of the examples in the docs. – Blender

Answers


Responses are received one by one in the parse method, and if you only want to grab information from the start_urls responses, your code is almost OK. But your parse method should be inside your craigslist_spider class, not outside it.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    items = []
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        items.append(item)
    #print title, link
    return items

But what if you want half of the information from the start_urls responses, and the other half from the pages that the anchors in those responses link to?

from urlparse import urljoin    # Python 2; needed to build absolute URLs
from scrapy.http import Request


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        links = site.select('a/@href').extract()    # extract() returns a list
        if links:
            link = links[0]
            if 'http://' not in link:
                link = urljoin(response.url, link)
            item['link'] = link
            yield Request(link,
                          meta={'item': item},
                          callback=self.anchor_page)


def anchor_page(self, response):
    hxs = HtmlXPathSelector(response)
    # Receive the item built in parse, passed along in the Request meta
    old_item = response.request.meta['item']
    # Parse some more values and place them in old_item, e.g.:
    old_item['bla_bla'] = hxs.select("bla bla").extract()
    yield old_item

You just need to yield a Request in the parse method and pass the current item along in the Request's meta. Then in anchor_page, pull the old_item out of the meta, add the new values to it, and simply yield it to ship your old item.
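The meta hand-off described above can be sketched in plain Python (no Scrapy; all names here are illustrative stand-ins, not Scrapy APIs):

```python
# Minimal sketch of the two-callback pattern: the first callback builds a
# partial item and attaches it as "meta"; the second callback completes it.

def parse(listing_rows):
    # listing_rows stands in for the (title, href) pairs scraped from a search page
    for title, link in listing_rows:
        item = {'title': title, 'link': link}
        # In Scrapy: yield Request(link, meta={'item': item}, callback=self.anchor_page)
        yield {'url': link, 'meta': {'item': item}, 'callback': anchor_page}

def anchor_page(fake_response):
    # Same idea as old_item = response.request.meta['item']
    old_item = fake_response['meta']['item']
    old_item['desc'] = fake_response['body']    # add the newly scraped value
    return old_item

requests = list(parse([('HR Admin', '/sby/sof/3824966457.html')]))
item = requests[0]['callback']({'meta': requests[0]['meta'],
                                'body': 'full posting text'})
```

The item dict flows untouched from the first callback to the second, which is exactly what `meta={'item': item}` achieves in Scrapy.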


There is a problem with the XPaths - they should be relative. Here is the full code:

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class CraigslistItem(Item):
    title = Field()
    link = Field()


class CraigslistSpider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
        "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//span[@class='pl']")
        items = []
        for site in sites:
            item = CraigslistItem()
            item['title'] = site.select('.//a/text()').extract()[0]
            item['link'] = site.select('.//a/@href').extract()[0]
            items.append(item)
        return items

If you run it via:

scrapy runspider spider.py -o output.json 

you will see in output.json:

{"link": "/sby/sof/3824966457.html", "title": "HR Admin/Tech Recruiter"} 
{"link": "/eby/sof/3824932209.html", "title": "Entry Level Web Developer"} 
{"link": "/sfc/sof/3824500262.html", "title": "Sr. Ruby on Rails Contractor @ Funded Startup"} 
... 
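Note that the link values in the output are relative paths. If absolute URLs are needed, they can be joined against the response URL with the standard library (shown here with Python 3's urllib.parse for illustration; the Python 2 code above would import from urlparse instead):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://sfbay.craigslist.org/search/sof"
relative = "/sby/sof/3824966457.html"
absolute = urljoin(base, relative)
print(absolute)  # http://sfbay.craigslist.org/sby/sof/3824966457.html
```

This is the same `urljoin(response.url, link)` step used in the other answer before yielding the follow-up Request.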

Hope that helps.