2017-01-17 100 views

Conditional crawling with Scrapy

My HTML contains a number of divs that mostly share the same structure. Below is an excerpt with two such divs:

<!-- 1st Div start --> 

<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.xxxxxx.com"></a> 
<div class="abc xyz" title="verified"></div> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">1223 Industrial Blvd</span><br> 
        <span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 

<!-- 2nd Div start --> 

<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.yyyyyy.com"></a> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">7890 Business St</span><br> 
        <span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 

Here is what I want Scrapy to do:

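If the div with class="outer-container" contains another div with title="verified", as in the first div above, it should follow the URL listed above that div (i.e. www.xxxxxx.com) and fetch some other fields on that page.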

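If there is no div with title="verified", as in the second div above, it should fetch all the data under the div with class="mody", i.e. company name (Fat Dude, LLC), address, city, state, etc., and NOT follow the URL (i.e. www.yyyyyy.com).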

So how do I apply this condition/logic in a Scrapy spider? I was thinking of using BeautifulSoup, but I'm not sure....

What I have tried so far:

class MySpider(CrawlSpider): 
    name = 'dknfetch' 
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing'] 
    allowed_domains = ['www.xxxxx.com'] 
    def parse(self, response): 
      hxs = Selector(response) 
      soup = BeautifulSoup(response.body, 'lxml') 
      nf = NewsFields() 
      cName = soup.find_all("a", class_="mheading primary h4") 
      addrs = soup.find_all("span", itemprop_="Address") 
      loclity = soup.find_all("span", itemprop_="Locality") 
      region = soup.find_all("span", itemprop_="Region") 
      post = soup.find_all("span", itemprop_="postalCode") 

      nf['companyName'] = cName[0]['content'] 
      nf['address'] = addrs[0]['content'] 
      nf['locality'] = loclity[0]['content'] 
      nf['state'] = region[0]['content'] 
      nf['zipcode'] = post[0]['content'] 
      yield nf 
      for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract(): 
        yield Request(url, callback=self.parse)

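Of course, the code above crawls every URL under the div with class="inner-container", because it contains no condition for deciding which links to follow, and I don't know where or how to set that condition.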

If anyone has done something similar before, please share. Thanks.

Answer


There is no need to use BeautifulSoup: Scrapy comes with its own selector capabilities (also released separately as parsel). Let's use your HTML to build an example:

html = u""" 
<!-- 1st Div start --> 
<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.xxxxxx.com"></a> 
<div class="abc xyz" title="verified"></div> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">1223 Industrial Blvd</span><br> 
        <span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 
<!-- 2nd Div start --> 
<div class="outer-container"> 
<div class="inner-container"> 
<a href="www.yyyyyy.com"></a> 
<div class="mody"> 
     <div class="row"> 
      <div class="col-md-5 col-xs-12"> 
       <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2> 
       <div class="mvsdfm casmhrn" itemprop="address"> 
        <span itemprop="Address">7890 Business St</span><br> 
        <span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span> 
       </div> 
       <div class="hidden-device-xs" itemprop="phone" rel="mainPhone"> 
        (800) 845-0000 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
</div> 
""" 

from parsel import Selector

sel = Selector(text=html)
for div in sel.css('.outer-container'):
    # A non-empty selector list is truthy: this container has a "verified" marker
    if div.css('div[title="verified"]'):
        url = div.css('a::attr(href)').extract_first()
        print('verified, follow this URL:', url)
    else:
        # No marker: extract the fields from this container only
        nf = {}
        nf['companyName'] = div.xpath('string(.//h2)').extract_first()
        nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
        nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
        nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
        nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
        print('not verified, extracted item is:', nf)

The result of the previous snippet (run under Python 3) is:

verified, follow this URL: www.xxxxxx.com
not verified, extracted item is: {'companyName': 'Fat Dude, LLC', 'address': '7890 Business St', 'locality': 'Tokyo', 'state': 'MA', 'zipcode': '987655'}

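With Scrapy, however, you don't even need to instantiate the Selector class: shortcuts are available on the response object passed to callbacks. Also, you shouldn't subclass CrawlSpider; the plain Spider class is enough. Putting it all together: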

from scrapy import Spider, Request
from myproject.items import NewsFields


class MySpider(Spider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        # response.css() is a shortcut for response.selector.css()
        for div in response.css('.outer-container'):
            if div.css('div[title="verified"]'):
                # Verified listing: follow its link (parse() is the default callback)
                url = div.css('a::attr(href)').extract_first()
                yield Request(url)
            else:
                # Unverified listing: scrape the fields in place, do not follow the link
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
                nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
                nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
                yield nf
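
One caveat with the sample markup: the href values (www.xxxxxx.com) carry no scheme, and Scrapy's Request raises a ValueError for a URL without one. Here is a minimal sketch of a workaround, assuming the links are always bare host names or full absolute URLs. ensure_scheme is a hypothetical helper (not part of Scrapy or parsel), and defaulting to http is also an assumption:

```python
from urllib.parse import urlparse


def ensure_scheme(href, default_scheme="http"):
    """Return href unchanged if it already has a scheme, else prepend one.

    Hypothetical helper for illustration; it assumes href is a bare host
    name such as "www.xxxxxx.com" or a full absolute URL, not a relative path.
    """
    href = href.strip()
    if urlparse(href).scheme:
        # Already absolute, e.g. "https://www.yyyyyy.com"
        return href
    # Drop any leading slashes (covers "//host" links) and add the scheme
    return "{}://{}".format(default_scheme, href.lstrip("/"))
```

In the verified branch of the spider, the request could then be built as Request(ensure_scheme(url)). For genuinely relative links such as /c/my-llc, response.urljoin(url) is the better fit, since it resolves the link against the page URL.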

I suggest you get familiar with Parsel's API: https://parsel.readthedocs.io/en/latest/usage.html

Happy scraping!