2015-04-08 66 views

Extracting 3 levels of content with Scrapy

I have a seed URL (say, DOMAIN/manufacturers.php), with no pagination, that looks like this:

<!DOCTYPE html> 
<html> 
<head> 
    <title></title> 
</head> 

<body> 
    <div class="st-text"> 
     <table cellspacing="6" width="600"> 
      <tr> 
       <td> 
        <a href="manufacturer1-type-59.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer1-type-59.php">Name 1</a> 
       </td> 

       <td> 
        <a href="manufacturer2-type-5.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer2-type-5.php">Name 2</a> 
       </td> 
      </tr> 

      <tr> 
       <td> 
        <a href="manufacturer3-type-88.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer3-type-88.php">Name 3</a> 
       </td> 

       <td> 
        <a href="manufacturer4-type-76.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer4-type-76.php">Name 4</a> 
       </td> 
      </tr> 

      <tr> 
       <td> 
        <a href="manufacturer5-type-28.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer5-type-28.php">Name 5</a> 
       </td> 

       <td> 
        <a href="manufacturer6-type-48.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer6-type-48.php">Name 6</a> 
       </td> 
      </tr> 
     </table> 
    </div> 
</body> 
</html> 

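From there, I want to get all the a['href'] values, for example: manufacturer1-type-59.php. Note that these links do not contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?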
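A minimal sketch of how relative links like these can be resolved against the address of the page they were found on. The address http://example.com/manufacturers.php is a placeholder standing in for DOMAIN, and the list of hrefs is copied by hand from the sample HTML above. Scrapy responses offer the same operation as response.urljoin(href), so the prefix does not have to be glued on manually.

```python
from urllib.parse import urljoin

# Assumption: the listing page was downloaded from this placeholder address.
page_url = "http://example.com/manufacturers.php"

# Relative links as they appear in the sample HTML.
hrefs = ["manufacturer1-type-59.php", "manufacturer2-type-5.php"]

# urljoin replaces the last path segment of page_url with each relative href.
absolute = [urljoin(page_url, href) for href in hrefs]

for link in absolute:
    print(link)
```

Joining against the page address rather than a hard-coded prefix also keeps working if a link is already absolute or starts with a slash.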

Optionally, I would like to keep the links in memory (for the next stage) and also save them to disk for future reference.

The content of each link, such as manufacturer1-type-59.php, looks like this:

<!DOCTYPE html> 
<html> 
<head> 
    <title></title> 
</head> 

<body> 
    <div class="makers"> 
     <ul> 
      <li> 
       <a href="manufacturer1_model1_type1.php"></a> 
      </li> 

      <li> 
       <a href="manufacturer1_model1_type2.php"></a> 
      </li> 

      <li> 
       <a href="manufacturer1_model2_type3.php"></a> 
      </li> 
     </ul> 
    </div> 

    <div class="nav-band"> 
     <div class="nav-items"> 
      <div class="nav-pages"> 
       <span>Pages:</span><strong>1</strong> 
       <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a> 
       <a href="manufacturer1-type-STRING-59-INT-p3.php">3</a> 
       <a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a> 
      </div> 
     </div> 
    </div> 
</body> 
</html> 

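Next, I want to get all the a['href'] values, for example manufacturer_model1_type1.php. Again, note that these links do not contain the domain prefix. An additional difficulty here is that these pages are paginated, so I also want to visit all of those pages. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.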

(Optional) I would also like to keep the links in memory (for the next stage) and save them to disk for future reference.

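The third and final step should be to retrieve the content of all pages of the manufacturer_model1_type1.php kind, extract the title, and save the result to a file in the following format: (url, title).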
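A rough sketch of that last step using only the standard library, to make the intended output concrete. The helper names (TitleParser, extract_title, write_rows) and the sample page are made up for illustration, and CSV is an assumed file format. Inside a Scrapy project the writing part is normally handled by the feed exports (for example, scrapy crawl NAME -o items.csv) rather than by hand.

```python
import csv
import io
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped <title> text of an HTML document ('' if none)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def write_rows(pages, out):
    """Write one (url, title) CSV row per (url, html) pair to a text stream."""
    writer = csv.writer(out)
    for url, html in pages:
        writer.writerow([url, extract_title(html)])


# Hypothetical page standing in for a downloaded model page.
sample = [(
    "http://example.com/manufacturer1_model1_type1.php",
    "<html><head><title>Model 1</title></head><body></body></html>",
)]
buffer = io.StringIO()  # swap for open("titles.csv", "w", newline="") on disk
write_rows(sample, buffer)
print(buffer.getvalue())
```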

EDIT

This is what I have done so far, but it does not seem to work...

import scrapy 

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors import LinkExtractor 

class ArchiveItem(scrapy.Item): 
    url = scrapy.Field() 

class ArchiveSpider(CrawlSpider): 
    name = 'gsmarena' 
    allowed_domains = ['gsmarena.com'] 
    start_urls = ['http://www.gsmarena.com/makers.php3'] 
    rules = [
        Rule(LinkExtractor(allow=[r'\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=[r'\S+_\S+_\S+-\d+\.php']), callback='parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent

Is it a public site? Can you share the URL? (It would make it easier to help.) – alecxe


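Sure, the example above is based on the following seed URL (http://www.gsmarena.com/makers.php3). However, what matters most to me is understanding the underlying idea. Still, if you could give me a working example, it would be much easier for me to understand all these concepts. :) – user706838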


Hi @alecxe, I have just added what I have tried so far (although it does not work, yet!). Could you take a look? Thanks! – user706838

Answers


I think you are better off using Spider (BaseSpider) instead of CrawlSpider.

This BaseSpider code may help:

from scrapy import Request, Spider


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3']
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        # Level 1: manufacturer links on the seed page.
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for marker in markers:
            yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        # Level 2: phone links on a manufacturer page.
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        if not phones:
            return
        for phone in phones:
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)

        # Pagination: follow the "Next page" link with the same callback.
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # Level 3: extract whatever you want and yield items here.
        pass

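EDIT: If you want to keep track of where these phone URLs came from, you can pass the URL along as meta from parse, through parse_marker, to parse_phone. The requests would then look like this: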

yield Request(url=self.BASE_URL + marker, callback=self.parse_marker, meta={'url_level1': response.url}) 

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone, meta={'url_level2': response.url, 'url_level1': response.meta['url_level1']})
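To illustrate the receiving end of that chain, here is a hedged sketch of what parse_phone could do with the values carried in meta. It is written as a plain function (build_phone_record is a made-up name) so it can be read in isolation; it only assumes that response behaves like a Scrapy response, with .url, a .meta dict, and .xpath(...).extract() returning a list of strings. Taking the page <title> as the phone title is also an assumption.

```python
def build_phone_record(response):
    """Combine data from the phone page with the URLs carried in meta."""
    # Assumption: the <title> element holds the text we want to keep.
    titles = response.xpath('//title/text()').extract()
    return {
        'url': response.url,
        'title': titles[0].strip() if titles else None,
        # .get() avoids a KeyError if a request was built without meta.
        'url_level1': response.meta.get('url_level1'),
        'url_level2': response.meta.get('url_level2'),
    }
```

Inside the spider, parse_phone would yield this record, or copy its fields into an Item.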

Hi! Is there any particular reason why you favor 'Spider' over 'CrawlSpider'? By the way, my solution above seems to work OK, but I would also like to use proxies. Any ideas? – user706838


For the difference between CrawlSpider and the base Spider, see [here](http://doc.scrapy.org/en/latest/topics/spiders.html). For proxies you can use a proxy middleware: [proxy middleware](http://stackoverflow.com/questions/20792152/setting-scrapy-proxy-middleware-to-rotate-on-each-request) – Jithin


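Thanks for the link about 'proxies'; I find it hard to understand how it works, though! Ideally, what I want is to provide a list of ip:port entries, e.g. [ip1:port1, ip2:port2, ip3:port3], and let 'CrawlSpider' pick one at a time (at random). Note that I still want to use 'CrawlSpider'. So, how can I do that? – user706838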
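One possible shape for this, offered as a sketch rather than a tested setup: a small downloader middleware that picks a random proxy for every request. Because it lives in the middleware layer, the spider class (CrawlSpider or otherwise) does not change. The class name RandomProxyMiddleware, the PROXY_LIST setting, the module path myproject.middlewares, and the priority 100 are all assumptions made for this example.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("PROXY_LIST must contain at least one proxy")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, e.g.
        # PROXY_LIST = ['http://ip1:port1', 'http://ip2:port2']
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # Scrapy routes the request through the address in meta['proxy'].
        request.meta['proxy'] = random.choice(self.proxies)
        # Returning None lets the request continue down the middleware chain.
        return None


# Assumed settings.py entries (module path and priority are placeholders):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyMiddleware': 100}
# PROXY_LIST = ['http://ip1:port1', 'http://ip2:port2', 'http://ip3:port3']
```

Choosing the proxy per request, rather than once per spider, spreads the crawl across the whole list.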