
Extracting 3 levels of content with Scrapy

I have a seed URL (say DOMAIN/manufacturers.php), with no pagination, that looks like this:

<!DOCTYPE html> 
<html> 
<head> 
    <title></title> 
</head> 

<body> 
    <div class="st-text"> 
     <table cellspacing="6" width="600"> 
      <tr> 
       <td> 
        <a href="manufacturer1-type-59.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer1-type-59.php">Name 1</a> 
       </td> 

       <td> 
        <a href="manufacturer2-type-5.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer2-type-5.php">Name 2</a> 
       </td> 
      </tr> 

      <tr> 
       <td> 
        <a href="manufacturer3-type-88.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer3-type-88.php">Name 3</a> 
       </td> 

       <td> 
        <a href="manufacturer4-type-76.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer4-type-76.php">Name 4</a> 
       </td> 
      </tr> 

      <tr> 
       <td> 
        <a href="manufacturer5-type-28.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer5-type-28.php">Name 5</a> 
       </td> 

       <td> 
        <a href="manufacturer6-type-48.php"></a> 
       </td> 

       <td> 
        <a href="manufacturer6-type-48.php">Name 6</a> 
       </td> 
      </tr> 
     </table> 
    </div> 
</body> 
</html> 

From there, I'd like to extract all the a['href']s, e.g. manufacturer1-type-59.php. Note that these links do NOT contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?

Optionally, I'd also like to keep the links in memory (for the next stage) and save them to disk for future reference.
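For what it's worth, a minimal sketch of this first level (the spider name, the DOMAIN placeholder, the output filename, and the parse_manufacturer callback are all illustrative, not from the original site):

import scrapy
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin


class ManufacturersSpider(scrapy.Spider):
    name = 'manufacturers'
    start_urls = ['http://DOMAIN/manufacturers.php']

    def parse(self, response):
        # level 1: relative manufacturer links inside the st-text table
        for href in response.xpath('//div[@class="st-text"]//a/@href').extract():
            full_url = urljoin(response.url, href)  # prepend the missing domain
            # optionally persist each link to disk for future reference
            with open('level1_links.txt', 'a') as links_file:
                links_file.write(full_url + '\n')
            # Scrapy's built-in duplicate filter drops the second request
            # generated by each href (every link appears twice in the table)
            yield scrapy.Request(full_url, callback=self.parse_manufacturer)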

The content behind each link, e.g. manufacturer1-type-59.php, looks like this:

<!DOCTYPE html> 
<html> 
<head> 
    <title></title> 
</head> 

<body> 
    <div class="makers"> 
     <ul> 
      <li> 
       <a href="manufacturer1_model1_type1.php"></a> 
      </li> 

      <li> 
       <a href="manufacturer1_model1_type2.php"></a> 
      </li> 

      <li> 
       <a href="manufacturer1_model2_type3.php"></a> 
      </li> 
     </ul> 
    </div> 

    <div class="nav-band"> 
     <div class="nav-items"> 
      <div class="nav-pages"> 
       <span>Pages:</span><strong>1</strong> 
       <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a> 
       <a href="manufacturer1-type-STRING-59-INT-p3.php">3</a> 
       <a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a> 
      </div> 
     </div> 
    </div> 
</body> 
</html> 

Next, I'd like to extract all the a['href']s, e.g. manufacturer_model1_type1.php. Again, note that these links do not contain the domain prefix. An additional difficulty here is that these pages support pagination, so I'd like to visit all of those pages too. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.

(Optionally) I'd also like to keep the links in memory (for the next stage) and save them to disk for future reference.
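A sketch of how that second level plus pagination could look, continuing the hypothetical spider above (parse_model is again a placeholder, defined in the next step):

    def parse_manufacturer(self, response):
        # level 2: relative model links inside the "makers" list
        for href in response.xpath('//div[@class="makers"]/ul/li/a/@href').extract():
            yield scrapy.Request(urljoin(response.url, href), callback=self.parse_model)

        # feed the "Next page" link back into this same callback; the
        # recursion ends naturally on the last page, which has no such link
        next_page = response.xpath('//a[@title="Next page"]/@href').extract()
        if next_page:
            yield scrapy.Request(urljoin(response.url, next_page[0]),
                                 callback=self.parse_manufacturer)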

The third and final step should be to retrieve the content of all pages of type manufacturer_model1_type1.php, extract the title, and save the result in a file in the following format: (url, title).
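A sketch of that final step, still building on the hypothetical spider above: an Item to hold each pair, and a callback that fills it from the page <title>:

class PageItem(scrapy.Item):
    # one (url, title) pair per crawled model page
    url = scrapy.Field()
    title = scrapy.Field()

and, inside the spider:

    def parse_model(self, response):
        item = PageItem()
        item['url'] = response.url
        title = response.xpath('//title/text()').extract()
        item['title'] = title[0].strip() if title else ''
        yield item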

EDIT

Here is what I've done so far, but it doesn't seem to work...

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ArchiveItem(scrapy.Item):
    url = scrapy.Field()


class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']
    rules = [
        # level 1: manufacturer listing pages (followed, no callback)
        Rule(LinkExtractor(allow=[r'\S+-phones-\d+\.php'])),
        # level 2: paginated manufacturer pages (followed, no callback)
        Rule(LinkExtractor(allow=[r'\S+-phones-f-\d+-0-\S+\.php'])),
        # level 3: individual phone pages, handed to parse_archive
        Rule(LinkExtractor(allow=[r'\S+_\S+_\S+-\d+\.php']), 'parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent
It's a public site; can you share the URL? (It would help us help you.) – alecxe

Sure, the example above is based on the following seed URL (http://www.gsmarena.com/makers.php3). However, what matters most to me is understanding the underlying logic. That said, if you can give me a working example, it will be much easier for me to grasp all these concepts. :) – user706838

Hi @alecxe, I just added what I've tried so far (although it doesn't work - yet!). Could you take a look? Thanks! – user706838

Answer


I think you're better off using a plain Spider here instead of CrawlSpider. This code might help:

from scrapy import Spider
from scrapy.http import Request


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        # level 1: manufacturer links from the makers table
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for marker in markers:
            yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        # level 2: phone links on the current manufacturer page
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        for phone in phones:
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)

        # pagination: feed the "Next page" link back into this same callback
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # level 3: extract whatever fields you want and yield items here
        pass
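If this spider lives in a regular Scrapy project, running it with something like `scrapy crawl gsmarena -o results.csv` would have the feed exporter write out whatever items parse_phone ends up yielding.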

EDIT: if you want to keep track of where these phone URLs came from, you can pass the URLs along as meta, from parse through parse_marker to parse_phone. The requests would then look like this (the first yielded from parse, the second from parse_marker):

yield Request(url=self.BASE_URL + marker, callback=self.parse_marker, meta={'url_level1': response.url}) 

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone, meta={'url_level2': response.url, 'url_level1': response.meta['url_level1']})
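parse_phone can then read the whole trail back out of response.meta. A sketch of what that might look like (yielding a plain dict needs Scrapy 1.0+; on older versions, use an Item):

    def parse_phone(self, response):
        title = response.xpath('//title/text()').extract()
        yield {
            'url_level1': response.meta['url_level1'],  # seed page
            'url_level2': response.meta['url_level2'],  # manufacturer page
            'url': response.url,                        # the phone page itself
            'title': title[0].strip() if title else '',
        }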
Hi! Any particular reason why you favour `Spider` over `CrawlSpider`? By the way, my solution above looks OK, but I'd also like to use proxies. Any ideas? – user706838

For the difference between CrawlSpider and the base Spider, see [here](http://doc.scrapy.org/en/latest/topics/spiders.html). For proxies, you can use a [proxy middleware](http://stackoverflow.com/questions/20792152/setting-scrapy-proxy-middleware-to-rotate-on-each-request). – Jithin

Thanks for the link about proxies; I'm finding it hard to understand how it works - yet! Ideally, what I'd like to do is provide a list of ip:port pairs, e.g. [ip1:port1, ip2:port2, ip3:port3], and have the `CrawlSpider` pick one at a time, at random. Note that I still want to use `CrawlSpider`. So, how do I do that? – user706838