2013-10-11

Scrapy: parsing list items into separate rows

I tried adapting the answer from this question to my problem, but without success.

Here is some example HTML:

<div id="provider-region-addresses"> 
<h3>Contact details</h3> 
<h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>North Shore Hospital</dd><dt>Physical address</dt> 
       <dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt> 
       <dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt> 
       <dd>0740</dd><dt>District/town</dt> 

       <dd> 
       North Shore, Takapuna</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 486 8996</dd><dt>Fax</dt> 
       <dd>(09) 486 8342</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>Physical address</dt> 
       <dd>Helensville</dd><dt>Postal address</dt> 
       <dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt> 
       <dd>0840</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Helensville</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 420 9450</dd><dt>Fax</dt> 
       <dd>(09) 420 7050</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>Physical address</dt> 
       <dd>Warkworth</dd><dt>Postal address</dt> 
       <dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt> 
       <dd>0941</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Warkworth</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 422 2700</dd><dt>Fax</dt> 
       <dd>(09) 422 2709</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>Waitakere Hospital</dd><dt>Physical address</dt> 
       <dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt> 
       <dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt> 
       <dd>0650</dd><dt>District/town</dt> 

       <dd> 
       Waitakere, Henderson</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 839 0000</dd><dt>Fax</dt> 
       <dd>(09) 837 6634</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    <h2 class="toggler nohide">Auckland</h2> 
    <dl class="clear"> 
     <dt>More information</dt> 
      <dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt> 
       <dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt> 
       <dd>0932</dd><dt>District/town</dt> 

       <dd> 
       Rodney, Red Beach</dd><dt>Region</dt> 
       <dd>Auckland</dd><dt>Phone</dt> 
       <dd>(09) 427 0300</dd><dt>Fax</dt> 
       <dd>(09) 427 0391</dd><dt>Website</dt> 
       <dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd> 
    </dl> 
    </div> 


Here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1


class WebhealthSpider(BaseSpider):

    name = "webhealth_content1"
    download_delay = 5
    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        results = hxs.select('//*[@id="content"]/div[1]')
        items1 = []
        for result in results:
            item = WebhealthItem1()
            item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = result.select('//h1/text()').extract()
            item['hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1

From here, how do I parse the list items onto separate rows, with the name field holding the corresponding //h1/text() value? At the moment I get the full list of matches for each XPath in a single cell. Does this have to do with how I declared the XPaths?

Thanks

Answer


First, you are using results = hxs.select('//*[@id="content"]/div[1]'), so

    results = hxs.select('//*[@id="content"]/div[1]')
    for result in results:
        ...

will loop over just one div: the first child div of <div id="content" class="clear">.

What you want instead is to loop over each <dl class="clear">...</dl> within //*[@id="content"]/div[1] (using //*[@id="content"]/div[@class="content"] may be easier):

    results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')

Second, in each loop iteration you are using absolute XPath expressions (starting with //):

    result.select('//div/dl/dt[contains(text(), "...")]/following-sibling::dd[1]/text()')

This selects all dd elements following a matching dt anywhere in the document, because expressions starting with // are evaluated from the document root node, not from the current result.
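To see this concretely, here is a minimal sketch using lxml directly (the library Scrapy's selectors are built on). The two-entry document and the variable names are made up for illustration; the point is how `//` escapes the element you call it on:

```python
# Why absolute XPath expressions "escape" the element they are called on:
# paths starting with // always search from the document root.
from lxml import html

doc = html.fromstring("""
<div>
  <dl><dt>Phone</dt><dd>111</dd></dl>
  <dl><dt>Phone</dt><dd>222</dd></dl>
</div>
""")

first_dl = doc.xpath('//dl')[0]

# Absolute: matches every matching dd in the whole document, not just this dl.
absolute = first_dl.xpath('//dt[contains(., "Phone")]/following-sibling::dd[1]/text()')

# Relative: matches only within the dl we are iterating over.
relative = first_dl.xpath('dt[contains(., "Phone")]/following-sibling::dd[1]/text()')

print(absolute)  # ['111', '222']
print(relative)  # ['111']
```

This is exactly the bug in the question's spider: every item gets the values from all five `<dl>` blocks instead of just its own.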

See this section in the Scrapy docs for details.

You need to use relative XPath expressions instead, relative to each result, like dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text() or ./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text().

The "practice" field, however, can still use the absolute XPath expression //h1/text(), but you could also set a practice variable once and use it in each WebhealthItem1() instance:

    ...
    practice = hxs.select('//h1/text()').extract()
    for result in results:
        item = WebhealthItem1()
        ...
        item['practice'] = practice

Here is what your spider would look like with these changes:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1


class WebhealthSpider(BaseSpider):

    name = "webhealth_content1"
    download_delay = 5
    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        practice = hxs.select('//h1/text()').extract()
        items1 = []

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
        for result in results:
            item = WebhealthItem1()
            #item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = practice
            item['hours'] = map(unicode.strip,
                result.select('dt[contains(., "Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip,
                result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip,
                result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip,
                result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip,
                result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip,
                result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip,
                result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip,
                result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip,
                result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip,
                result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1
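The same per-<dl> extraction pattern can be sketched outside Scrapy as well. Here is a small Python 3 illustration using lxml directly, run against a trimmed copy of the sample HTML above; the FIELDS mapping and parse_listings name are made up for this sketch:

```python
# Per-<dl> extraction with relative XPath expressions, sketched with lxml
# (no Scrapy). Each <dl> becomes one dict, so values land on separate rows.
from lxml import html

# Trimmed copy of the sample HTML from the question.
SAMPLE = """
<div id="provider-region-addresses">
  <h2 class="toggler nohide">Auckland</h2>
  <dl class="clear">
    <dt>Physical address</dt><dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd>
    <dt>Phone</dt><dd>(09) 486 8996</dd>
  </dl>
  <dl class="clear">
    <dt>Physical address</dt><dd>Helensville</dd>
    <dt>Phone</dt><dd>(09) 420 9450</dd>
  </dl>
</div>
"""

# Hypothetical field-name -> <dt> label mapping for this sketch.
FIELDS = {
    'physical_address': 'Physical address',
    'phone': 'Phone',
}

def parse_listings(tree):
    items = []
    for dl in tree.xpath('//div[@id="provider-region-addresses"]/dl'):
        item = {}
        for key, label in FIELDS.items():
            # Relative expression: scoped to this <dl> only.
            values = dl.xpath(
                'dt[contains(., %r)]/following-sibling::dd[1]/text()' % label)
            item[key] = [v.strip() for v in values]
        items.append(item)
    return items

items = parse_listings(html.fromstring(SAMPLE))
print(items[0]['phone'])  # ['(09) 486 8996']
print(items[1]['phone'])  # ['(09) 420 9450']
```

Because each field lookup starts from the current dl element rather than the root, each dict holds one location's values, which is the row-per-listing shape the question asks for.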

I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960


Small typo in the spider: the third line needs 'from webhealth.items1 import WebhealthItem1' (I can't edit it myself, the minimum edit is 6 characters). The best explanations come with code, many thanks.


Hmm, I tested it with a standard 'items.py'. Fixed it in the answer.


Also realized that 'item['hours'] = map(unicode.strip, result.select('dt[contains(., "Contact hours")]/following-sibling::dd[1]/text()').extract())' needed to be declared as 'hours = hxs.select('//dl/dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text()').extract()' and then 'item['hours'] = hours'. Slowly getting the hang of this.