Scrapy: parsing list items onto separate rows
I tried to adapt the answer to this question to my problem, but without success.
Here is some example HTML:
<div id="provider-region-addresses">
<h3>Contact details</h3>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>North Shore Hospital</dd><dt>Physical address</dt>
<dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt>
<dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt>
<dd>0740</dd><dt>District/town</dt>
<dd>
North Shore, Takapuna</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 486 8996</dd><dt>Fax</dt>
<dd>(09) 486 8342</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>Physical address</dt>
<dd>Helensville</dd><dt>Postal address</dt>
<dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt>
<dd>0840</dd><dt>District/town</dt>
<dd>
Rodney, Helensville</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 420 9450</dd><dt>Fax</dt>
<dd>(09) 420 7050</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>Physical address</dt>
<dd>Warkworth</dd><dt>Postal address</dt>
<dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt>
<dd>0941</dd><dt>District/town</dt>
<dd>
Rodney, Warkworth</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 422 2700</dd><dt>Fax</dt>
<dd>(09) 422 2709</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>Waitakere Hospital</dd><dt>Physical address</dt>
<dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt>
<dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt>
<dd>0650</dd><dt>District/town</dt>
<dd>
Waitakere, Henderson</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 839 0000</dd><dt>Fax</dt>
<dd>(09) 837 6634</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt>
<dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt>
<dd>0932</dd><dt>District/town</dt>
<dd>
Rodney, Red Beach</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 427 0300</dd><dt>Fax</dt>
<dd>(09) 427 0391</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
</div>
Here is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):
    name = "webhealth_content1"
    download_delay = 5
    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        results = hxs.select('//*[@id="content"]/div[1]')
        items1 = []
        for result in results:
            item = WebhealthItem1()
            item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = result.select('//h1/text()').extract()
            item['hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1
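From here, how do I parse the list items onto separate rows, with the corresponding //h1/text() value in the name field? Currently I get the whole list for each XPath in a single cell. Does this have something to do with the way I am declaring the XPaths?
Thanks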
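Small typo in the spider: the third line needs 'webhealth.items1 import WebhealthItem1' (I can't edit it myself, the minimum edit is 6 characters). The best explanation comes with code, much appreciated.
Hmm, I used the standard 'items.py' to test it. Fixed it in the answer.
Also realised that
item['hours'] = map(unicode.strip, result.select('dt[contains(., "Contact hours")]/following-sibling::dd[1]/text()').extract())
had to be declared as
hours = hxs.select('//dl/dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text()').extract()
and then
item['hours'] = hours
Slowly getting the hang of this.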
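For what it's worth, here is a minimal sketch of the "one row per dl" idea, run against a trimmed copy of the HTML from the question. It is an illustration under assumptions, not the spider itself: it uses the standard library's xml.etree.ElementTree instead of Scrapy so it can run on its own, and SAMPLE and rows_from_html are made-up names. In Scrapy I would expect the equivalent to be a loop over the dl selectors with relative XPaths (starting with './' rather than '//'), because an XPath that starts with '//' searches the whole document even when it is called on a sub-selector, which is why every field comes back as one long list.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of two <dl> blocks from the question's HTML.
SAMPLE = """
<div id="provider-region-addresses">
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>North Shore Hospital</dd><dt>Physical address</dt>
<dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>District/town</dt>
<dd>
North Shore, Takapuna</dd><dt>Phone</dt>
<dd>(09) 486 8996</dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>Physical address</dt>
<dd>Helensville</dd><dt>District/town</dt>
<dd>
Rodney, Helensville</dd><dt>Phone</dt>
<dd>(09) 420 9450</dd>
</dl>
</div>
"""


def rows_from_html(markup):
    """Return one dict per <dl>, mapping each <dt> label to its <dd> text."""
    root = ET.fromstring(markup.strip())
    rows = []
    for dl in root.iter("dl"):
        row = {}
        label = None
        # Walk only the direct children of this <dl>, so values from
        # other address blocks can never leak into this row.
        for child in dl:
            text = "".join(child.itertext()).strip()
            if child.tag == "dt":
                label = text
            elif child.tag == "dd" and label is not None:
                row[label] = text
                label = None
        rows.append(row)
    return rows


rows = rows_from_html(SAMPLE)
for row in rows:
    print(row)
```

Each dict corresponds to one address block, so a field missing from a block (such as "More information" in the second one) is simply absent from that row instead of shifting the other values.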