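Scraping an attribute value using XPath?
I've just started using XPath for HTML scraping, so I'm a bit confused about the syntax. I'm trying to extract the URL from the following snippet of the page source: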
<a href="/realestateandhomes-detail/15645-SW-74th-Circle-Dr-Apt-5_Miami_FL_33193_M69309-37779">
<img alt="15645 Sw 74th Circle Dr Apt 5, Miami, FL 33193" title="15645 Sw 74th Circle Dr Apt 5, Miami, FL 33193" class="js-srp-listing-photos" itemprop="image" data-src="https://ap.rdcpix.com/1980533383/49e7a93da461352c04b8e7146a8d2ceel-m0xd-w480_h480_q80.jpg" data-omtag="srp-listMap:result:photo" src="https://ap.rdcpix.com/1980533383/49e7a93da461352c04b8e7146a8d2ceel-m0xd-w480_h480_q80.jpg" />
</a>
The HTML path to the link is as follows:
<body>
<li>
<div>
<a></a>
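For reference, the path above can be walked with nothing but Python's standard library. This is only a sketch and not Scrapy code: `xml.etree.ElementTree` supports a limited XPath subset, so the attribute is read with `.get("href")` instead of a trailing `/@href`. The names `SNIPPET` and `extract_links` are hypothetical, and the markup is a trimmed, well-formed stand-in for the page (the `<img>` attributes are shortened).

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the page structure (assumption: the real
# page nests the link as body > li > div > a, as described above).
SNIPPET = """
<body>
  <li>
    <div>
      <a href="/realestateandhomes-detail/15645-SW-74th-Circle-Dr-Apt-5_Miami_FL_33193_M69309-37779">
        <img alt="15645 Sw 74th Circle Dr Apt 5, Miami, FL 33193" />
      </a>
    </div>
  </li>
</body>
"""


def extract_links(markup):
    """Return the href of every <a> reached through li/div/a in the tree."""
    root = ET.fromstring(markup.strip())
    # ElementTree cannot select attribute nodes, so select the <a> elements
    # and read the attribute from each one.
    return [a.get("href") for a in root.findall(".//li/div/a")]


print(extract_links(SNIPPET))
```

In Scrapy itself the equivalent query would normally end in `/@href`, since its selectors support full XPath 1.0.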
I'm using Scrapy to parse the HTML page, and this is my code so far:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from realtor.items import RealtorItem

class RealtorSpider(BaseSpider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = [
        "http://www.realtor.com/realestateandhomes-search/Miami_FL"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//li/div/a/@href')
        items = []
        for site in sites:
            item = RealtorItem()
            item['link'] = site.select('div/a/@href').extract()
            items.append(item)
        return items
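When I run it, it returns an error at line 16 of the code, on the item['link'] = site.select(...).extract() line. I'm not sure whether my syntax is correct or whether I'm missing some other underlying problem.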
Error:
KeyError: 'RealtorItem does not support field: link'
My items.py code is as follows:
from scrapy.item import Item, Field

class RealtorItem(Item):
    link = scrapy.Field()
Which version of Scrapy are you using?
It is Scrapy v1.4.0.