2015-06-27 43 views
1

Scrapy and XPath problem with nested XPaths

I am trying to scrape Amazon products with Scrapy, starting from a random category page and using this XPath:

products = Selector(response).xpath('//div[@class="s-item-container"]') 
for product in products: 
    item = AmzItem() 
    item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0] 
    item['url'] = product.xpath('//a[@class="s-access-detail-page"]/@href').extract()[0] 
    yield item 

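'//div[@class="s-item-container"]' returns the divs that hold the products on a category page, which is correct.

Now, how do I get the link of each product?

As I understand it, // matches anywhere in the document, and a with @class should select the element with the right class. But I get: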

item['title'] = product.xpath('//a[@class="s-access-detail-page"]/@title').extract()[0] exceptions.IndexError: list index out of range

So the list matching this XPath must be empty, but I don't understand why.

Edit:
The HTML looks like this:

<div class="s-item-container" style="height: 343px;"> 
<div class="a-row a-spacing-base"> 
    <div class="a-column a-span12 a-text-left"> 
     <div class="a-section a-spacing-none a-inline-block s-position-relative"> 
      <a class="a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B0105S434A"><img alt="Product Details" src="http://ecx.images-amazon.com/images/I/41%2BzrAY74UL._AA160_.jpg" onload="viewCompleteImageLoaded(this, new Date().getTime(), 24, false);" class="s-access-image cfMarker" height="160" width="160"></a> 
      <div class="a-section a-spacing-none a-text-center"> 
       <div class="a-row a-spacing-top-mini"> 
        <a class="a-size-mini a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B0105S434A"> 
         <div class="a-box"> 
          <div class="a-box-inner a-padding-mini"><span class="a-color-secondary">See more choices</span></div> 
         </div> 
        </a> 
       </div> 
      </div> 
     </div> 
    </div> 
</div> 
<div class="a-row a-spacing-mini"> 
    <div class="a-row a-spacing-none"> 
     <a class="a-link-normal s-access-detail-page a-text-normal" title="Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)" href="http://rads.stackoverflow.com/amzn/click/B0105S434A"> 
      <h2 class="a-size-base a-color-null s-inline s-access-title a-text-normal">Harry Potter Gryffindor School Fancy Robe Cloak Costume And Tie (Size S)</h2> 
     </a> 
    </div> 
    <div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary">by </span><span class="a-size-small a-color-secondary">Legend</span></div> 
</div> 
<div class="a-row a-spacing-mini"> 
    <div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B0105S434A"><span class="a-size-base a-color-price a-text-bold">$28.99</span><span class="a-letter-space"></span>new<span class="a-letter-space"></span><span class="a-color-secondary">(1 offer)</span><span class="a-letter-space"></span><span class="a-color-secondary a-text-strike"></span></a></div> 
</div> 
<div class="a-row a-spacing-none"><span name="B0105S434A"> 
    <span class="a-declarative" data-action="a-popover" data-a-popover="{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot;/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&amp;asin=B0105S434A&amp;contextId=search&amp;ref=acr_search__popover&quot;}"><a href="javascript:void(0)" class="a-popover-trigger a-declarative"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">3.9 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></a></span></span> 
    <a class="a-size-small a-link-normal a-text-normal" href="http://rads.stackoverflow.com/amzn/click/B0105S434A">48</a> 
</div> 
</div> 
Please post a snippet of the relevant HTML. – unutbu

Answers

1

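//a[@class="s-access-detail-page"] requires the attribute to be exactly class="s-access-detail-page", because XPath compares the attribute as a plain string, not by meaning :) If the element has several classes, use the contains function: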

//a[contains(concat(' ', @class, ' '), " s-access-detail-page ")]/@title 
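To illustrate the difference, here is a minimal sketch that compares the exact match, the token match above, and a plain contains() on a trimmed-down snippet. It uses lxml directly rather than Scrapy's Selector (an assumption made only to keep the example standalone; Scrapy's selectors are built on lxml). The markup, the titles, and the second link with the class "s-access-detail-page-old" are made up for the demonstration.

```python
from lxml import html

# Hypothetical, shortened markup modelled on the HTML in the question.
SNIPPET = """
<div class="s-item-container">
  <a class="a-link-normal s-access-detail-page a-text-normal"
     title="Harry Potter Robe" href="/p/1">Robe</a>
  <a class="s-access-detail-page-old" title="Old layout" href="/p/2">Old</a>
</div>
"""

tree = html.fromstring(SNIPPET)

# Exact string comparison: no class attribute equals this string, so no match.
exact = tree.xpath('//a[@class="s-access-detail-page"]/@title')

# Token match: pad the attribute with spaces and look for the padded class name.
token = tree.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), '
    '" s-access-detail-page ")]/@title'
)

# Plain substring match: simpler, but also matches longer class names.
loose = tree.xpath('//a[contains(@class, "s-access-detail-page")]/@title')

print(exact)  # []
print(token)  # ['Harry Potter Robe']
print(loose)  # ['Harry Potter Robe', 'Old layout']
```

Writing the Python string with single quotes outside and double quotes inside (or the reverse) keeps the XPath quoting intact, which is one common source of an invalid-XPath error.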
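
I had to remove the concat part, otherwise I only got an 'exceptions.ValueError: Invalid XPath', but now it seems to be working. There is one more problem, and I'm not sure whether it comes from XPath or from something else. I'll keep digging. – Chris

If you did the quoting carefully, it may be a problem with the XPath implementation :( – splash58

Now it works for several fields. Thanks for pointing me in the right direction with contains. – Chris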

2

It should be:

# ------------- The dot makes the query relative to product 
product.xpath('.//a[@class="s-access-detail-page"]/@title') 
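
The effect of the leading dot can be checked on a small page with two products. This is only a sketch using lxml instead of a Scrapy response, and the markup, titles, and URLs are invented for the example. Note that the made-up links carry a single class, so the exact @class comparison works here; on the HTML from the question the class test would also need contains.

```python
from lxml import html

# Hypothetical page with two product containers.
PAGE = """
<html><body>
  <div class="s-item-container">
    <a class="s-access-detail-page" title="First" href="/p/1">First</a>
  </div>
  <div class="s-item-container">
    <a class="s-access-detail-page" title="Second" href="/p/2">Second</a>
  </div>
</body></html>
"""

tree = html.fromstring(PAGE)
products = tree.xpath('//div[@class="s-item-container"]')

# '//' restarts from the document root, so every product sees the first link.
absolute = [p.xpath('//a[@class="s-access-detail-page"]/@title')[0]
            for p in products]

# './/' searches only the descendants of the current product node.
relative = [p.xpath('.//a[@class="s-access-detail-page"]/@title')[0]
            for p in products]

print(absolute)  # ['First', 'First']
print(relative)  # ['First', 'Second']
```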

No, I still get an empty list with this version. But I have added my HTML; maybe that helps? – Chris

OK, let me check. – hek2mgl

`a[@class="s-access-detail-page"]` is not a child of `div[@class="s-item-container"]`. Isn't that obvious? – hek2mgl