Scrapy: trying to iterate over nested items, but not getting the desired output

I am very new to Python and Scrapy. When I try to iterate over nested HTML elements, I do not get the result I want.

Below is the HTML I am trying to scrape.

<div class="level1" role="main"> 
<div class="level2"> 
    <h1 id="fullStoreHeading" class="class_h1">Page Title</h1> 
    <div class="fsdColumn_3"> 
     <div class='fsdDeptBox'> 
      <img alt="" src="" aria-hidden="true" height="100%" width="100%"> 
      <h2 class="fsdDeptTitle">TV</h2> 
      <div class='fsdDeptCol'> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1001">Samsung</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1002">Vizio</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1003">Element</a> 
      </div> 
     </div> 
     <div class='fsdDeptBox'> 
      <img alt="" src="" aria-hidden="true" height="100%" width="100%"> 
      <h2 class="fsdDeptTitle">Laptop</h2> 
      <div class='fsdDeptCol'> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1004">Apple</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1005">Microsoft</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1006">Dell</a> 
      </div> 
     </div> 
    </div> 


    <div class="fsdColumn_3"> 
     <div class='fsdDeptBox'> 
      <img alt="" src="" aria-hidden="true" height="100%" width="100%"> 
      <h2 class="fsdDeptTitle">Video Game Console</h2> 
      <div class='fsdDeptCol'> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1007">Xbox One</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1008">Xbox 360</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1009">PS 5</a> 
      </div> 
     </div> 
     <div class='fsdDeptBox'> 
      <img alt="" src="" aria-hidden="true" height="100%" width="100%"> 
      <h2 class="fsdDeptTitle">SSD</h2> 
      <div class='fsdDeptCol'> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1010">Samsung Evo</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1011">Crucial</a> 
       <a class="class_a" href="/test?_encoding=UTF8&id=1012">Sandisk</a> 
      </div> 
     </div> 
    </div> 
</div>

The output I want to generate from the HTML above is a list:

Product Category -> Brand -> ID

For example:

TV
    Samsung 1001
    Vizio 1002
    Element 1003

Laptop
    Apple 1004
    Microsoft 1005
    Dell 1006

Video Game Console
    Xbox One 1007
    Xbox 360 1008
    PS 5 1009

ProductCategories.py

def parse(self, response):
    l = ItemLoader(item=ProductSpiderItem(), response=response)

    titles = response.xpath('//*[@class="fsdDeptTitle"]')

    for title in titles:
        Product_Category = title.xpath('text()').extract()
        l.add_value('Product_Category', Product_Category)

        for brnd in title.xpath('//*[@class="fsdDeptCol"]/a[@class="class_a"]'):
            Brand = brnd.xpath('text()').extract()
            l.add_value('Brand', Brand)

    return l.load_item()

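At the moment the outer for loop prints every product category once, while the inner for loop prints all brands, regardless of product category, on every pass of the outer loop.

I would really appreciate any help in solving this problem.

Thanks a lot.

Source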

2017-10-17 Raj

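Your first 'for' loop walks through the <h2 class="fsdDeptTitle">SSD</h2> parts of the HTML. Inside it, you then try to find class=class_a. That cannot work, because the first 'for' loop is too specific: it does not select the part of the HTML where 'class_a' lives.

You can fix this by having the 'for' loop start one level higher in the HTML.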

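# Inside parse(), with the loader l created as in the question.
titles = response.xpath("//*[@class='fsdDeptBox']")
for title in titles:
    # The category name is the text of the <h2> inside each box.
    Product_Category = title.xpath('h2[@class="fsdDeptTitle"]/text()').extract()
    l.add_value('Product_Category', Product_Category)

    # Relative path: only the brand links inside this box.
    for brnd in title.xpath('div[@class="fsdDeptCol"]'):
        Brand = brnd.xpath('*/text()').extract()
        l.add_value('Brand', Brand)

# Return once, after every box has been processed.
return l.load_item()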

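I changed the first 'for' loop to select enough of the HTML to include the path to the 'class_a' text.

Side note: I don't know the correct HTML terminology very well, but I hope this still makes sense.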

Source

2017-10-17 14:35:14 SuperStickman22

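Sorry, I forgot to put the * in when writing this post. I have edited the post. Following your suggestion, I tried '*[@class="fsdDeptCol"]/a[@class="class_a"]'... but it did not return any brand records. – Raj

I edited my answer to address the question better. I saved the HTML as a file on my computer and opened it with scrapy shell. I was able to get the desired output. – SuperStickman22

Yes.. it works now.. thanks a lot! – Raj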

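I think you should take a closer look at how ItemLoaders work. The result also depends on how your items and item loaders are defined. For example, let's assume you have defined them like this: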

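# On older Scrapy releases, import TakeFirst from scrapy.loader.processors.
from itemloaders.processors import TakeFirst
from scrapy import Field, Item
from scrapy.loader import ItemLoader

class ProductItem(Item):
    category = Field()
    brand = Field()
class ProductItemLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()  # keep the first non-empty value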

then you can do something like this:

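# Inside parse(): build one item per brand link.
for product in response.css('.fsdDeptCol a'):
    il = ProductItemLoader(selector=product)
    # Walk up from the link, then back to the <h2> that precedes its column.
    il.add_xpath('category', './ancestor::*/preceding-sibling::h2/text()')
    il.add_xpath('brand', './text()')
    yield il.load_item()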
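The loader definition above relies on TakeFirst as its output processor. As a rough illustration of what that processor does, here is a plain-Python stand-in; take_first is a hypothetical helper written for this note, not Scrapy's own code. It mimics the documented behaviour of returning the first collected value that is neither None nor an empty string, which is why each yielded item holds a single category and brand rather than a list.

```python
def take_first(values):
    """Plain-Python stand-in: return the first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# A loader collects every matched value in a list; the output processor
# then reduces that list to what is stored on the item.
collected = ["", "TV", "Laptop"]
print(take_first(collected))
```

Without an output processor, the loader would store the whole list of collected values for the field.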

Source

2017-10-18 19:48:20 Wilfredo

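Thanks a lot... is there a best way to extract the 'id' from the href value? I am able to extract the whole string from the href value. – Raj

Sure. Assuming you have an 'id' field (and the correct imports from w3lib and Scrapy's processors), just add something like il.add_xpath('id', './@href', MapCompose(lambda x: url_query_parameter(x, 'id'))) – Wilfredo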
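The comment above uses url_query_parameter from w3lib. For readers who want to see what that step amounts to, here is a dependency-free sketch of the same extraction with the standard library's urllib.parse instead of w3lib; extract_id is a hypothetical helper name, and the href is taken from the question's HTML.

```python
from urllib.parse import parse_qs, urlsplit


def extract_id(href, name="id"):
    """Return the first value of query parameter `name`, or None if absent."""
    values = parse_qs(urlsplit(href).query).get(name)
    return values[0] if values else None


print(extract_id("/test?_encoding=UTF8&id=1001"))  # prints 1001
```

A function like this could presumably be handed to MapCompose in place of the lambda, since MapCompose accepts ordinary callables.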

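Thanks again. I wish I could give you another upvote. Please recommend any good articles or books where I can learn advanced techniques such as MapCompose. – Raj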
