Scrapy response.xpath only picks out the first item

I have the following HTML structure:

<div class="column first"> 
    <div class="detail"> 
     <strong>Phone: </strong> 
     <span class="value"> 012-345-6789</span> 
    </div> 
    <div class="detail"> 
     <span class="value">1 Street Address, Big Road, City, Country</span> 
    </div> 
    <div class="detail"> 
     <h3 class="inline">Area:</h3> 
     <span class="value">Georgetown</span> 
    </div> 
    <div class="detail"> 
     <h3 class="inline">Nearest Train:</h3> 
     <span class="value">Georgetown Station</span> 
    </div> 
    <div class="detail"> 
     <h3 class="inline">Website:</h3> 
     <span class="value"><a href='http://www.website.com' target='_blank'>www.website.com</a></span> 
    </div> 
    </div>

When I run sel = response.xpath('//span[@class="value"]/text()') in the scrapy shell, I get back what I expect, which is:

[<Selector xpath='//span[@class="value"]/text()' data=u' 012-345-6789'>, <Selector xpath='//span[@class="value"]/text()' data=u'1 Street Address, Big Road, City, Country'>, <Selector xpath='//span[@class="value"]/text()' data=u'Georgetown Station'>, <Selector xpath='//span[@class="value"]/text()' data=u' '>, <Selector xpath='//span[@class="value"]/text()' data=u'January, 2016'>]

However, in the parse method of my Scrapy spider, it only returns the first item:

def parse(self, response):
    def extract_with_xpath(query):
        return response.xpath(query).extract_first().strip()

    yield {
        'details': extract_with_xpath('//span[@class="value"]/text()')
    }

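I realize I'm using extract_first(), but if I use extract() instead it breaks, even though I know extract() is a legitimate function.

What am I doing wrong? Do I need to iterate over the extract_with_xpath('//span[@class="value"]/text()') part?

Thanks for any enlightenment!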
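A likely reason extract() "breaks" here is that it returns a list of strings, while extract_first() returns a single string, and only strings have .strip(). The following is a minimal sketch in plain Python, assuming the matched values are the ones shown in the shell output above; the names values, first, error and cleaned are illustrative only and do not come from the original spider.

```python
# Values as extract() might return them (copied from the shell output above).
values = [
    ' 012-345-6789',
    '1 Street Address, Big Road, City, Country',
    'Georgetown Station',
    ' ',
    'January, 2016',
]

# extract_first() yields a single string, so .strip() works on it.
first = values[0].strip()

# extract() yields a list, and a list has no .strip() method.
try:
    values.strip()
    error = None
except AttributeError as exc:
    error = str(exc)

# Strip each element instead, dropping entries that are only whitespace.
cleaned = [v.strip() for v in values if v.strip()]

print(first)
print(error)
print(cleaned)
```

Stripping inside a list comprehension keeps every match while still removing the stray whitespace that .strip() was meant to handle.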

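2016-10-01 matski

I feel like I'm missing something. I don't see any div tag with a class attribute named "title". What are you trying to extract from your HTML document? – Mangohero1

The 'title' part extracts fine. I've removed it from the code to avoid further confusion. Thanks for pointing that out. – matski

@Drew Davis I'm trying to extract all the text from the <span class="value"> elements, but currently my scraper only pulls the first one. – matski

In items.py, specify: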

from scrapy.item import Item, Field 

class yourProjectNameItem(Item): 
    # define the fields for your item here like: 
    name = Field() 
    details = Field()

In the Scrapy spider, the imports are:

from scrapy.linkextractors import LinkExtractor  # replaces the deprecated SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule  # formerly scrapy.contrib.spiders
from yourProjectName.items import yourProjectNameItem
import re

and the parse function is as follows:

def parse_item(self, response):
    # response.xpath() replaces the deprecated HtmlXPathSelector(response).select()
    i = yourProjectNameItem()

    # extract() returns a list of every match, not just the first one
    i['name'] = response.xpath('YourXPathHere').extract()
    i['details'] = response.xpath('YourXPathHere').extract()

    return i

Hope this solves the problem. You can refer to my project on git: https://github.com/omkar-dsd/SRMSE/tree/master/Scrapers/NasaScraper

2016-10-01 05:30:25

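Thanks for your answer. This would mean completely rewriting my code. Most of what I've done so far works; my only problem is iterating over some of the tags in one of the responses. – matski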
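If the goal is only to iterate over every matched tag without rewriting the spider, one possible approach is to keep the original parse method and change just the helper. This is a sketch, not a tested fix for the real site: extract_all_with_xpath is a hypothetical name, and parse is written here as a plain function so the snippet stands alone. In a real spider it would be a method of the spider class.

```python
def extract_all_with_xpath(response, query):
    """Return every non-empty match for `query`, stripped of whitespace."""
    # extract() returns a list of strings, so strip each one individually.
    return [
        text.strip()
        for text in response.xpath(query).extract()
        if text.strip()
    ]


def parse(self, response):
    # Same item shape as the original spider, but 'details' is now a list
    # holding all matched values instead of only the first one.
    yield {
        'details': extract_all_with_xpath(
            response, '//span[@class="value"]/text()'
        ),
    }
```

Passing response in explicitly, rather than closing over it as the original nested helper did, makes the helper easy to exercise with a stand-in response object.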
