XPath error when using "\d" to extract data with Scrapy for Python 2

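I am trying to extract data with Scrapy for Python 2. I now realize that I cannot use regex commands such as \d inside the div XPath of my extraction. How do I work around this? With \d{2} I was trying to tell Python "hey, there should be a number here with a value between 1 and 100". Thanks in advance.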

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 
from craigslist_sample.items import CraigslistSampleItem 
import re 

class MySpider(CrawlSpider):
    name = "craigs"  # add the 's' to make functional = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://philadelphia.craigslist.org/cta/"]

    # Follow the paginated index pages and pass each one to parse_items.
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(r"index\d\d\d{,3}\.html",),
                restrict_xpaths=('//*[@id="toc_rows"]/div[3]/div/div/span/a',),
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"] | //span[@class="12"]')
        items = []

        for title in titles:
            item = CraigslistSampleItem()
            # These two paths start with //, so they search the whole page, not just the current title.
            item["price"] = title.select('//*[@id="toc_rows"]/div[2]/p[position() <= 100]/span[3]/span[1]/text()').extract()
            item["date"] = title.select('//*[@id="toc_rows"]/div[2]/p[position() <= 100]/span[2]/span/text()').extract()
            # These two paths are relative to the current title node.
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

The HTML snippet from the URL looks like this:

item["date"] = <span class="date">Jan 12</span>

item["price"] = <span class="price">$1950</span>

Both live under this ancestor node: <div id="toc_rows">


2014-01-12 skellyboy

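I assume p[\d{,2}] is meant to say "the first two <p> elements".

That is done with position(): p[position() <= 2]. (Hint: position() counts from 1.)

Note that position() counting is context-sensitive. If you select p elements, it counts those p elements, not the number of elements that come before them.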

<div> 
    <p>First paragraph</p>  <!-- div/p[1] or div/p[position() = 1] --> 
    <div>Something else</div> <!-- div/div[1] or div/div[position() = 1] --> 
    <p>Second paragraph</p> <!-- div/p[2] or div/p[position() = 2] --> 

    <!-- div/p[position() <= 2] will select both <p> here --> 
</div>
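The counting behaviour in the snippet above can be checked with a small sketch. To stay runnable without Scrapy, it uses the standard library's xml.etree.ElementTree, which is an assumption of this example and not what the question uses. ElementTree supports only a subset of XPath: numeric index predicates such as p[2] work, but the position() function does not, so the p[position() <= 2] case is approximated here with a list slice.

```python
import xml.etree.ElementTree as ET

# Same structure as the HTML snippet above; the outer <div> is the root.
root = ET.fromstring("""
<div>
    <p>First paragraph</p>
    <div>Something else</div>
    <p>Second paragraph</p>
</div>
""")

# Index predicates count only siblings with the same tag name,
# so the <div> between the two paragraphs does not shift p[2].
first_p = root.find("p[1]")
second_p = root.find("p[2]")
first_div = root.find("div[1]")

# Rough stand-in for p[position() <= 2]: take the first two <p> children.
first_two = [p.text for p in root.findall("p")[:2]]

print(first_p.text, "|", second_p.text, "|", first_div.text)
print(first_two)
```

In Scrapy itself the selector is backed by a full XPath 1.0 engine, so p[position() <= 2] can be passed directly as the expression.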

EDIT (after the question was modified). Here is what I would do:

First, select all rows: "//div[@id = 'toc_rows']//div[@class = 'row']"
Then, for each row, select...
- Price: "./span[@class = 'price']/text()"
- Date: "./span[@class = 'date']/text()"
- Title: "./span[@class = 'pl']/a/text()"
- Link: "./span[@class = 'pl']/a/@href"
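The "rows first, then relative paths per row" idea can be sketched as follows. The markup in SAMPLE and the function name extract_rows are hypothetical, made up for illustration and not taken from the real Craigslist page. The sketch again uses the standard library's xml.etree.ElementTree so it runs without Scrapy; because ElementTree has no text() or @attr steps, it reads .text and .get("href") instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing page (assumed structure).
SAMPLE = """
<html><body>
  <div id="toc_rows">
    <div class="row">
      <span class="date">Jan 12</span>
      <span class="pl"><a href="/cto/1.html">First car</a></span>
      <span class="price">$1950</span>
    </div>
    <div class="row">
      <span class="date">Jan 11</span>
      <span class="pl"><a href="/cto/2.html">Second car</a></span>
      <span class="price">$3200</span>
    </div>
  </div>
</body></html>
"""


def extract_rows(markup):
    """Return one dict per row, using paths relative to that row."""
    root = ET.fromstring(markup)
    items = []
    # Step 1: select every row under the toc_rows container.
    for row in root.findall(".//div[@id='toc_rows']//div[@class='row']"):
        # Step 2: each lookup starts at the current row ("./...").
        link = row.find("./span[@class='pl']/a")
        items.append({
            "price": row.findtext("./span[@class='price']"),
            "date": row.findtext("./span[@class='date']"),
            "title": link.text if link is not None else None,
            "link": link.get("href") if link is not None else None,
        })
    return items


items = extract_rows(SAMPLE)
for item in items:
    print(item)
```

In the spider from the question, the same loop would call select() on each row selector with the "./..." expressions listed above, so no position index or regex is needed to walk through rows 1 to 100.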


2014-01-12 21:03:41 Tomalak

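No, 'p' is part of the div string. There are 100 values I need to extract. All the div strings look like p[1] all the way up to p[100]. I intended to tell Python "hey, the number here should be a value between 1 and 100", but the problem is that \d{,} is a regex command coded inside the XPath block. When I run the whole code, XPath gives me an invalid path error. – skellyboy

'p' is part of the div string? Include your HTML in the question (you should have done that from the start anyway). – Tomalak

Oops! New to the community. Thanks for the reply <3. The whole code block has been pasted. It's nothing too fancy. – skellyboy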
