2015-10-04 39 views

Scrapy web crawler returns an invalid path error

I'm new to Scrapy and am following the basic documentation.

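I have a website from which I'm trying to scrape some links and then navigate to those links. Specifically, I want to get Cokelore, College, and Computers, and I'm using the code below: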

import scrapy 

class DmozSpider(scrapy.Spider): 
    name = "snopes" 
    allowed_domains = ["snopes.com"] 
    start_urls = [ 
      "http://www.snopes.com/info/whatsnew.asp" 
    ] 

    def parse(self, response): 
      print response.xpath('//div[@class="navHeader"]/ul/') 
      filename = response.url.split("/")[-2] + '.html' 
      with open(filename, 'wb') as f: 
        f.write(response.body) 

This is the error I get:

2015-10-03 23:17:29 [scrapy] INFO: Enabled item pipelines: 
2015-10-03 23:17:29 [scrapy] INFO: Spider opened 
2015-10-03 23:17:29 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-10-03 23:17:29 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-10-03 23:17:30 [scrapy] DEBUG: Crawled (200) <GET http://www.snopes.com/info/whatsnew.asp> (referer: None) 
2015-10-03 23:17:30 [scrapy] ERROR: Spider error processing <GET http://www.snopes.com/info/whatsnew.asp> (referer: None) 
Traceback (most recent call last): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "/Users/Gaby/Documents/Code/School/689/tutorial/tutorial/spiders/dmoz_spider.py", line 11, in parse 
    print response.xpath('//div[@class="navHeader"]/ul/') 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/response/text.py", line 109, in xpath 
    return self.selector.xpath(query) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/selector/unified.py", line 100, in xpath 
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape")) 
ValueError: Invalid XPath: //div[@class="navHeader"]/ul/ 
2015-10-03 23:17:30 [scrapy] INFO: Closing spider (finished) 
2015-10-03 23:17:30 [scrapy] INFO: Dumping Scrapy stats: 

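I think the error must have to do with the /ul in my xpath() call, but I can't figure out why. //div[@class="navHeader"] works fine by itself, and it starts to break as soon as I add anything after it.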

The part of the website I want to scrape is structured like this:

<DIV CLASS="navHeader">CATEGORIES:</DIV> 
    <UL> 
     <LI><A HREF="/autos/autos.asp">Autos</A></LI> 
     <LI><A HREF="/business/business.asp">Business</A></LI> 
     <LI><A HREF="/cokelore/cokelore.asp">Cokelore</A></LI> 
     <LI><A HREF="/college/college.asp">College</A></LI> 
     <LI><A HREF="/computer/computer.asp">Computers</A></LI> 
    </UL> 
<DIV CLASS="navSpacer"> &nbsp; </DIV> 
    <UL> 
     <LI><A HREF="/crime/crime.asp">Crime</A></LI> 
     <LI><A HREF="/critters/critters.asp">Critter Country</A></LI> 
     <LI><A HREF="/disney/disney.asp">Disney</A></LI> 
     <LI><A HREF="/embarrass/embarrass.asp">Embarrassments</A></LI> 
     <LI><A HREF="/photos/photos.asp">Fauxtography</A></LI> 
    </UL> 

Answer

You just need to remove the trailing /. Replace:

//div[@class="navHeader"]/ul/ 

with:

//div[@class="navHeader"]/ul 
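
The reason is that in XPath a / separates steps, so it has to be followed by another step; an expression that ends in / is incomplete. As a rough sketch, assuming lxml is installed (Scrapy's selectors are built on it) and using Python 3 syntax, you can check an expression before putting it in a spider. The helper name is_valid_xpath is made up for this illustration and is not part of Scrapy or lxml:

```python
from lxml import etree


def is_valid_xpath(expression):
    """Return True if lxml can compile the XPath expression, False otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        etree.XPath(expression)  # compiling raises on a syntax error
    except etree.XPathError:
        return False
    return True


# The expression from the question ends in "/", so it does not compile.
broken = '//div[@class="navHeader"]/ul/'
# Dropping the trailing slash leaves a complete expression.
fixed = '//div[@class="navHeader"]/ul'

print(broken, is_valid_xpath(broken))
print(fixed, is_valid_xpath(fixed))
```

Scrapy reports the same underlying problem as the ValueError: Invalid XPath shown in the traceback.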

Note that this XPath would actually match nothing on this page. The ul element is a sibling of the navigation header, not its child, so use following-sibling:

In [1]: response.xpath('//div[@class="navHeader"]/following-sibling::ul//li/a/text()').extract() 
Out[1]: 
[u'Autos', 
u'Business', 
u'Cokelore', 
u'College', 
# ... 
u'Weddings'] 
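
To narrow the result down to the three categories named in the question, one option is to filter the matched links by their text. The following is only a sketch under a few assumptions: it uses lxml directly (Python 3) on a shortened copy of the HTML from the question instead of a live Scrapy response, and the names PAGE, WANTED, child_ul_count and wanted_links are invented for the example.

```python
from urllib.parse import urljoin

from lxml import html

# Shortened copy of the markup shown in the question.
PAGE = """
<DIV CLASS="navHeader">CATEGORIES:</DIV>
    <UL>
     <LI><A HREF="/autos/autos.asp">Autos</A></LI>
     <LI><A HREF="/business/business.asp">Business</A></LI>
     <LI><A HREF="/cokelore/cokelore.asp">Cokelore</A></LI>
     <LI><A HREF="/college/college.asp">College</A></LI>
     <LI><A HREF="/computer/computer.asp">Computers</A></LI>
    </UL>
<DIV CLASS="navSpacer"> &nbsp; </DIV>
    <UL>
     <LI><A HREF="/crime/crime.asp">Crime</A></LI>
    </UL>
"""

# Category names taken from the question.
WANTED = {"Cokelore", "College", "Computers"}
START_URL = "http://www.snopes.com/info/whatsnew.asp"


def child_ul_count(page_source):
    """Count ul elements that are children of the navHeader div."""
    doc = html.document_fromstring(page_source)
    return len(doc.xpath('//div[@class="navHeader"]/ul'))


def wanted_links(page_source, base_url=START_URL):
    """Return (text, absolute URL) pairs for the wanted categories."""
    doc = html.document_fromstring(page_source)
    anchors = doc.xpath(
        '//div[@class="navHeader"]/following-sibling::ul//li/a'
    )
    pairs = []
    for anchor in anchors:
        text = anchor.text_content().strip()
        if text in WANTED:
            pairs.append((text, urljoin(base_url, anchor.get("href"))))
    return pairs


print(child_ul_count(PAGE))  # the child axis finds no ul under the div
for name, url in wanted_links(PAGE):
    print(name, url)
```

Inside a spider, the same XPath could be passed to response.xpath(), and each resulting URL could then be requested to navigate to the category page.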

In the code I showed, isn't the 'ul' element a child of the 'navHeader' class div? – Rafa

@ralphie9224 No. Look closely: the 'div' is closed before the 'ul' starts. It's the indentation that is confusing. – alecxe