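Scrapy HtmlXPathSelector

I'm just trying out Scrapy and trying to get a basic spider working. I know this is probably just something I'm missing, but I've tried everything I can think of.

The error I get is: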

line 11, in JustASpider 
    sites = hxs.select('//title/text()') 
NameError: name 'hxs' is not defined

My code is very basic at the moment, but I still can't figure out where I'm going wrong. Thanks for your help!

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//title/text()')
        for site in sites:
            print site.extract()


SPIDER = JustASpider()

2012-09-03 Keanan Koppenhaver

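How are you running your spider? With 'scrapy crawl "google.com"'? – Leo

There is nothing wrong with your code (apart from the SPIDER declaration, which is no longer needed); it works for me.

@Leo That is how I have been running it.

I removed the SPIDER call at the end and removed the for loop. There is only one title tag (as one would expect), and the loop seemed to be what was throwing it off. My working code is below: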

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class JustASpider(BaseSpider):
    name = "google.com"
    start_urls = ["http://www.google.com/search?hl=en&q=search"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//title/text()')
        final = titles.extract()

2012-09-10 16:27:07

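Your code works, but it is better to give the spider a simple name such as "google" or "googleSpider" rather than "google.com" – parik

Make sure you are actually running the code you showed us.

Try deleting the *.pyc files in your project.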
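Stale bytecode can be cleared with a few lines of Python instead of by hand. The following is only a minimal sketch, assuming Python 3 and a project directory that you pass in yourself; the helper name `remove_pyc_files` is hypothetical and is not part of Scrapy.

```python
from pathlib import Path


def remove_pyc_files(project_root):
    """Delete every *.pyc file under project_root and return the removed paths."""
    removed = []
    # Collect the matches first so files are not deleted while the tree is being walked.
    for pyc in list(Path(project_root).rglob("*.pyc")):
        pyc.unlink()
        removed.append(pyc)
    return sorted(removed)
```

Calling `remove_pyc_files(".")` from the directory that contains the project would remove the compiled files there, after which the spider can be run again from a clean state.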

2012-09-05 04:47:16 warvariuc

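After deleting all the .pyc files in the folder, I still get the same error. If I were missing a dependency, would I get an import error?

Please check the indentation in your code. Maybe you are mixing tabs and spaces? – warvariuc

I had a similar problem, NameError: name 'hxs' is not defined, and it also came down to spaces versus tabs: my IDE was using spaces instead of tabs, so that is worth checking.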
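One way to check for this is to scan the leading whitespace of every line. The following is a minimal sketch, assuming Python 3; `find_mixed_indentation` is a hypothetical helper name and the sample source is made up for illustration. It reports which lines are indented with tabs and which with spaces, so a file that uses both stands out.

```python
def find_mixed_indentation(source):
    """Return (tab_lines, space_lines): 1-based numbers of the lines whose
    leading whitespace contains a tab or a space, respectively."""
    tab_lines, space_lines = [], []
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped:
            continue  # blank lines carry no indentation information
        indent = line[: len(line) - len(stripped)]
        if "\t" in indent:
            tab_lines.append(number)
        if " " in indent:
            space_lines.append(number)
    return tab_lines, space_lines


# Hypothetical sample: line 2 is indented with a tab, line 3 with four spaces.
sample = "def parse(self, response):\n\thxs = 1\n    sites = hxs\n"
tabs, spaces = find_mixed_indentation(sample)
print("tab-indented lines:", tabs)
print("space-indented lines:", spaces)
if tabs and spaces:
    print("this source mixes tabs and spaces")
```

If both lists are non-empty, the editor is probably inserting a different kind of whitespace than the file already uses. Python 2's `-tt` command-line option and the standard-library `tabnanny` module perform stricter versions of the same check.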

2013-01-23 23:22:51

This works for me:

Save the file as test.py
Run the command scrapy runspider <filename.py>

For example:

scrapy runspider test.py

2013-08-19 15:01:00

The code looks correct.

In the latest versions of Scrapy, HtmlXPathSelector is deprecated. Use Selector:

2014-02-14 05:14:58 dimka665

This is just a demo, but it works. It needs to be customized, of course!

#!/usr/bin/env python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

2014-06-21 19:20:27 user3672836

You should change

from scrapy.selector import HtmlXPathSelector

to

from scrapy.selector import Selector

and use hxs=Selector(response) instead.

2015-04-26 05:38:32 neal

This code looks like it was written for a rather old version. I suggest using this code instead:

from scrapy.spider import Spider
from scrapy.selector import Selector

class JustASpider(Spider):
    name = "googlespider"
    allowed_domains = ["google.com"]
    start_urls = ["http://www.google.com/search?hl=en&q=search"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//title/text()').extract()
        print sites
        #for site in sites: (I don't know why you want to loop to extract the text of the title element)
            #print site.extract()

Hope it helps, and here is a good example.

2015-09-04 06:28:46

I use Scrapy with BeautifulSoup4.0. For me, Soup is easy to read and understand. It is an option if you don't have to use HtmlXPathSelector. Hope this helps!

import scrapy
from bs4 import BeautifulSoup
import Item

def parse(self, response):
    soup = BeautifulSoup(response.body, 'html.parser')
    print 'Current url: %s' % response.url
    item = Item()
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            url = response.urljoin(link.get('href'))
            item['url'] = url
            yield scrapy.Request(url, callback=self.parse)
            yield item

2016-10-11 19:13:57 sarc360
