
Hi, I am working with Scrapy to scrape an XML URL. How do I scrape an XML URL with Scrapy?

Suppose the following is my spider.py code:

from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "test"
    # allowed_domains should be a list, not a set literal
    allowed_domains = ["www.example.com"]

    start_urls = [
        "https://example.com/jobxml.asp",
    ]

    def parse(self, response):
        print response, "??????????????????????"

Result:

2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject) 
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState 
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines: 
2012-07-24 16:43:34+0530 [test] INFO: Spider opened 
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request 
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request 
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request 
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None) 
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished) 
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats: 
    {'downloader/request_bytes': 651, 
    'downloader/request_count': 3, 
    'downloader/request_method_count/GET': 3, 
    'downloader/response_bytes': 504, 
    'downloader/response_count': 3, 
    'downloader/response_status_count/400': 3, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931), 
    'scheduler/memory_enqueued': 3, 
    'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)} 
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished) 
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats: 
    {'memusage/max': 263143424, 'memusage/startup': 263143424} 

Does Scrapy not work for XML scraping? If so, can anyone please give me an example of how to scrape tag data from an XML file?

Thanks in advance.


Where does the log output come from? Where are the HTTP requests being executed? – 2012-07-24 11:21:39


@Tichodroma: I have edited my actual result in above, please take a look at it – 2012-07-24 11:26:34

Answer


There is a spider built specifically for scraping XML feeds. This is from the Scrapy documentation:

XMLFeedSpider example

These spiders are quite easy to use; let's take a look at an example:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        # Instantiate the imported item class (not the bare Item)
        item = TestItem()
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item
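
The example above imports TestItem from myproject.items. As a rough, minimal sketch (the field names simply mirror what parse_node fills in; the rest is an assumption, not code from the Scrapy docs), the item definition could look like this:

# myproject/items.py -- minimal sketch matching the fields used in parse_node above
from scrapy.item import Item, Field


class TestItem(Item):
    id = Field()
    name = Field()
    description = Field()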

Here is another way, without Scrapy:

This is a function for downloading the XML from a given URL (the imports it needs are included below); it also prints a simple progress indicator while the XML file downloads.

import sys
import urllib2


def get_file(self, dir, url, name):
    # Open the remote XML feed
    s = urllib2.urlopen(url)
    f = open('xml/test.xml', 'w')
    # Read the expected size from the Content-Length header
    meta = s.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (name, file_size)
    current_file_size = 0
    block_size = 4096
    # Download in 4 KB blocks, printing progress as we go
    while True:
        buf = s.read(block_size)
        if not buf:
            break
        current_file_size += len(buf)
        f.write(buf)
        status = ("\r%10d [%3.2f%%]" %
                  (current_file_size, current_file_size * 100. / file_size))
        status = status + chr(8) * (len(status) + 1)
        sys.stdout.write(status)
        sys.stdout.flush()
    f.close()
    print "\nDone getting feed"
    return 1
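
As written, get_file is a method (it takes an unused self and dir), so a direct call is only a sketch; assuming the hypothetical arguments below, it might be invoked like this:

# Hypothetical call: self and dir are unused, so placeholders are passed for them
get_file(None, 'xml', 'https://example.com/jobxml.asp', 'jobxml')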

Then you parse the XML file you downloaded and saved using iterparse, something like this:

from xml.etree.ElementTree import iterparse

# Walk the saved file and print the text of every <properties> element
for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text

This is just an example of how you can walk through the XML tree.

Also, this is my old code, so you would be better off opening the file with a with statement.
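
A minimal sketch of that suggestion, reusing the same 'xml/test.xml' path and <properties> tag assumed above, with a with block and elem.clear() so a multi-megabyte feed does not accumulate in memory:

from xml.etree.ElementTree import iterparse

# Sketch only: the file path and tag name are the same assumptions as above
with open('xml/test.xml') as xml_file:
    for event, elem in iterparse(xml_file):
        if elem.tag == "properties":
            print elem.text
        elem.clear()  # discard processed elements to keep memory use low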


Thanks for the reply. The only thing I did was inherit from XMLFeedSpider as you mentioned, and the code I ran still hits the same retry problem. Could it be a problem with the URL? (It is very large; in fact, if we save it to the local desktop the total size is about 7.6 MB.) – 2012-07-24 11:31:57


It shouldn't be a problem; XML feeds are usually a few MB in size. But I can't say for sure because I have never used this spider myself; I actually use plain urllib2 to download the XML feed and then iterparse to parse it. If you want, I can send you a sample of that. – iblazevic 2012-07-24 11:35:09


Yes, that would definitely be useful to me – 2012-07-24 11:39:51