2014-09-13

Scraper stops after parsing one link

I've been writing this web scraper and I can't figure out why it just stops. Here is the code:

import scrapy, MySQLdb, urllib 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors import LinkExtractor 
from scrapy import Request 


class MyItems(scrapy.Item):
    topLinks = scrapy.Field()
    artists = scrapy.Field()

class mp3Spider(CrawlSpider):
    name = 'mp3_scraper'
    allowed_domains = [
        'example.com'
    ]
    start_urls = [
        'http://www.example.com'
    ]

    def __init__(self, *a, **kw):
        super(mp3Spider, self).__init__(*a, **kw)
        self.item = MyItems()

    def parse(self, response):
        f = open('topLinks', 'w')
        self.item['topLinks'] = response.xpath("//div[contains(@class, 'en')]/a[contains(@class, 'hash')]/@href").extract()

        for x in range(len(self.item['topLinks'])):
            self.item['topLinks'][x] = 'http://www.example.com' + self.item['topLinks'][x]

        for x in range(len(self.item['topLinks'])):
            f.write(format(self.item['topLinks'][x]).encode('utf-8') + '\n')
            yield Request(url=self.item['topLinks'][x], callback=self.parse_artists)

    def parse_artists(self, response):
        f = open('artists', 'w')
        self.item['artists'] = response.xpath("//ul[contains(@class, 'artist_list')]/li/a/text()").extract()

        for x in range(len(self.item['artists'])):
            f.write(format(self.item['artists'][x]).encode('utf-8') + '\n')

So the parse function gets all the information I need, but parse_artists only parses one link. parse grabs every link I need — I can see that it does, because I print them to a file. So say it grabs the links example.com/artists/a, example.com/artists/b, and so on: parse_artists will only scrape example.com/artists/a and then stop. Any help would be appreciated, thanks. Sam

Edit: output log -

C:\Python27\python.exe C:/Users/sam/PycharmProjects/mp3_scraper/mp3_scraper/mp3_scraper/main.py 
2014-09-13 12:28:24-0400 [scrapy] INFO: Scrapy 0.24.2 started (bot: mp3_scraper) 
2014-09-13 12:28:24-0400 [scrapy] INFO: Optional features available: ssl, http11 
2014-09-13 12:28:24-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mp3_scraper.spiders', 'SPIDER_MODULES': ['mp3_scraper.spiders'], 'BOT_NAME': 'mp3_scraper'} 
2014-09-13 12:28:24-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2014-09-13 12:28:25-0400 [scrapy] INFO: Enabled item pipelines: 
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Spider opened 
2014-09-13 12:28:25-0400 [mp3_scraper] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2014-09-13 12:28:25-0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/> (referer: None) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/z/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/0..9/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/w/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/x/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/u/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/q/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/v/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/y/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:26-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/t/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/o/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/p/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/r/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/n/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/s/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/l/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/h/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/k/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/i/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/g/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/m/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:27-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/j/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/f/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/e/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/c/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/d/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:28-0400 [mp3_scraper] DEBUG: Crawled (200) <GET http://www.myfreemp3.cc/artists/b/> (referer: http://www.myfreemp3.cc/artists/) 
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Closing spider (finished) 
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 10106, 
    'downloader/request_count': 27, 
    'downloader/request_method_count/GET': 27, 
    'downloader/response_bytes': 887850, 
    'downloader/response_count': 27, 
    'downloader/response_status_count/200': 27, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2014, 9, 13, 16, 28, 28, 908000), 
    'log_count/DEBUG': 29, 
    'log_count/INFO': 7, 
    'request_depth_max': 1, 
    'response_received_count': 27, 
    'scheduler/dequeued': 27, 
    'scheduler/dequeued/memory': 27, 
    'scheduler/enqueued': 27, 
    'scheduler/enqueued/memory': 27, 
    'start_time': datetime.datetime(2014, 9, 13, 16, 28, 25, 315000)} 
2014-09-13 12:28:28-0400 [mp3_scraper] INFO: Spider closed (finished) 

Process finished with exit code 0 

Could you add the log output that Scrapy produces when you run your spider? – amgaera 2014-09-13 16:27:06


Yep, uploaded it – johnc31 2014-09-13 16:30:16

Answer


You open the artists file in w mode, which truncates the file if it already exists. So every call to parse_artists wipes out what the previous call wrote, and after the spider finishes, only the last item scraped remains in the file.

You should open the file for appending (mode a) to fix this:

def parse_artists(self, response): 
    f = open('artists', 'a') 
    ... 
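The difference between the two modes is easy to reproduce outside Scrapy. A minimal sketch (the file path and artist names here are made up for illustration; they are not from the spider):

```python
import os
import tempfile

# A throwaway file standing in for the spider's 'artists' output file.
path = os.path.join(tempfile.mkdtemp(), 'artists')

# Simulate three parse_artists calls writing with mode 'w':
# every open(..., 'w') truncates the file first.
for name in ('artist_a', 'artist_b', 'artist_c'):
    with open(path, 'w') as f:
        f.write(name + '\n')
with open(path) as f:
    print(f.read())  # only 'artist_c' survives

# The same three calls with mode 'a': each write appends,
# so the earlier lines are kept.
os.remove(path)
for name in ('artist_a', 'artist_b', 'artist_c'):
    with open(path, 'a') as f:
        f.write(name + '\n')
with open(path) as f:
    print(f.read())  # all three artists survive
```

The same reasoning explains the log above: all the artist pages were crawled with status 200, but each callback truncated the file, so only the last page's artists were left on disk.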

Oh, I was just being dense, thx bro! – johnc31 2014-09-13 17:41:45