Scrapy: store returned items in a variable to use in the main script

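I'm quite new to Scrapy and want to try the following: extract some values from a web page, store them in a variable, and use that variable in my main script. So I followed their tutorial and changed the code for my purposes: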

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': False,  # boolean rather than the string 'False'
    }

    def parse(self, response):
        global title  # This would work, but there should be a better way
        title = response.css('title::text').extract_first()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()  # the script will block here until the crawling is finished

print(title)  # Verify if it works and do some other actions later on...

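This works so far, but I'm pretty sure that defining the title variable as global is not good style and may even have some bad side effects. If I skip that line, I of course get an "undefined variable" error :/ So I'm looking for a way to return the variable and use it in my main script.

I have read about item pipelines, but I was not able to make them work.

Any help/ideas are greatly appreciated :) Thanks in advance!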

Source

2017-12-27 MaGi

Better to use 'global'; it will be easier. A pipeline will not help you. – furas

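Using global, as you know, is not good style, especially when you need to extend the requirements later.

My suggestion is to store the title in a file or a list and use it in the main process. If you want to process the title in another script, just open the file and read the title there. (Note: please ignore any indentation issues.)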

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess

namefile = 'namefile.txt'
current_title_session = []  # titles stored during the current session

# Load titles saved by earlier runs; start empty if the file does not exist yet.
try:
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.read().splitlines()
except FileNotFoundError:
    title_in_file = []


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': False,
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # Skip missing titles and titles that were already stored.
        if title and title not in title_in_file and title not in current_title_session:
            # Open, write and close right away so the title is flushed to disk.
            with open(namefile, 'a', encoding='utf-8') as file_append:
                file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start()  # the script will block here until the crawling is finished
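
For the "open the file and read the title in your script" part, a minimal sketch of the reading side could look like the following. It assumes the spider wrote one title per line to namefile.txt as in spider.py above; the helper name load_titles is made up for illustration and is not part of Scrapy.

```python
from pathlib import Path


def load_titles(path="namefile.txt"):
    """Return the stored titles, one per line, or [] if the file does not exist."""
    file = Path(path)
    if not file.exists():
        return []
    # Drop empty lines so stray blank lines in the file do not become titles.
    return [line for line in file.read_text(encoding="utf-8").splitlines() if line]


if __name__ == "__main__":
    # Run this after the crawl has finished (or from a completely separate script).
    for title in load_titles():
        print(title)
```

Because this uses only the standard library, the script that consumes the titles does not need Scrapy installed.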

Source

2017-12-29 03:55:54 AndyWang

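Thanks, this solves the problem with the global statement, although I'm not sure whether creating another file to handle it is elegant. Anyway, it works fine for me :-) – MaGi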

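Making a variable global should work for what you need, but as you said, it isn't good style.

I would actually recommend using a separate service for communication between processes, such as Redis, so you won't have conflicts between your spider and any other process.

It is very simple to set up and use; the documentation has a very simple example.

Instantiate the Redis connection inside the spider and again in the main process (think of them as separate processes). The spider sets the variable and the main process reads (or gets) the information.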
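
A rough sketch of that idea, assuming the third-party redis-py package and a Redis server running on localhost:6379. The key name and the helper functions publish_title, read_title and make_client are made up for illustration; only redis.Redis, set and get come from redis-py.

```python
TITLE_KEY = "quotes:title"  # hypothetical key name, pick whatever suits your project


def publish_title(client, title):
    """Spider side: store the scraped title under TITLE_KEY."""
    client.set(TITLE_KEY, title)


def read_title(client):
    """Main-script side: return the stored title, or None if nothing was set."""
    return client.get(TITLE_KEY)


def make_client():
    """Create a Redis connection (assumes a local server on the default port)."""
    import redis  # third-party package: pip install redis

    # decode_responses=True makes get() return str instead of bytes.
    return redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)


if __name__ == "__main__":
    # In the spider, parse() would call publish_title(make_client(), title).
    # Here both sides run in one script only to show the round trip.
    client = make_client()
    publish_title(client, "Quotes to Scrape")
    print(read_title(client))
```

The spider and the main script each create their own connection, so neither needs a shared Python variable.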

Source

2017-12-27 14:46:31 eLRuLL

Thanks, in the short term I'll go with furas' and AndyWang's answers, but if I find the time I'll read up on Redis :) – MaGi
