How can I use the Stanford CoreNLP Java library with Ruby for sentiment analysis?

I'm trying to run sentiment analysis on a large corpus of tweets stored in a local MongoDB instance, using Ruby on Rails 4, Ruby 2.1.2 and the Mongoid ORM.

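I've used the freely available https://loudelement-free-natural-language-processing-service.p.mashape.com API on Mashape.com, but it starts timing out after a few hundred tweets have been pushed through in rapid succession. Clearly it isn't meant to handle tens of thousands of tweets, which is understandable.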

So next I thought I'd use the Stanford CoreNLP library promoted here: http://nlp.stanford.edu/sentiment/code.html

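The default usage, apart from calling the library from Java 1.8 code, seems to be through XML input and output files. For my use case that is annoying, because I have tens of thousands of short messages rather than long text files. I'd like to use CoreNLP like a method and run a tweets.each type of loop.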

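I guess one way would be to build an XML file containing all the tweets, have a Java process read and parse that file, and then write the results back to the database, but that is unfamiliar territory for me and would be a lot of work.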

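So I was happy to find, on the site linked above, a way to run CoreNLP from the command line that accepts text on stdin, so I didn't have to start fiddling with the filesystem and could feed the text in directly instead. However, starting a separate JVM for each tweet adds a huge overhead compared to using the loudelement free sentiment analysis API.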

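Now, the code I wrote is ugly and slow, but it works. Still, I'm wondering whether there is a better way to run the CoreNLP Java program from Ruby, without having to fiddle with the filesystem (creating temp files and passing them as arguments) or write Java code?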

Here's the code I'm using:

def self.mass_analyze_w_corenlp # batch run the method in multiple Ruby processes
    todo = Tweet.all.exists(corenlp_sentiment: false).limit(5000).sort(follow_ratio: -1) # start with the "least spammy" tweets based on follow ratio
    counter = 0

    todo.each do |tweet|
        counter += 1

        fork { tweet.analyze_sentiment_w_corenlp } # run the analysis in a separate Ruby process

        if counter >= 5 # when five concurrent processes are running, wait until they finish to preserve memory
            Process.waitall
            counter = 0
        end
    end

    Process.waitall # also wait for the last, partially filled batch
end

def analyze_sentiment_w_corenlp # run the sentiment analysis for each tweet object
    text_to_be_analyzed = self.text.gsub("'"){" "}.gsub('"'){' '} # fetch the text field of the DB item and strip quotes that confuse the command line

    start = "echo '"
    finish = "' | java -cp 'vendor/corenlp/*' -mx250m edu.stanford.nlp.sentiment.SentimentPipeline -stdin"
    command_string = start+text_to_be_analyzed+finish # assemble the command for the command line usage below

    output = `#{command_string}` # run CoreNLP on the command line; backticks capture the command's stdout
    to_db = output.gsub(/\s+/, "").downcase # since CoreNLP uses indentation, remove unnecessary whitespace
    # output is in the format of "neutral", "positive", "negative" and so on

    puts "Sentiment analysis successful, sentiment is: #{to_db} for tweet #{text_to_be_analyzed}."

    self.corenlp_sentiment = to_db # insert result as a field to the object
    self.save! # sentiment analysis done!
end


2015-02-11 herb

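Consider writing a Java service (WSDL, SOAP, REST or simple TCP-based) and calling it from Ruby. That is the most common approach. If you can use JRuby, calling Java methods directly looks simple. [Here](http://stackoverflow.com/questions/7284161/how-to-call-java-api-from-ruby-1-8-or-1-9) are descriptions of ways to call Java code from Ruby without JRuby, but they look complicated. – Qualtagh 2015-02-12 04:52:51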

Have you seen this [Ruby port](https://github.com/louismullie/stanford-core-nlp) of CoreNLP? – diasks2 2015-02-12 04:57:02

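@diasks2, I think I've seen it, but based on the readme it doesn't look like it implements sentiment analysis. I'm very interested in the deep learning model that CoreNLP claims to use by default: http://nlp.stanford.edu/sentiment/ – herb 2015-02-12 07:35:02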

As suggested in @Qualtagh's comment, I decided to use JRuby.

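I first tried using Java with MongoDB as the interface (reading directly from MongoDB, analyzing with Java/CoreNLP and writing back to MongoDB), but the MongoDB Java driver was more complex to use than the Mongoid ORM I use with Ruby, which is why I felt JRuby was more appropriate.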

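Building a REST service in Java would have required me to first learn how to build REST services in Java, which might have been easy, or not. I didn't want to spend time figuring that out.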

So the code I needed to get this running was:

def analyze_tweet_with_corenlp_jruby 
    require 'java' 
    require 'vendor/CoreNLPTest2.jar' # I made this Java JAR with IntelliJ IDEA that includes both CoreNLP and my initialization class 

    analyzer = com.me.Analyzer.new # this is the Java class I made for running the CoreNLP analysis, it initializes the CoreNLP with the correct annotations etc. 
    result = analyzer.analyzeTweet(self.text) # self.text is where the text-to-be-analyzed resides 

    self.corenlp_sentiment = result # adds the result into this field in the MongoDB model 
    self.save! 
    return "#{result}: #{self.text}" # for debugging purposes 
end
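
One thing to watch in the method above: it builds a new com.me.Analyzer on every call. If that class loads the CoreNLP models in its constructor (an assumption, since the Java class isn't shown), the models would be reloaded for each tweet. Below is a minimal sketch of reusing a single instance across calls. `SentimentAnalyzerCache` and `FakeAnalyzer` are hypothetical names used only for illustration; `FakeAnalyzer` stands in for the Java class so the sketch runs without JRuby or CoreNLP.

```ruby
# Hypothetical helper: builds the expensive analyzer once and reuses it.
class SentimentAnalyzerCache
  # factory: any callable returning an object that responds to analyzeTweet
  def initialize(factory)
    @factory = factory
    @mutex = Mutex.new
    @analyzer = nil
  end

  # Creates the analyzer on first use; later calls return the same object.
  def analyzer
    @mutex.synchronize { @analyzer ||= @factory.call }
  end

  def analyze(text)
    analyzer.analyzeTweet(text)
  end
end

# Stand-in for the Java class (assumption: the real one exposes analyzeTweet).
class FakeAnalyzer
  @instances = 0
  class << self
    attr_accessor :instances
  end

  def initialize
    self.class.instances += 1 # count constructions to show reuse
  end

  def analyzeTweet(text)
    text.include?('good') ? 'positive' : 'neutral'
  end
end

cache = SentimentAnalyzerCache.new(-> { FakeAnalyzer.new })
results = ['good day', 'a day'].map { |t| cache.analyze(t) }
puts results.inspect
```

Under JRuby the factory would presumably be `-> { com.me.Analyzer.new }`, with the cache held in a constant or class-level variable so that every tweet shares it.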


2015-02-22 12:30:39 herb

Also, a friend suggested trying: https://github.com/louismullie/treat – herb 2015-02-22 12:34:15

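You can at least avoid the ugly and dangerous command-line stuff by using IO.popen to open an external process and communicate with it, for example: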

input_string = " 
foo 
bar 
baz 
" 

output_string =
    IO.popen("grep 'foo'", 'r+') do |pipe|
        pipe.write(input_string)
        pipe.close_write
        pipe.read
    end

puts "grep said #{output_string.strip} but not bar"
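
Applied to the question's method, the same idea removes the need to strip quotes from the tweet at all: when the command is given as an argument list rather than one string, no shell parses it, and the text travels over stdin. Below is a minimal sketch using Ruby's standard Open3 library. The `run_with_stdin` helper name is made up for illustration, and `cat` stands in for the `java ... SentimentPipeline -stdin` command from the question so that the example runs without CoreNLP.

```ruby
require 'open3'

# Hypothetical helper: runs a command given as an argv array and feeds
# `text` to its stdin. With several argv elements Ruby executes the program
# directly, so no shell ever sees the text.
def run_with_stdin(argv, text)
  output, status = Open3.capture2(*argv, stdin_data: text)
  raise "command failed: #{argv.join(' ')}" unless status.success?
  output
end

# Quotes and shell metacharacters pass through untouched.
tweet = %q{it's "great"; $(date) `whoami`}
echoed = run_with_stdin(['cat'], tweet)
puts echoed
```

For the question's case the argument list would presumably be `['java', '-cp', 'vendor/corenlp/*', '-mx250m', 'edu.stanford.nlp.sentiment.SentimentPipeline', '-stdin']`. This still starts one JVM per call, so it addresses the quoting problem but not the startup overhead.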

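Edit: to avoid the overhead of reloading the Java program for each item, you can open the pipe around the todo.each loop and communicate with the process like this: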

inputs = ['a', 'b', 'c', 'd'] 

IO.popen('cat', 'r+') do |pipe|
    inputs.each do |s|
        pipe.write(s + "\n")
        out = pipe.readline

        puts "cat said '#{out.strip}'"
    end
end

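That is, if the Java program supports this kind of line-buffered "batch" input. If it doesn't, however, it shouldn't be very hard to modify it to do so.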
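
One practical detail with a line-based protocol like this: tweets can themselves contain newlines, which would split one tweet into several input lines and put the replies out of step. Below is a minimal sketch that collapses whitespace before writing each line. It assumes the child process prints exactly one line per input line and flushes after each one, which is not verified here for CoreNLP. `each_reply` is a hypothetical helper name, and `cat` again stands in for the Java command.

```ruby
# Hypothetical helper: keeps one child process open for all texts and
# yields (text, reply) pairs. Assumes one reply line per input line.
def each_reply(argv, texts)
  IO.popen(argv, 'r+') do |pipe|
    texts.each do |text|
      line = text.gsub(/\s+/, ' ').strip # a tweet must occupy exactly one line
      pipe.puts(line)
      pipe.flush # make sure the line is sent before waiting for the reply
      yield text, pipe.readline.chomp
    end
  end
end

replies = []
each_reply(['cat'], ["first\ntweet", 'second tweet']) do |_text, reply|
  replies << reply
end
puts replies.inspect
```

If the real program emits more than one line per tweet (for example, one per sentence), the read side would need a different framing rule, such as an explicit end-of-record marker.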


2015-02-13 07:32:41 oseiskar
