Twitter Sentiment Analysis w R using German language set SentiWS

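I am referring to a previously asked question: I want to run a sentiment analysis on German tweets and have been using the code below, taken from the Stack Overflow thread I mentioned. However, I would like the analysis to return actual sentiment scores rather than just the sum of TRUE/FALSE matches, i.e. whether a word is positive or negative. Any ideas for an easy way to do this?

You can also find the word lists in the previous thread.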

library(plyr) 
library(stringr) 

readAndflattenSentiWS <- function(filename) { 
    words = readLines(filename, encoding="UTF-8") 
    words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words) 
    words <- unlist(strsplit(words, ",")) 
    words <- tolower(words) 
    return(words) 
} 
pos.words <- c(scan("Post3/positive-words.txt",what='character', comment.char=';', quiet=T), 
       readAndflattenSentiWS("Post3/SentiWS_v1.8c_Positive.txt")) 
neg.words <- c(scan("Post3/negative-words.txt",what='character', comment.char=';', quiet=T), 
       readAndflattenSentiWS("Post3/SentiWS_v1.8c_Negative.txt")) 

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
    require(plyr)
    require(stringr)
    scores = laply(sentences, function(sentence, pos.words, neg.words) {
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        # match() returns the position of the matched term or NA
        # I don't just want a TRUE/FALSE! How can I do this?
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        return(score)
    }, pos.words, neg.words, .progress=.progress)
    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}

sample <- c("ich liebe dich. du bist wunderbar", 
      "Ich hasse dich, geh sterben!", 
      "i love you. you are wonderful.", 
      "i hate you, die.") 
(test.sample <- score.sentiment(sample, 
           pos.words, 
           neg.words))

2014-05-15 juliasb

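Does your code run correctly? I would have guessed 'laply' should be 'lapply', but the post you cite writes it the same way...

Yes, it runs and works. I actually tried changing it to lapply, and then it stopped working. I am still fairly new to these functions, so I don't know why... – juliasb

Ah, 'laply' is part of plyr! Glad I didn't edit to "fix" it, now :-)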

As a starting point, this line:

words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)

says "throw away the POS information and the sentiment value", leaving you with just the word list.

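So to do what you want, you will need to parse the data differently, and you will need a different data structure. readAndflattenSentiWS currently returns a vector, but you will want it to return a lookup table (an env object feels right, but if I also wanted the POS information, a data.frame starts to feel right).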
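A minimal sketch of such a parser, under assumptions: the helper name readSentiWSTable is hypothetical, and the line layout (word|POS, tab, value, tab, comma-separated inflections) is inferred from the regex in the question rather than checked against the SentiWS files. The sample lines reuse the values quoted elsewhere in this thread, and the second line simply omits inflections.

```r
# Hypothetical helper (not from the thread): parse SentiWS-style lines into a
# long lookup table with one row per word form.
# Assumed layout per line: word|POS <TAB> value <TAB> form1,form2,...
readSentiWSTable <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  rows <- lapply(fields, function(f) {
    key <- strsplit(f[1], "|", fixed = TRUE)[[1]]   # c(word, POS)
    # Inflected forms are optional; each one inherits the base word's value
    forms <- if (length(f) >= 3 && nzchar(f[3])) {
      strsplit(f[3], ",", fixed = TRUE)[[1]]
    } else {
      character(0)
    }
    data.frame(word  = tolower(c(key[1], forms)),
               pos   = key[2],
               value = as.numeric(f[2]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Sample lines in the assumed layout
demo.lines <- c("Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue",
                "Abbruch|NN\t-0.0048")
neg.table <- readSentiWSTable(demo.lines)
print(neg.table)
```

For a real file the lines would presumably come from readLines(filename, encoding="UTF-8"), as in the question's own reader.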

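After that, most of your main loop can stay roughly the same, but you will need to collect the matched values and sum them, rather than just counting the positive and negative matches.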
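To illustrate the change to the loop, here is a rough sketch that sums lexicon values instead of counting matches. The function name scoreSentence and the toy lexicon are hypothetical, and the weights are made up for illustration; they are not SentiWS values.

```r
# Toy lexicon with made-up weights (illustration only, NOT real SentiWS values).
# In practice this table would come from the parsed SentiWS files.
lexicon <- data.frame(word  = c("liebe", "wunderbar", "hasse", "sterben"),
                      value = c(0.5, 0.7, -0.6, -0.4),
                      stringsAsFactors = FALSE)

# Hypothetical helper: score one sentence by summing the values of matched words
scoreSentence <- function(sentence, lexicon) {
  # Same cleanup steps as in the question's score.sentiment()
  sentence <- gsub("[[:punct:]]", "", sentence)
  sentence <- gsub("[[:cntrl:]]", "", sentence)
  sentence <- gsub("\\d+", "", sentence)
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  # match() gives the row of each word in the lexicon, or NA when absent
  idx <- match(words, lexicon$word)
  # Indexing with NA yields NA, which na.rm = TRUE drops from the sum
  sum(lexicon$value[idx], na.rm = TRUE)
}

tweets <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!")
scores <- vapply(tweets, scoreSentence, numeric(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
print(data.frame(score = scores, text = tweets))
```

Keeping positive and negative words in one table with signed values means a single sum gives the net score, so no separate subtraction step is needed.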

2014-05-16 00:45:56

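"Any ideas for an easy way to do this?"

Well, yes. I am doing the same thing with a lot of tweets. If you are really getting into sentiment analysis, you should take a look at the Text Mining (tm) package.

You will see that working with a document-term matrix makes life much easier. However, I must warn you: from reading several journal articles, bag-of-words methods usually classify only about 60% of sentiment correctly. If you are really interested in doing high-quality research, you should dig into Stuart Russell and Peter Norvig's excellent "Artificial Intelligence: A Modern Approach".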
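To make the document-term-matrix idea concrete, here is a small base-R sketch. It deliberately does not use the tm package API; it only mimics the structure (one row per document, one column per term) and then scores each document by multiplying with a weight vector. The weights are made up for illustration.

```r
# Two already-cleaned example documents
docs <- c("ich liebe dich du bist wunderbar",
          "ich hasse dich geh sterben")
tokens <- strsplit(docs, "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: rows are documents, columns are terms, cells are counts
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Made-up weights (illustration only); terms without an entry get weight 0
lexicon <- c(liebe = 0.5, wunderbar = 0.7, hasse = -0.6, sterben = -0.4)
weights <- lexicon[vocab]
weights[is.na(weights)] <- 0

# One matrix product scores every document at once
scores <- as.vector(dtm %*% weights)
print(scores)
```

With the matrix in place, scoring all tweets is a single matrix product, which is one reason this representation tends to be easier to work with than per-tweet loops.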

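...so this is definitely not a quick'n'dirty fix-my-sentiment approach. However, I was at exactly this point two months ago.

"However, I would like to do an analysis getting the actual sentiment scores as a result"

Since I have been there: you can convert your sentiWS into a nice CSV-style file like this (shown here for the negative words):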

NegBegriff NegWert 
Abbau -0.058 
Abbaus -0.058 
Abbaues -0.058 
Abbauen -0.058 
Abbaue -0.058 
Abbruch -0.0048 
...

Then you can import it into R as a nice data.frame. I used this code snippet:

### assumes: words is a list with one character vector of words per tweet,
### and pos.words / neg.words are the data.frames imported above
tweets.list.sentiment <- numeric(0)

### for all your words in each tweet in a row
for (n in seq_along(words)) {

    ## get the position of the match /in your sentiWS-file/
    tweets.neg.position <- match(unlist(words[n]), neg.words$NegBegriff)
    tweets.pos.position <- match(unlist(words[n]), pos.words$PosBegriff)

    ## now use the positions, to find the matching values and sum 'em up
    score.pos <- sum(pos.words$PosWert[tweets.pos.position], na.rm = TRUE)
    score.neg <- sum(neg.words$NegWert[tweets.neg.position], na.rm = TRUE)
    ## NegWert is already negative, so adding gives the net score
    score <- score.pos + score.neg

    ## now we have the sentiment for one tweet, push it to the list
    tweets.list.sentiment <- append(tweets.list.sentiment, score)
    ## and go again.
}

## look how beautiful!
summary(tweets.list.sentiment)

### caveat: This code is pretty ugly and not at all good use of R,
### however it works sufficiently. I am using approach from above,
### thus I did not need to rewrite the latter. Up to you ;-)

Well, I hope it works. (For my example, it did.)
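One detail that may be worth checking when importing the file: match() is case-sensitive, so if the tweets were lower-cased during cleanup (as in the question's code), capitalised entries such as "Abbau" would never match. A short sketch, assuming the whitespace-separated layout shown above and using read.table with inline text in place of a real file path:

```r
# Inline copy of the excerpt shown above; for a real file, its path would be
# passed to read.table() instead of 'text ='.
neg.words <- read.table(text = "NegBegriff NegWert
Abbau -0.058
Abbaus -0.058
Abbaues -0.058
Abbauen -0.058
Abbaue -0.058
Abbruch -0.0048",
                        header = TRUE, stringsAsFactors = FALSE)

# A lower-cased token does not match the capitalised lexicon entry
miss <- match("abbau", neg.words$NegBegriff)
print(miss)   # NA

# Lower-casing the lexicon once keeps both sides comparable
neg.words$NegBegriff <- tolower(neg.words$NegBegriff)
hit <- match("abbau", neg.words$NegBegriff)
print(neg.words$NegWert[hit])
```

Whether to lower-case the lexicon or leave the tweets in their original case is a design choice; the point is only that both sides need the same normalisation.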

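The trick is to get the sentiWS data into a nice form, which can be done with simple text manipulation using Excel macros, GNU Emacs, sed, or whatever else you feel comfortable working with.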

2015-09-02 09:08:27
