2014-05-15 30 views
2

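Twitter Sentiment Analysis w R using the German language set SentiWS

I am referring to a previously asked question: I want to run a sentiment analysis on German tweets, and I am using the code below from the Stack Overflow thread I mentioned. However, I would like the analysis to return the actual sentiment scores as a result, not just the sum of TRUE/FALSE values for whether a word is positive or negative. Any ideas for an easy way to do this?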

You can also find the word lists in the previous thread.

library(plyr) 
library(stringr) 

readAndflattenSentiWS <- function(filename) { 
    words = readLines(filename, encoding="UTF-8") 
    words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words) 
    words <- unlist(strsplit(words, ",")) 
    words <- tolower(words) 
    return(words) 
} 
pos.words <- c(scan("Post3/positive-words.txt",what='character', comment.char=';', quiet=T), 
       readAndflattenSentiWS("Post3/SentiWS_v1.8c_Positive.txt")) 
neg.words <- c(scan("Post3/negative-words.txt",what='character', comment.char=';', quiet=T), 
       readAndflattenSentiWS("Post3/SentiWS_v1.8c_Negative.txt")) 

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') { 
    require(plyr) 
    require(stringr) 
    scores = laply(sentences, function(sentence, pos.words, neg.words) 
    { 
    # clean up sentences with R's regex-driven global substitute, gsub(): 
    sentence = gsub('[[:punct:]]', '', sentence) 
    sentence = gsub('[[:cntrl:]]', '', sentence) 
    sentence = gsub('\\d+', '', sentence) 
    # and convert to lower case: 
    sentence = tolower(sentence) 
    # split into words. str_split is in the stringr package 
    word.list = str_split(sentence, '\\s+') 
    # sometimes a list() is one level of hierarchy too much 
    words = unlist(word.list) 
    # compare our words to the dictionaries of positive & negative terms 
    pos.matches = match(words, pos.words) 
    neg.matches = match(words, neg.words) 
    # match() returns the position of the matched term or NA 
    # I don't just want a TRUE/FALSE! How can I do this? 
    pos.matches = !is.na(pos.matches) 
    neg.matches = !is.na(neg.matches) 
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum(): 
    score = sum(pos.matches) - sum(neg.matches) 
    return(score) 
    }, 
    pos.words, neg.words, .progress=.progress) 
    scores.df = data.frame(score=scores, text=sentences) 
    return(scores.df) 
} 

sample <- c("ich liebe dich. du bist wunderbar", 
      "Ich hasse dich, geh sterben!", 
      "i love you. you are wonderful.", 
      "i hate you, die.") 
(test.sample <- score.sentiment(sample, 
           pos.words, 
           neg.words)) 
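Does your code run correctly? I would guess 'laply' should be 'lapply', but the post you cite writes it the same way... –

Yes, it runs and works. I actually tried changing it to lapply, and then it no longer worked. I am still fairly new to these functions, so I don't know why... – juliasb

Ah, 'laply' is part of plyr! Glad I didn't edit in a "fix", then :-) –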

Answers

0

As a starting point, this line:

words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words) 

is saying "throw away the POS information and the sentiment value", which leaves you with just the list of words.
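To see what that substitution does, here is a small sketch applied to a single line. The sample line assumes the SentiWS layout `Word|POS<TAB>value<TAB>inflections` that the pattern implies; the POS tag `NN` is an assumption for illustration.

```r
# One SentiWS-style line (assumed layout: Word|POS<TAB>value<TAB>inflections)
line <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"

# The pattern matches "|NN<TAB>-0.058<TAB>" and replaces it with a comma,
# so the POS tag and the sentiment value are lost
flattened <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", line)
flattened
# → "Abbau,Abbaus,Abbaues,Abbauen,Abbaue"

# Splitting on commas then yields the plain word vector
unlist(strsplit(flattened, ","))
```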

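So to do what you want, you will need to parse the data differently, and you will need a different data structure. readAndflattenSentiWS currently returns a vector, but you will want it to return a lookup table. Using an env object feels good, but if I also wanted the POS information, a data.frame starts to feel right.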

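After that, most of the main loop can stay roughly the same, but you need to collect the sentiment values and sum them, rather than just counting the positive and negative matches.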
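The parsing change described above might look like the following minimal sketch. It assumes each SentiWS line has the form `Word|POS<TAB>value<TAB>comma-separated inflections`, which is what the `sub()` pattern in the question implies. `parseSentiWSLines` and `scoreWords` are hypothetical names, not part of any package, and the second sample line is shortened for illustration.

```r
# Hypothetical helper: turn SentiWS-style lines into a lookup table.
# Assumed line format: "Word|POS<TAB>value<TAB>form1,form2,..."
parseSentiWSLines <- function(lines) {
    fields <- strsplit(lines, "\t", fixed = TRUE)
    rows <- lapply(fields, function(f) {
        key <- strsplit(f[1], "|", fixed = TRUE)[[1]]  # c(word, POS)
        forms <- key[1]
        # the third field (inflected forms) is optional
        if (length(f) >= 3 && nzchar(f[3])) {
            forms <- c(forms, strsplit(f[3], ",", fixed = TRUE)[[1]])
        }
        data.frame(word = tolower(forms), pos = key[2],
                   value = as.numeric(f[2]), stringsAsFactors = FALSE)
    })
    lexicon <- do.call(rbind, rows)
    # lower-casing can create duplicates; keep the first occurrence
    lexicon[!duplicated(lexicon$word), ]
}

# Sum the sentiment values of all words that appear in the lexicon;
# words without a match contribute nothing
scoreWords <- function(words, lexicon) {
    sum(lexicon$value[match(words, lexicon$word)], na.rm = TRUE)
}

demo.lines <- c("Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue",
                "Abbruch|NN\t-0.0048")
lexicon <- parseSentiWSLines(demo.lines)
scoreWords(c("der", "abbau", "und", "abbruch"), lexicon)
```

Inside score.sentiment, the two match() calls and the TRUE/FALSE sums could then be replaced by one scoreWords() call per lexicon, with the real lines coming from readLines() on the SentiWS files.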

1

"Any ideas for an easy way to do this?"

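Well, yes. I am doing the same thing with a lot of tweets. If you are really getting into sentiment analysis, you should have a look at the Text Mining (tm) package.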

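You will see that working with a document-term matrix makes life much easier. However, I must warn you: judging from several journal articles I have read, bag-of-words methods usually classify only about 60% of sentiments correctly. If you are really interested in doing high-quality research, you should dig into the excellent "Artificial Intelligence: A Modern Approach" by Stuart Russell and Peter Norvig.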
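To illustrate why a document-term matrix helps, here is a minimal base-R sketch; it deliberately avoids tm so that it runs without extra packages (tm's `DocumentTermMatrix` would produce the count matrix from a corpus). The tweets and the weights are made-up placeholders, not real SentiWS values.

```r
# Illustrative tweets and made-up weights (NOT real SentiWS values)
tweets <- c("ich liebe dich", "ich hasse dich", "ich liebe liebe")
weights <- c(liebe = 0.5, hasse = -0.5)

# Tokenise and build the vocabulary
tokens <- strsplit(tolower(tweets), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per tweet, one column per term
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Align the weights with the vocabulary; unknown terms get weight 0
w <- weights[vocab]
w[is.na(w)] <- 0

# A single matrix product scores every tweet at once
scores <- as.numeric(dtm %*% w)
scores
```

Once the counts are in matrix form, scoring all tweets is one multiplication instead of a loop over words.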

...so this is definitely not a quick'n'dirty fix-my-sentiment approach. However, I was at exactly this point two months ago.

"However, I would like the analysis to return the actual sentiment scores as a result"

Since I have been there: you can convert your SentiWS files into a nice CSV file like this (shown here for the negative words):

NegBegriff NegWert 
Abbau -0.058 
Abbaus -0.058 
Abbaues -0.058 
Abbauen -0.058 
Abbaue -0.058 
Abbruch -0.0048 
... 

Then you can import it into R as a nice data.frame. I used this code snippet:

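### `words` is assumed to be a list with one vector of tokens per tweet
tweets.list.sentiment <- numeric(0)  # must exist before append() is called
### for all your words in each tweet in a row
for (n in seq_along(words)) {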

    ## get the position of the match /in your sentiWS-file/ 
    tweets.neg.position <- match(unlist(words[n]), neg.words$NegBegriff) 
    tweets.pos.position <- match(unlist(words[n]), pos.words$PosBegriff) 

    ## now use the positions, to find the matching values and sum 'em up 
    score.pos <- sum(pos.words$PosWert[tweets.pos.position], na.rm = T) 
    score.neg <- sum(neg.words$NegWert[tweets.neg.position], na.rm = T) 
    score <- score.pos + score.neg 

    ## now we have the sentiment for one tweet, push it to the list 
    tweets.list.sentiment <- append(tweets.list.sentiment, score) 
    ## and go again. 
} 

## look how beautiful! 
summary(tweets.list.sentiment) 

### caveat: This code is pretty ugly and not at all good use of R, 
### however it works sufficiently. I am using approach from above, 
### thus I did not need to rewrite the latter. Up to you ;-) 
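As the caveat says, the loop is not the most idiomatic R. A hedged sketch of the same idea with vapply() follows. It keeps the column names used above (NegBegriff, NegWert, PosBegriff, PosWert); `scoreTweet` is a hypothetical name, the negative values are taken from the CSV excerpt above, and the positive entry is a made-up placeholder. Because the question lower-cases the tweets while the lexicon is capitalised, the lexicon is lower-cased before matching.

```r
# Lexicon tables with the column names used in the answer.
# Negative values come from the CSV excerpt above; the positive
# entry is a made-up placeholder, not a real SentiWS value.
neg.words <- read.table(text = "NegBegriff NegWert
Abbau -0.058
Abbaus -0.058
Abbruch -0.0048", header = TRUE, stringsAsFactors = FALSE)
pos.words <- data.frame(PosBegriff = "wunderbar", PosWert = 0.5,
                        stringsAsFactors = FALSE)

# `words` is assumed to be a list with one token vector per tweet
words <- list(c("der", "abbau", "ist", "wunderbar"),
              c("abbruch", "und", "abbau"))

# Score one tweet: look up each token and sum the matching values
scoreTweet <- function(w) {
    pos <- match(w, tolower(pos.words$PosBegriff))
    neg <- match(w, tolower(neg.words$NegBegriff))
    sum(pos.words$PosWert[pos], na.rm = TRUE) +
        sum(neg.words$NegWert[neg], na.rm = TRUE)
}

# One numeric score per tweet, without growing a vector in a loop
tweets.list.sentiment <- vapply(words, scoreTweet, numeric(1))
summary(tweets.list.sentiment)
```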

Well, I hope it works. (For my example, it did.)

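The trick is getting SentiWS into a nice form, which can be done with simple text manipulation using Excel macros, GNU Emacs, sed, or whatever else you feel comfortable working with.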