Vectorizing a for loop in R

I am looking for some simple vectorization method to speed up the for loop in my program. I have the following data frame with sentences and two dictionaries of positive and negative words:
# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
                             "wouldnt bad notebook", "very good quality", "orgtop",
                             "great improvement for that bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = c(1, 2, 3, 4, 5, 6, 7),
                   stringsAsFactors = FALSE)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
Now I replicate the original data frame to simulate a big data set:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000, sent$words))
library(zoo)  # coredata() comes from the zoo package
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]
rownames(sent) <- NULL
As the next step, I score the dictionary words themselves (positive word = 1 and negative word = -1) and order them by descending word length:
# Ordering words in pos/negWords by descending length
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)        # word length, so longer phrases are matched first
wordsDF <- wordsDF[order(-wordsDF$lengths), ]
rownames(wordsDF) <- NULL
Then I define the following function with a for loop:
# Sentiment score function
library(qdapRegex)  # for rm_white()
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<", wordsDF[x, 1], '\\>', sep = "")   # matching exact words
    count <- length(grep(matchWords, sentence))                  # count them
    if(count){
      score <- score + (count * wordsDF[x, 2])                   # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x, 1], '\\b\\s*', collapse = '|'), ' ', sentence)  # remove matched words from the sentence
      sentence <- rm_white(sentence)                             # squeeze leftover whitespace
    }
  }
  score
}
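As a quick sanity check (not part of the benchmark): based on the dictionary above, a sentence such as "very good quality" should score 1, because the longer phrase "very good" matches first and is then removed, so the shorter entries "very" and "good" are not counted again:

# Single-sentence check: "very good" (+1) matches and is removed first,
# so "good" and "very" do not match afterwards
scoreSentence2("very good quality")
# expected result: 1 (matches the desired output for user 4)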
I call the above function on the sentences in my data frame:
# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in sent data.frame:
# user system elapsed
# 1054.19 0.09 1056.17
# Add sentiment score to origin sent data.frame
sent <- cbind(sent, SentimentScore2)
The desired output is:
Words user SentimentScore2
just right size and i love this notebook 1 2
benefits great laptop 2 1
wouldnt bad notebook 3 1
very good quality 4 1
orgtop 5 0
.
.
.
and so forth ...
Please, could anyone help me reduce the computation time of my original approach? Since my programming skills in R are at a beginner level, I am at the end of my rope :-) Any help or suggestion will be very much appreciated. Thank you very much in advance.
As I understand from the code, you want to remove the detected words, but the desired output still has them. So which part is incorrect, or am I reading it wrong? – LauriK 2015-02-23 09:52:09
Please explain in more detail what you are trying to achieve with the SentimentScore2 function – StrikeR 2015-02-23 09:57:22
Removing the words is part of my approach. After ordering the words in pos/negWords by descending length, I match them against the words in the sentence and then remove them, so that they do not get matched again in a later loop iteration. The desired output has to contain them, but it takes a very long time, so that is the problem... – martinkabe 2015-02-23 10:00:13
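For reference, here is a minimal sketch of one possible vectorized rewrite of that idea (an illustration, not the original code): the short loop over the dictionary stays, but each pattern is applied to all sentences at once with vectorized grepl()/gsub() instead of calling the scoring function once per sentence. The name scoreSentenceVec is made up for this sketch, and it assumes the wordsDF ordering built above.

# Sketch: vectorize over sentences instead of looping over them one by one.
# Same logic as scoreSentence2: match the longest dictionary entries first,
# score +1/-1 per sentence that contains them, then remove the match so
# shorter entries cannot re-match.
scoreSentenceVec <- function(sentences, dict = wordsDF) {
  scores <- numeric(length(sentences))
  for (x in seq_len(nrow(dict))) {
    pat <- paste0("\\b", dict$words[x], "\\b")
    hits <- grepl(pat, sentences)                        # TRUE/FALSE per sentence, like grep() per sentence
    scores <- scores + as.integer(hits) * dict$value[x]  # add +1 or -1 for each matched sentence
    sentences[hits] <- gsub(pat, " ", sentences[hits])   # drop matched words from those sentences
  }
  scores
}

# Usage, analogous to the lapply() call above:
# SentimentScoreVec <- scoreSentenceVec(sent$words)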