
Vectorizing a for loop in R

I am looking for some simple way to speed up my for-loop program in R. I have the following data frame with sentences and two dictionaries of positive and negative words:

# Create data.frame with sentences 
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop", 
         "wouldnt bad notebook", "very good quality", "orgtop", 
         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7), 
       stringsAsFactors=F) 

# Create pos/negWords 
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits", 
      "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great", 
      "wouldnt bad") 
negWords <- c("hate","bad","not good","horrible") 

Now I replicate the original data frame to simulate a large data set:

# Replicate original data.frame - big data simulation (700,000 rows of sentences) 
df.expanded <- as.data.frame(replicate(100000,sent$words)) 
library(zoo) # needed for coredata() below
sent <- coredata(sent)[rep(seq(nrow(sent)),100000),] 
rownames(sent) <- NULL 
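As an aside (not part of the original post), zoo is only pulled in here for coredata(); on a plain data frame the same row replication can be done in base R:

# Base-R equivalent of the replication above (no zoo dependency)
sent <- sent[rep(seq_len(nrow(sent)), 100000), ]
rownames(sent) <- NULL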

In the next step I build a scoring dictionary from the words (positive word = 1, negative word = -1) and sort it by word length in descending order, so that multi-word phrases are matched before their single-word parts:

# Ordering words in pos/negWords 
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F) 
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1)) 
wordsDF$lengths <- nchar(wordsDF$words) # word/phrase length in characters 
wordsDF <- wordsDF[order(-wordsDF[,3]),] 
rownames(wordsDF) <- NULL 
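To see why the descending-length ordering matters (a toy check, not in the original post): the phrase "very good" must match and be removed before the single word "good", otherwise one phrase would be counted twice:

# Match the longer phrase first, then strip it from the sentence
s <- "very good quality"
grepl("\\<very good\\>", s)                  # TRUE: phrase scores +1
s <- gsub("\\s*\\bvery good\\b\\s*", " ", s) # remove the matched phrase
grepl("\\<good\\>", s)                       # FALSE: no double counting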

Then I define the following function with a for loop:

# Sentiment score function 
scoreSentence2 <- function(sentence){ 
    score <- 0 
    for(x in 1:nrow(wordsDF)){ 
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- length(grep(matchWords,sentence)) # count them 
    if(count){ 
     score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from the sentence 
     library(qdapRegex) # provides rm_white() 
     sentence <- rm_white(sentence) # trim leftover whitespace 
    } 
    } 
    score 
} 
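To illustrate (a toy call, not in the original post): on a single sentence the function scores the longest matching phrase first and removes it, so its shorter sub-words cannot match again:

scoreSentence2("very good quality")
# [1] 1  ("very good" scores +1 and is removed; "very" and "good" no longer match)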

I call the function on the sentences in my data frame:

# Apply scoreSentence function to sentences 
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
# Time consumption for 700,000 sentences in sent data.frame: 
# user  system elapsed 
# 1054.19 0.09  1056.17 
# Add sentiment score to the original sent data.frame 
sent <- cbind(sent, SentimentScore2) 
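For reference (not in the original post), a timing like the one quoted in the comment can be taken with system.time(); actual numbers depend on the machine:

# Measure the scoring pass; elapsed values vary by machine
system.time(
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
)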

The desired output is:

words                                       user  SentimentScore2 
just right size and i love this notebook   1     2 
benefits great laptop                       2     1 
wouldnt bad notebook                        3     1 
very good quality                           4     1 
orgtop                                      5     0 
...

and so forth...

Please, could anyone help me reduce the computation time of my original approach? My programming skills in R are those of a beginner, and I have reached my limit :-) Any help or advice will be very much appreciated. Thank you very much in advance.


As I understand the code, you want to remove the detected words, but the desired output still contains them. So which part is incorrect, or am I reading it wrong? – LauriK 2015-02-23 09:52:09


Please explain in more detail what you are trying to achieve with the SentimentScore2 function – StrikeR 2015-02-23 09:57:22


Removing the words is part of my approach. After sorting the words in pos/negWords by descending length, I match them against the words in a sentence and then remove them, so that they are not matched again in a later iteration. The desired output must still contain them. But it takes so long, and that is the problem... – martinkabe 2015-02-23 10:00:13

Answers


In the spirit of "teach a man to fish rather than give him a fish", I will walk you through it:

  1. Make a copy of your code: you are going to mess it up!

  2. Find the bottleneck:

    2a: Make the problem smaller:

    nRep <- 100 
    df.expanded <- as.data.frame(replicate(nRep,sent$words)) 
    library(zoo) 
    sent <- coredata(sent)[rep(seq(nrow(sent)),nRep),] 
    

    2b: Keep a reference solution: you are going to change your code, and few activities are as amazing at introducing surprising bugs as optimizing code!

    sentRef <- sent 
    

    and add the same line, commented out, at the end of your code, to remember where your reference was taken. To make it even easier to check that you are not messing up your code, you can test automatically at the end of your code:

    library("testthat") 
    expect_equal(sent,sentRef) 
    

    2c: Run the profiler around the code of interest:

    Rprof() 
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
    Rprof(NULL) 
    

    2d: Look at the results, in base R:

    summaryRprof() 
    

    There are also better tools; you can check the profr or lineprof packages.

    lineprof is my tool of choice; the real added value here is that it narrows the problem down to these two lines:

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- length(grep(matchWords,sentence)) # count them 
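
    For reference, a minimal profiling sketch (assumptions: lineprof was a GitHub-only package by Hadley Wickham, and line-level detail requires the profiled function to be loaded via source() from a file):

    # devtools::install_github("hadley/lineprof")  # lineprof is not on CRAN
    library(lineprof)
    prof <- lineprof(
        SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
    )
    prof  # per-line time and memory, pointing at the hot lines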
    
  3. Fix it.

    3.1 Fortunately, the main problem is quite easy: you do not need the first line to be inside the function; move it out beforehand. By the way, the same applies to your paste0(). Your code becomes:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*') 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
         count <- length(grep(matchWords[x],sentence)) # count them 
         if(count){ 
          score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
          sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from the sentence 
          # require(qdapRegex) # not needed here: the rm_white() call is disabled
          # sentence <- rm_white(sentence) 
         } 
        } 
        score 
    } 
    

    This brings the execution time for 1000 reps down from 5.64s to 2.32s. Not a bad return on investment!

    3.2 The next bottleneck is the 'count <-' line, but I think shadow has given just the right answer for it :-) Combining the two, we get:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*') 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
         count <- grepl(matchWords[x],sentence) # count them 
         score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
         sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from the sentence 
         # require(qdapRegex) # not needed here: the rm_white() call is disabled
         # sentence <- rm_white(sentence) 
        } 
        score 
    } 
    

Here, that gets us to 0.18s, i.e. 31 times faster...
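As a final sanity check in the spirit of step 2b, the optimized function should reproduce the reference results. In this sketch, scoresRef is a hypothetical vector assumed to hold the output of the original, unoptimized scoreSentence2:

# scoresRef: assumed reference scores from the original implementation
library(testthat)
scores <- unlist(lapply(sent$words, scoreSentence2))
expect_equal(scores, scoresRef)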


Awesome! Thank you very much, sir, you have helped me a lot. Your approach is the best solution for my task. – martinkabe 2015-02-23 13:51:40


You can easily vectorize your scoreSentence2 function, since grepl and gsub are already vectorized:

library(qdapRegex) # provides rm_white() used inside the loop 

scoreSentence <- function(sentence){ 
    score <- rep(0, length(sentence)) 
    for(x in 1:nrow(wordsDF)){ 
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- grepl(matchWords, sentence) # count them 
    score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from the sentence 
    sentence <- rm_white(sentence) # trim leftover whitespace 
    } 
    return(score) 
} 
scoreSentence(sent$words) 

Note that count does not actually count how many times an expression appears in a sentence (neither in your version nor in mine); it only tells you whether the expression appears at all. If you want to actually count the occurrences, you can use the following code.

count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0])) 
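A toy illustration (not from the original post): gregexpr() returns -1 for sentences with no match, which length(x[x > 0]) turns into a zero count:

# "good" occurs twice in the first sentence and never in the second
sentence <- c("good good quality", "bad product")
sapply(gregexpr("\\<good\\>", sentence), function(x) length(x[x > 0]))
# [1] 2 0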

Very nice, thank you very much, it is much faster now (3 times faster). – martinkabe 2015-02-23 12:38:03