
Vectorizing a for loop in R

I am looking for some simple way to speed up my for-loop program in R. I have the following data frame with sentences and two dictionaries of positive and negative words:

# Create data.frame with sentences 
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop", 
         "wouldnt bad notebook", "very good quality", "orgtop", 
         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7), 
       stringsAsFactors=F) 

# Create pos/negWords 
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits", 
      "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great", 
      "wouldnt bad") 
negWords <- c("hate","bad","not good","horrible") 

Now I replicate the original data frame to simulate a large data set:

# Replicate original data.frame - big data simulation (700,000 rows of sentences) 
df.expanded <- as.data.frame(replicate(100000,sent$words)) 
library(zoo) # needed for coredata() below
sent <- coredata(sent)[rep(seq(nrow(sent)),100000),] 
rownames(sent) <- NULL 
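As an aside (not part of the original post), zoo is only pulled in here for coredata(); on a plain data frame the same row replication can be done in base R:

# Base-R equivalent of the replication above (no zoo dependency)
sent <- sent[rep(seq_len(nrow(sent)), 100000), ]
rownames(sent) <- NULL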

In the next step I build a scoring dictionary from the words (positive word = 1, negative word = -1) and sort it by word length in descending order, so that multi-word phrases are matched before their single-word parts:

# Ordering words in pos/negWords 
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F) 
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1)) 
wordsDF$lengths <- nchar(wordsDF$words) # word/phrase length in characters 
wordsDF <- wordsDF[order(-wordsDF[,3]),] 
rownames(wordsDF) <- NULL 
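To see why the descending-length ordering matters (a toy check, not in the original post): the phrase "very good" must match and be removed before the single word "good", otherwise one phrase would be counted twice:

# Match the longer phrase first, then strip it from the sentence
s <- "very good quality"
grepl("\\<very good\\>", s)                  # TRUE: phrase scores +1
s <- gsub("\\s*\\bvery good\\b\\s*", " ", s) # remove the matched phrase
grepl("\\<good\\>", s)                       # FALSE: no double counting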

Then I define the following function with a for loop:

# Sentiment score function 
scoreSentence2 <- function(sentence){ 
    score <- 0 
    for(x in 1:nrow(wordsDF)){ 
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- length(grep(matchWords,sentence)) # count them 
    if(count){ 
     score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from the sentence 
     library(qdapRegex) # provides rm_white() 
     sentence <- rm_white(sentence) # trim leftover whitespace 
    } 
    } 
    score 
} 
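To illustrate (a toy call, not in the original post): on a single sentence the function scores the longest matching phrase first and removes it, so its shorter sub-words cannot match again:

scoreSentence2("very good quality")
# [1] 1  ("very good" scores +1 and is removed; "very" and "good" no longer match)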

I call the function on the sentences in my data frame:

# Apply scoreSentence function to sentences 
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
# Time consumption for 700,000 sentences in sent data.frame: 
# user  system elapsed 
# 1054.19 0.09  1056.17 
# Add sentiment score to the original sent data.frame 
sent <- cbind(sent, SentimentScore2) 
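For reference (not in the original post), a timing like the one quoted in the comment can be taken with system.time(); actual numbers depend on the machine:

# Measure the scoring pass; elapsed values vary by machine
system.time(
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
)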

The desired output is:

words                                       user  SentimentScore2 
just right size and i love this notebook   1     2 
benefits great laptop                       2     1 
wouldnt bad notebook                        3     1 
very good quality                           4     1 
orgtop                                      5     0 
...

and so forth...

Please, could anyone help me reduce the computation time of my original approach? My programming skills in R are those of a beginner, and I have reached my limit :-) Any help or advice will be very much appreciated. Thank you very much in advance.


As I understand the code, you want to remove the detected words, but the desired output still contains them. So which part is incorrect, or am I reading it wrong? – LauriK 2015-02-23 09:52:09


Please explain in more detail what you are trying to achieve with the SentimentScore2 function – StrikeR 2015-02-23 09:57:22


Removing the words is part of my approach. After sorting the words in pos/negWords by descending length, I match them against the words in a sentence and then remove them, so that they are not matched again in a later iteration. The desired output must still contain them. But it takes so long, and that is the problem... – martinkabe 2015-02-23 10:00:13

Answers


In the spirit of "teach a man to fish rather than give him a fish", I will walk you through it:

  1. Make a copy of your code: you are going to mess it up!

  2. Find the bottleneck:

    2a: Make the problem smaller:

    nRep <- 100 
    df.expanded <- as.data.frame(replicate(nRep,sent$words)) 
    library(zoo) 
    sent <- coredata(sent)[rep(seq(nrow(sent)),nRep),] 
    

    2b: Keep a reference solution: you are going to change your code, and few activities are as amazing at introducing surprising bugs as optimizing code!

    sentRef <- sent 
    

    and add the same line, commented out, at the end of your code, to remember where your reference was taken. To make it even easier to check that you are not messing up your code, you can test automatically at the end of your code:

    library("testthat") 
    expect_equal(sent,sentRef) 
    

    2c: Run the profiler around the code of interest:

    Rprof() 
    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2)) 
    Rprof(NULL) 
    

    2d: Look at the results, in base R:

    summaryRprof() 
    

    There are also better tools; you can check the profr or lineprof packages.

    lineprof is my tool of choice; the real added value here is that it narrows the problem down to these two lines:

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- length(grep(matchWords,sentence)) # count them 
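
    For reference, a minimal profiling sketch (assumptions: lineprof was a GitHub-only package by Hadley Wickham, and line-level detail requires the profiled function to be loaded via source() from a file):

    # devtools::install_github("hadley/lineprof")  # lineprof is not on CRAN
    library(lineprof)
    prof <- lineprof(
        SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
    )
    prof  # per-line time and memory, pointing at the hot lines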
    
  3. Fix it.

    3.1 Fortunately, the main problem is quite easy: you do not need the first line to be inside the function; move it out beforehand. By the way, the same applies to your paste0(). Your code becomes:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*') 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
         count <- length(grep(matchWords[x],sentence)) # count them 
         if(count){ 
          score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
          sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from the sentence 
          # require(qdapRegex) # not needed here: the rm_white() call is disabled
          # sentence <- rm_white(sentence) 
         } 
        } 
        score 
    } 
    

    This brings the execution time for 1000 reps down from 5.64s to 2.32s. Not a bad return on investment!

    3.2 The next bottleneck is the 'count <-' line, but I think shadow has given just the right answer for it :-) Combining the two, we get:

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words 
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*') 
    
    # Sentiment score function 
    scoreSentence2 <- function(sentence){ 
        score <- 0 
        for(x in 1:nrow(wordsDF)){ 
         count <- grepl(matchWords[x],sentence) # count them 
         score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
         sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from the sentence 
         # require(qdapRegex) # not needed here: the rm_white() call is disabled
         # sentence <- rm_white(sentence) 
        } 
        score 
    } 
    

Here, that gets us to 0.18s, i.e. 31 times faster...
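As a final sanity check in the spirit of step 2b, the optimized function should reproduce the reference results. In this sketch, scoresRef is a hypothetical vector assumed to hold the output of the original, unoptimized scoreSentence2:

# scoresRef: assumed reference scores from the original implementation
library(testthat)
scores <- unlist(lapply(sent$words, scoreSentence2))
expect_equal(scores, scoresRef)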


Awesome! Thank you very much, sir, you have helped me a lot. Your approach is the best solution for my task. – martinkabe 2015-02-23 13:51:40


You can easily vectorize your scoreSentence2 function, since grepl and gsub are already vectorized:

library(qdapRegex) # provides rm_white() used inside the loop 

scoreSentence <- function(sentence){ 
    score <- rep(0, length(sentence)) 
    for(x in 1:nrow(wordsDF)){ 
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words 
    count <- grepl(matchWords, sentence) # count them 
    score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue) 
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from the sentence 
    sentence <- rm_white(sentence) # trim leftover whitespace 
    } 
    return(score) 
} 
scoreSentence(sent$words) 

Note that count does not actually count how many times an expression appears in a sentence (neither in your version nor in mine); it only tells you whether the expression appears at all. If you want to actually count the occurrences, you can use the following code.

count <- sapply(gregexpr(matchWords, sentence), function(x) length(x[x>0])) 
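A toy illustration (not from the original post): gregexpr() returns -1 for sentences with no match, which length(x[x > 0]) turns into a zero count:

# "good" occurs twice in the first sentence and never in the second
sentence <- c("good good quality", "bad product")
sapply(gregexpr("\\<good\\>", sentence), function(x) length(x[x > 0]))
# [1] 2 0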

Very nice, thank you very much, it is much faster now (3 times faster). – martinkabe 2015-02-23 12:38:03