2015-04-08
1

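Stemming in R does not work as expected

I am trying to do a very simple stemming task in R and getting a very unexpected result. In the code below, the `complete` variable is `NA`. Why can't I complete a simple stem?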

library(tm) 
library(SnowballC) 
dict <- c("easy") 
stem <- stemDocument(dict, language = "english") 
complete <- stemCompletion(stem, dictionary=dict) 

Thanks!

Answers

1

You can see the internals of the stemCompletion() function by running tm:::stemCompletion:

function (x, dictionary, type = c("prevalent", "first", "longest", "none", "random", "shortest")) {
    if (inherits(dictionary, "Corpus"))
        dictionary <- unique(unlist(lapply(dictionary, words)))
    type <- match.arg(type)
    possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))
    switch(type, first = {
        setNames(sapply(possibleCompletions, "[", 1), x)
    }, longest = {
        ordering <- lapply(possibleCompletions, function(x) order(nchar(x),
            decreasing = TRUE))
        possibleCompletions <- mapply(function(x, id) x[id],
            possibleCompletions, ordering, SIMPLIFY = FALSE)
        setNames(sapply(possibleCompletions, "[", 1), x)
    }, none = {
        setNames(x, x)
    }, prevalent = {
        possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x),
            decreasing = TRUE))
        n <- names(sapply(possibleCompletions, "[", 1))
        setNames(if (length(n)) n else rep(NA, length(x)), x)
    }, random = {
        setNames(sapply(possibleCompletions, function(x) {
            if (length(x)) sample(x, 1) else NA
        }), x)
    }, shortest = {
        ordering <- lapply(possibleCompletions, function(x) order(nchar(x)))
        possibleCompletions <- mapply(function(x, id) x[id],
            possibleCompletions, ordering, SIMPLIFY = FALSE)
        setNames(sapply(possibleCompletions, "[", 1), x)
    })
}

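The `x` argument is your stemmed terms, and `dictionary` is the unstemmed terms. The only line that matters is the fifth, which builds `possibleCompletions`: it does a simple regular expression match, looking for each stemmed term at the start of the terms in the dictionary.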

possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE)) 

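So it fails because the pattern "^easi" finds no match in "easy". If your dictionary also contains the word "easiest", then both stems are matched, because there is now a dictionary word that begins with the same four letters.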
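To make the failing match concrete, here is a minimal base R sketch that reproduces only the `grep` step from the function above, without loading tm or SnowballC. The variable names (`pattern`, `no_match`, `with_match`) are just illustrative, and the stem "easi" is hard-coded from the result reported in this thread rather than computed.

```r
# Base R only; this isolates the pattern match used inside stemCompletion.
stem <- "easi"                   # stem reported for "easy" in this thread
pattern <- sprintf("^%s", stem)  # same pattern construction as in the function

# No dictionary word starts with "easi", so nothing matches.
no_match <- grep(pattern, c("easy"), value = TRUE)

# Adding "easiest" gives the pattern a word it can match.
with_match <- grep(pattern, c("easy", "easiest"), value = TRUE)

print(no_match)
print(with_match)
```

The first call returns an empty character vector, which is why the completion ends up as `NA`; the second returns "easiest".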

library(tm) 
library(SnowballC) 
dict <- c("easy","easiest") 
stem <- stemDocument(dict, language = "english") 
complete <- stemCompletion(stem, dictionary=dict) 
complete 
    easi easiest 
"easiest" "easiest" 
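If what you want is for "easi" to map back to "easy", one possible workaround is an exact lookup from each dictionary word's stem back to the word, instead of a prefix match. The sketch below is only an illustration: `complete_by_lookup` is a hypothetical helper, not part of tm, and the dictionary stems are hard-coded from the output above (in practice they would come from `stemDocument(dict)`).

```r
# Hypothetical helper, not part of tm: complete stems by exact lookup.
complete_by_lookup <- function(stems, dictionary, dictionary_stems) {
  # Keep the first dictionary word for each distinct stem.
  keep <- !duplicated(dictionary_stems)
  lookup <- setNames(dictionary[keep], dictionary_stems[keep])
  # Stems that are not in the lookup come back as NA.
  setNames(unname(lookup[stems]), stems)
}

dict <- c("easy", "easiest")
# Assumption: stems copied from the output above, normally stemDocument(dict).
dict_stems <- c("easi", "easiest")

completed <- complete_by_lookup(dict_stems, dict, dict_stems)
print(completed)
```

Unlike the prefix match, this maps "easi" to "easy" rather than "easiest", at the cost of only completing stems that some dictionary word actually produces.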

Thank you for the explanation! I guess I should now look at the stemming function to see why it actually turns the word 'easy' into 'easi'. – user2630162

0

wordStem() seems to do it..

library(tm) 
library(SnowballC) 
dict <- c("easy") 
wordStem(dict) 
[1] "easi" 

The stemming works. My point is that the completion doesn't. I would expect the stemCompletion function to replace easi with easy. Am I wrong to assume it should? – user2630162


Yeah, it looks like it just fails on "easy". Try `dict <- c("easy", "easiest", "easier")` and rerun. It seems it just can't figure out "easy". – cory


@cory That is exactly the case. See my answer for details. – christopherlovell