Stemming words with R

I'm having a hard time understanding how word stemming works in R.

For my example, I created the following corpus object:

a <- Corpus(VectorSource("device so much more funand unlike most android torrent download clients"))

So a is:

a[[1]]$content 

[1] "device so much more funand unlike most android torrent download clients"

The first word in this string is "device". I then created the term-document matrix:

b <- TermDocumentMatrix(a, control = list(stemming = TRUE))

and got this as output:

dimnames(b)$Terms 
[1] "android" "client" "devic" "download" "funand" "more"  "most"  "much"  "torrent" 
[10] "unlik"

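What I'd like to know is why I lose the "e" in "device" and "unlike", but not in "more".

How can I prevent this from happening to these words and to others like them?

Thanks.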

2015-08-26 Tomer

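Read the documentation for the Porter stemmer. This is off-topic on SO: use [CrossValidated](http://stats.stackexchange.com/search?q=Porter+stemmer), unless you really want to write a custom stemmer, which is a different question. – smci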

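Another option is to use the MorphAdorner lemmatizer at Northwestern University. This answer has the code for a lemmatize(...) function. As you can see there, it removes the "s" from "clients" but does not remove the "e" from "device".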

2015-08-27 06:01:47 jlhoward

I assume you are using the tm and SnowballC packages.

These packages do the work with the Porter stemming algorithm (for English).
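To see why the final "e" is dropped from "device" and "unlike" but kept in "more", it may help to look at one rule of Porter's algorithm in isolation. The sketch below is not the SnowballC code; it is a hand-written base-R approximation of Porter's step 5a only, assuming the published rule: drop a trailing "e" when the measure m of the remaining stem is greater than 1, or when m is 1 and the stem does not end consonant-vowel-consonant (with the last consonant other than w, x or y). The helper names cv_pattern, measure, ends_cvc and step5a are made up for illustration.

```r
# Classify each letter as consonant ("C") or vowel ("V") using Porter's rule:
# a, e, i, o, u are vowels; y is a vowel only when preceded by a consonant.
cv_pattern <- function(word) {
  chars <- strsplit(tolower(word), "")[[1]]
  pattern <- character(length(chars))
  for (i in seq_along(chars)) {
    if (chars[i] %in% c("a", "e", "i", "o", "u")) {
      pattern[i] <- "V"
    } else if (chars[i] == "y" && i > 1 && pattern[i - 1] == "C") {
      pattern[i] <- "V"
    } else {
      pattern[i] <- "C"
    }
  }
  paste(pattern, collapse = "")
}

# Porter's measure m: how many times a vowel run is followed by a consonant run.
measure <- function(stem) {
  hits <- gregexpr("VC", cv_pattern(stem), fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Porter's "*o" condition: stem ends consonant-vowel-consonant,
# and the final consonant is not w, x or y.
ends_cvc <- function(stem) {
  n <- nchar(stem)
  n >= 3 &&
    substr(cv_pattern(stem), n - 2, n) == "CVC" &&
    !(substr(stem, n, n) %in% c("w", "x", "y"))
}

# Step 5a only (illustrative): decide whether a trailing "e" is removed.
step5a <- function(word) {
  if (!endsWith(word, "e")) return(word)
  stem <- substr(word, 1, nchar(word) - 1)
  m <- measure(stem)
  if (m > 1 || (m == 1 && !ends_cvc(stem))) stem else word
}

vapply(c("device", "unlike", "more"), step5a, character(1), USE.NAMES = FALSE)
# [1] "devic" "unlik" "more"
```

Here "devic" and "unlik" both have m = 2, so the "e" goes. "mor" has m = 1 and ends consonant-vowel-consonant, so the "e" in "more" is kept. The real stemmer runs several other steps before this one; none of them happens to change these three words.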

If you want to play around with the available algorithms, you can run:

getStemLanguages()

and try the others; the only other English one built in is:

wordStem(words, language = "english")

which, for your data, returns the same result:

[1] "android" "client" "devic" "download" "funand" "more"  "most"  "much"  "torrent" 
[10] "unlik"
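On the "how do I avoid this" part: one possible approach, not something the question tried, is to keep stemming for counting and then map the stems back to full words with tm's stemCompletion(). A minimal sketch, assuming the tm package is installed and using the words of the original sentence as the dictionary (the variable names are just for illustration):

```r
library(tm)

# The original sentence from the question, split into words to serve as a dictionary.
text <- "device so much more funand unlike most android torrent download clients"
dictionary <- strsplit(text, " ")[[1]]

# A few of the stems reported in the term-document matrix above.
stems <- c("devic", "unlik", "client")

# Complete each stem with the first dictionary word that starts with it.
completed <- stemCompletion(stems, dictionary = dictionary, type = "first")
completed
```

If you don't need stemming at all, simply leaving stemming = TRUE out of the control list keeps the words as they are.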

2015-08-26 21:48:57 jeremycg
