2016-01-22 77 views

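R text mining - handling plurals

I am learning text mining in R and have had fairly good success, but I am stuck on how to handle plurals. That is, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" as well.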

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.' 

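When asking a question on SO, you are expected to provide (1) sample data; (2) a list of the packages you are using; and (3) code that others can copy and paste to reproduce your problem. – 2016-01-22 02:34:42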


Take a look at this GitHub package written by Bob Rudis (@hrbrmstr): https://github.com/hrbrmstr/pluralize


`SnowballC::wordStem` might be useful here.
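
To illustrate the stemming suggestion, here is a minimal sketch that assumes the SnowballC package is installed. The word vector and the variable names are illustrative only and are not from the original thread. Note that a stemmer reduces words to stems, so the result may be a truncated form rather than a dictionary word; the point is only that singular and plural forms end up sharing one key.

```r
# Assumes SnowballC is installed: install.packages("SnowballC")
words <- c("nation", "nations", "dictionary", "dictionaries")

# Reduce each word to its stem using the English Snowball stemmer
stems <- SnowballC::wordStem(words, language = "english")

# Singular and plural forms that share a stem are now counted together
table(stems)
```

If readable singular forms are needed in the output, a singularizing approach is likely a better fit than stemming.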

Answer


Here is one possible solution. I use the pacman package to make the solution self-contained:

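# Install pacman if it is missing, then use it to install and load the other packages
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')  # GitHub-only package that provides singularize()
p_load(quanteda)                 # provides tokenize()

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'
# Split into tokens, flatten to a character vector, and singularize each token
# (tokenize() is from the quanteda release current at the time; newer releases use tokens())
singularize(unlist(tokenize(x)))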

## [1] "\""   "nation"  "\""   "and"  "\""   "nation"  "\""   
## [8] "to"   "be"   "counted" "a"   "the"  "same"  "word"  
## [15] "and"  "ideally" "\""   "dictionary" "\""   "and"  "\""   
## [22] "dictionary" "\""
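
Once tokens are singularized, counting them as the same word is a matter of tabulating the result. The sketch below shows that step in base R only. It does not use the pluralize package: `naive_singular` is a hypothetical helper with two crude English rules, written just to keep the example dependency-free, and it will get many irregular plurals wrong. It includes a length guard so that short words such as "as" are left alone, since the output above shows "as" reduced to "a".

```r
x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'

# Hypothetical helper (not from pluralize): two naive English plural rules
naive_singular <- function(words) {
  # "dictionaries" -> "dictionary"; skip short words such as "ties"
  ies <- nchar(words) > 4 & grepl("ies$", words)
  words[ies] <- sub("ies$", "y", words[ies])
  # "nations" -> "nation"; leave short words and -ss/-is/-us endings alone
  s <- !ies & nchar(words) > 3 & grepl("[^siu]s$", words)
  words[s] <- sub("s$", "", words[s])
  words
}

# Lower-case the text, extract runs of letters, singularize, then count
lower <- tolower(x)
words <- regmatches(lower, gregexpr("[a-z]+", lower))[[1]]
counts <- table(naive_singular(words))
counts[c("nation", "dictionary")]
```

With this input, "nation" and "dictionary" are each counted twice, because the plural forms fold into the singular ones before tabulation.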