R text mining - handling plurals

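I'm learning text mining in R and have had pretty good success, but I'm stuck on how to handle plurals. That is, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word.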

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'

2016-01-22 vw262363

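When asking a question on SO, you are asked to provide (1) sample data; (2) a list of the packages you are using; and (3) code that others can copy and paste to reproduce your problem. – 2016-01-22 02:34:42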

Take a look at this GitHub package written by Bob Rudis (@hrbrmstr): https://github.com/hrbrmstr/pluralize

`SnowballC::wordStem` might be useful here.
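To illustrate the stemming suggestion, here is a minimal sketch that assumes the CRAN package SnowballC is installed. The word list is taken from the question; the variable names `words` and `stems` are only illustrative. Stemming is cruder than singularizing: it trims suffixes, so a stem is not guaranteed to be a real word, but singular and plural forms should end up with the same stem.

```r
# Requires the CRAN package SnowballC: install.packages("SnowballC")
library(SnowballC)

words <- c("nation", "nations", "dictionary", "dictionaries")

# wordStem() maps each word to its stem; stems are not always dictionary words
stems <- wordStem(words, language = "english")

# Pair each original word with its stem to see which forms collapse together
data.frame(word = words, stem = stems)
```

If readable output matters more than merging word forms, a singularizer may be a better fit than a stemmer.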

Here is one possible solution. I use the pacman package to make the solution self-contained:

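# Install pacman if it is missing, then load it
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')  # GitHub-only package; provides singularize()
p_load(quanteda)                 # provides tokenize(); newer quanteda releases use tokens() instead

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'
# Split the string into tokens, then reduce each token to its singular form
singularize(unlist(tokenize(x)))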

## [1] "\""   "nation"  "\""   "and"  "\""   "nation"  "\""   
## [8] "to"   "be"   "counted" "a"   "the"  "same"  "word"  
## [15] "and"  "ideally" "\""   "dictionary" "\""   "and"  "\""   
## [22] "dictionary" "\""
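Since the goal is to count both forms as one word, the singularized tokens can then be tabulated. The sketch below uses base R only and copies the token vector by hand from the output above, so it does not depend on pluralize or quanteda; the names `singular_tokens`, `words_only`, and `word_counts` are illustrative. Note that the output above also turned "as" into "a", so it is worth checking short words before trusting the counts.

```r
# Tokens as printed above, after singularize(); copied by hand for illustration
singular_tokens <- c("\"", "nation", "\"", "and", "\"", "nation", "\"",
                     "to", "be", "counted", "a", "the", "same", "word",
                     "and", "ideally", "\"", "dictionary", "\"", "and", "\"",
                     "dictionary", "\"")

# Keep only tokens that contain a letter (this drops the quote marks)
words_only <- singular_tokens[grepl("[[:alpha:]]", singular_tokens)]

# Frequency table, most frequent first
word_counts <- sort(table(words_only), decreasing = TRUE)
print(word_counts)
```

Here "nation" and "dictionary" each get a count of 2, because the plural forms were merged into the singular ones before counting.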

2016-01-22 02:49:45
