R: using %in% to remove stop words from a character vector

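I have a data frame of strings from which I want to remove stop words. I am trying to avoid the tm package's own processing, because the dataset is large and tm seems to run rather slowly. I am only using tm's stopword dictionary.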

library(plyr) 
library(tm) 

stopWords <- stopwords("en") 
class(stopWords) 

df1 <- data.frame(id = seq(1,5,1), string1 = NA) 
head(df1) 
df1$string1[1] <- "This string is a string." 
df1$string1[2] <- "This string is a slightly longer string." 
df1$string1[3] <- "This string is an even longer string." 
df1$string1[4] <- "This string is a slightly shorter string." 
df1$string1[5] <- "This string is the longest string of all the other strings." 

head(df1) 
df1$string1 <- tolower(df1$string1) 
str1 <- strsplit(df1$string1[5], " ") 

> !(str1 %in% stopWords) 
[1] TRUE

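This is not the answer I am looking for. I am trying to get a vector or string of the words that are NOT in the stopWords vector.

What am I doing wrong?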

Source

2013-03-06 screechOwl

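The problem is obvious: string nbr 5 is not grammatically correct. :-) OK, I think Arun is right, assuming a "word" strictly means a run of characters with no spaces. After running his code over all the elements of 'df1$string', if you just want a list of the words rather than a count, you can apply 'unique'. – 2013-03-06 18:58:25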

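You are not accessing the list properly, and you are not extracting the elements from the result of %in% (which gives a logical vector of TRUE/FALSE). You should do something like this: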

unlist(str1)[!(unlist(str1) %in% stopWords)]

(or)

str1[[1]][!(str1[[1]] %in% stopWords)]
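
To see why the original call returned a single TRUE, it can help to look at what strsplit hands back. The following is a minimal sketch, assuming a small hand-written stop-word vector (stop_words is a hypothetical stand-in for stopwords("en"), so the sketch runs without tm):

```r
# Hypothetical stand-in for tm::stopwords("en"), so the sketch runs without tm
stop_words <- c("this", "is", "the", "of", "all", "other")

s <- "this string is the longest string of all the other strings."
parts <- strsplit(s, " ")

class(parts)    # "list": strsplit returns one list element per input string
length(parts)   # 1

# %in% sees a list with a single element, so the result has length 1
whole <- parts %in% stop_words
length(whole)   # 1

# Extract the character vector first, then filter it word by word
words <- parts[[1]]
kept <- words[!(words %in% stop_words)]
kept
```

Because the split is only on spaces, punctuation stays attached to the last token ("strings."), as in the outputs shown in this thread.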

For the whole data.frame df1, you could do:

'%nin%' <- Negate('%in%') 
lapply(df1[,2], function(x) { 
    t <- unlist(strsplit(x, " ")) 
    t[t %nin% stopWords] 
}) 

# [[1]] 
# [1] "string" "string." 
# 
# [[2]] 
# [1] "string" "slightly" "string." 
# 
# [[3]] 
# [1] "string" "string." 
# 
# [[4]] 
# [1] "string" "slightly" "shorter" "string." 
# 
# [[5]] 
# [1] "string" "string" "strings."

Source

2013-03-06 17:18:53 Arun

I did not realize str1 was output as a list; I thought it was a vector. Thanks. – screechOwl 2013-03-06 17:26:04

Thanks for using 'Negate' - I had completely forgotten about the goodies in the 'funprog' family. – 2013-03-06 20:55:03

It would be simpler to use 'setdiff', and you should use 'lapply' on the result of 'strsplit': 'lapply(strsplit(df1$string, " "), setdiff, stopWords)'. The only downside is that you get unique words. – hadley 2013-03-06 22:30:40
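
The trade-off mentioned in that comment can be illustrated with a short sketch. It assumes a tiny hypothetical stop-word vector (stop_words) and example sentences rather than the real tm dictionary:

```r
# Hypothetical stop-word vector; not the real tm dictionary
stop_words <- c("this", "is", "a")

tokens <- strsplit("this string is a string", " ")[[1]]

# Logical indexing with %in% keeps repeated words
by_in <- tokens[!(tokens %in% stop_words)]

# setdiff treats its inputs as sets, so duplicates collapse
by_setdiff <- setdiff(tokens, stop_words)

# The same idea over several strings at once
sentences <- c("this string is a string", "this is a longer string")
per_sentence <- lapply(strsplit(sentences, " "), setdiff, stop_words)
```

Here by_in holds "string" twice while by_setdiff holds it once, so setdiff is only a drop-in replacement when word counts do not matter.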

First. You should unlist str1, or use lapply if str1 is a vector (words here is the stop-word vector):

!(unlist(str1) %in% words) 
#> [1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE

Second. A more complex solution:

string <- c("This string is a string.", 
      "This string is a slightly longer string.", 
      "This string is an even longer string.", 
      "This string is a slightly shorter string.", 
      "This string is the longest string of all the other strings.") 
rm_words <- function(string, words) { 
    stopifnot(is.character(string), is.character(words)) 
    spltted <- strsplit(string, " ", fixed = TRUE) # fixed = TRUE for speedup 
    vapply(spltted, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1)) 
} 
rm_words(string, tm::stopwords("en")) 
#> [1] "string string."                  "string slightly longer string."  "string even longer string."
#> [4] "string slightly shorter string." "string longest string strings."
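
Since the question starts from a data frame, one possible way to use such a function is to write the cleaned text into a new column. This is only a sketch: stop_words is a hypothetical hand-written vector (swap in tm::stopwords("en") if tm is available), the cleaned column name is made up for illustration, and rm_words is repeated from above so that the block runs on its own.

```r
# Hypothetical stop-word vector; replace with tm::stopwords("en") if tm is installed
stop_words <- c("this", "is", "a", "an", "the", "of", "all", "other")

# Same approach as rm_words above, repeated so this block is self-contained
rm_words <- function(string, words) {
  stopifnot(is.character(string), is.character(words))
  pieces <- strsplit(string, " ", fixed = TRUE)
  vapply(pieces,
         function(x) paste(x[!tolower(x) %in% words], collapse = " "),
         character(1))
}

df1 <- data.frame(
  id = 1:2,
  string1 = c("This string is a string.",
              "This string is the longest string of all the other strings."),
  stringsAsFactors = FALSE  # keep string1 as character on older R versions
)

# "cleaned" is an illustrative column name, not part of the original question
df1$cleaned <- rm_words(df1$string1, stop_words)
df1$cleaned
```

Because vapply returns one string per row, the result lines up with the data frame without any further reshaping.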

Source

2016-01-05 08:48:23
