@Tyler Rinker has already given the answer: just add another line of removeWords().
Here is a bit more detail.
Say your Excel file is called nuts.xls and has a single column of words, like this:
stopwords
peanut
cashew
walnut
almond
macadamia
In R you could then proceed like this:
library(gdata) # package with xls import function
library(tm)
# now load the excel file with the custom stoplist. Note two of the arguments here:
# strip.white removes the spaces that excel seems to insert, and stringsAsFactors
# stops the words from being imported as factors. You can use any args from
# read.table(), which is handy
nuts <- read.xls("nuts.xls", header=TRUE, stringsAsFactors=FALSE, strip.white=TRUE)
# now make some words to build a corpus to test for a two-step stopword removal process...
words1<- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
words2<- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
words3<- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
words.all <- data.frame(rbind(words1, words2, words3))
# note: newer tm releases expect DataframeSource() input to have doc_id and text columns
words.corpus <- Corpus(DataframeSource(words.all))
# now remove the standard list of stopwords, like you've already worked out
words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
# now remove the second set of stopwords, this time your custom set from the excel file,
# note that it has to be a reference to a character vector containing the custom stopwords
words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)
# have a look to see if it worked
inspect(words.corpus.nostopwords)
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$words1
, , , , apple, pear, orange, lime, mandarin, , ,
$words2
, , , , apple, pear, orange, lime, mandarin, , ,
$words3
, , , , apple, pear, orange, lime, mandarin, , ,
Success! The standard stopwords are gone, and so are the words in the custom list from the Excel file. No doubt there are other ways to do this.
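
One such variation, as a minimal sketch: if the stoplist is exported to CSV instead of .xls, base R's read.csv() accepts the same cleaning arguments. The temp file and the name nuts_csv are assumptions made up for this illustration, with stray spaces added on purpose to show what strip.white does.

```r
# Hypothetical CSV export of the stoplist, written to a temp file so the
# sketch runs on its own; two entries carry stray spaces on purpose
csv_path <- tempfile(fileext = ".csv")
writeLines(c("stopwords", " peanut", "cashew ", "walnut", "almond", "macadamia"),
           csv_path)

# Same cleaning arguments as in the read.xls() call
nuts_csv <- read.csv(csv_path, header = TRUE,
                     stringsAsFactors = FALSE, strip.white = TRUE)

# removeWords() needs a character vector, so check before using it
is.character(nuts_csv$stopwords)
nuts_csv$stopwords
```

If is.character() returns FALSE, the column was probably imported as a factor and can be converted with as.character() before it is passed to removeWords().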
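Instead of, or in addition to, stopwords("english"), add the stopwords from the Excel file as well. You can combine the vectors of words to make a single vector of stopwords. Those words are then not in the cloud. – 2011-12-23 20:25:14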
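
A rough sketch of that combined-vector idea, assuming a reasonably recent tm that provides VCorpus(). The names custom_stopwords, all_stopwords and docs are made up for the illustration; custom_stopwords stands in for the column read from the Excel file so the code runs without it.

```r
library(tm)

# Stand-in for nuts$stopwords (assumption: same five words as the Excel column)
custom_stopwords <- c("peanut", "cashew", "walnut", "almond", "macadamia")

# One character vector: the standard English list plus the custom words
all_stopwords <- unique(c(stopwords("english"), custom_stopwords))

# Small made-up corpus to test on
docs <- c("peanut, cashew, apple, pear, and, or, but",
          "walnut, almond, orange, lime, if, then, on")
corpus <- VCorpus(VectorSource(docs))

# A single removal pass instead of two
corpus_clean <- tm_map(corpus, removeWords, all_stopwords)

# Pull the cleaned text back out as a plain character vector
cleaned <- vapply(corpus_clean, as.character, character(1))
cleaned
```

This should give the same result as two successive removeWords() calls. The single vector is mainly a convenience when the same stoplist is reused in several places.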