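Looping through a tm corpus without losing the corpus structure
I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus so that the loop removes each word in the list from the corpus, one after another.
Some replication data: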
library(tm)
m <- cbind(c("Apple blue two","Pear yellow five","Banana yellow two"),
c(1, 2, 3))
tm_corpus <- Corpus(VectorSource(m[,1]))
words <- as.list(c("Apple", "yellow", "two"))
tm_corpus is now a corpus object of 3 documents:
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
and words is a list of 3 words:
[[1]]
[1] "Apple"
[[2]]
[1] "yellow"
[[3]]
[1] "two"
I have tried three different loops. The first one is:
tm_corpusClean <- tm_corpus
for (i in seq_along(tm_corpusClean)) {
for (u in seq_along(words)) {
tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords, words[[u]])
}
}
This returns the following error, together with warning messages numbered 1-7:
Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
In addition: Warning messages:
1: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
number of items to replace is not a multiple of replacement length
2: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
number of items to replace is not a multiple of replacement length
[...]
The second one is:
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
for (u in seq_along(tm_corpusClean)) {
tm_corpusClean[u] <- tm_map(tm_corpusClean[u], removeWords, words[[i]])
}
}
This returns the error:
Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
The last loop is:
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]])
}
This does return an object named tm_corpusClean, but inspecting it shows only the first document instead of all three original documents:
inspect(tm_corpusClean[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 6
blue
Where am I going wrong?
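For comparison, here is a minimal sketch of the third loop next to a single-call alternative, followed by two ways of inspecting the result. It assumes the same replication data as above. The name tm_corpusOneCall is made up for this illustration. Note that `[[1]]` extracts a single document, so inspecting `tm_corpusClean[[1]]` can only ever show the first one; inspecting the corpus itself shows all of them.

```r
library(tm)

# Same replication data as in the question
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- as.list(c("Apple", "yellow", "two"))

# The third loop: tm_map() applies removeWords to every document at once,
# so no inner loop over documents is needed
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]])
}

# removeWords also accepts a character vector of words, so a single call
# is an alternative (tm_corpusOneCall is a hypothetical name)
tm_corpusOneCall <- tm_map(tm_corpus, removeWords, unlist(words))

length(tm_corpusClean)        # number of documents in the corpus
inspect(tm_corpusClean)       # prints all documents
inspect(tm_corpusClean[[2]])  # prints one document, here the second
```

If the loop is kept only to remove the words one at a time, the single call is probably the simpler choice, since the corpus structure is never touched by subset assignment.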
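Hi Adam, thanks for your answer. Your code works, but it gives me NAs instead of the output you show: 'obj1 <- tm_map(tm_corpus, removeWords, unlist(words)); sapply(obj1, `[`, "content")' gives '[1] NA NA NA', and 'obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word)); sapply(obj2, function(x) sapply(x, `[`, "content"))' gives '[,1] [,2] [,3] [1,] NA NA NA [2,] NA NA NA [3,] NA NA NA'. Sorry, I could not figure out how to add line breaks. – Rnout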
For 'obj1 <- tm_map(tm_corpus, removeWords, unlist(words))', what do you get if you check 'obj1[[1]]$content'? –
'obj1[[1]]$content' does return '[1] "blue"', so the NAs only appear after running 'sapply(obj1, `[`, "content")', which gives '[1] NA NA NA'. But it does seem to work on the corpus itself. :) – Rnout
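One way to read the text of every document without going through sapply() is content() applied to the corpus as a whole. The sketch below assumes a SimpleCorpus, as printed in the question, for which content() returns a character vector; for other corpus classes the return type may differ. The whitespace cleanup at the end is an optional extra and the name cleaned is made up for this illustration.

```r
library(tm)

# Same replication data as in the question
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- as.list(c("Apple", "yellow", "two"))

obj1 <- tm_map(tm_corpus, removeWords, unlist(words))

# content() on the corpus returns the text of all documents together
cleaned <- content(obj1)

# removeWords leaves the surrounding spaces behind; collapse runs of
# whitespace and trim the ends (optional tidy-up, not part of tm_map)
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

A possible explanation for the NAs, offered as an assumption rather than a confirmed diagnosis: if iterating over this corpus class yields plain character strings, then indexing each string by the name "content" finds nothing and returns NA.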