Looping through a tm corpus without losing the corpus structure

I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus so that the loop sequentially removes each word in the list from the corpus.

Some reproducible data:

library(tm) 
m <- cbind(c("Apple blue two","Pear yellow five","Banana yellow two"), 
      c(1, 2, 3)) 
tm_corpus <- Corpus(VectorSource(m[,1])) 
words <- as.list(c("Apple", "yellow", "two"))

tm_corpus is a corpus object containing 3 documents:

<<SimpleCorpus>> 
Metadata: corpus specific: 1, document level (indexed): 0 
Content: documents: 3

words is a list of 3 words:

[[1]] 
[1] "Apple" 

[[2]] 
[1] "yellow" 

[[3]] 
[1] "two"

I have tried three different loops. The first is:

tm_corpusClean <- tm_corpus 
for (i in seq_along(tm_corpusClean)) { 
    for (u in seq_along(words)) { 
    tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords, words[[u]]) 
    } 
}

This returns the following error, with the warning repeated 7 times (numbered 1-7):

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions 
In addition: Warning messages: 
1: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,     
words[[u]]) : 
    number of items to replace is not a multiple of replacement length 
2: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,   
words[[u]]) : 
    number of items to replace is not a multiple of replacement length 
[...]

The second is:

tm_corpusClean <- tm_corpus 
for (i in seq_along(words)) { 
    for (u in seq_along(tm_corpusClean)) { 
    tm_corpusClean[u] <- tm_map(tm_corpusClean[u], removeWords, words[[i]]) 
    } 
}

This returns the error:

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions

The last loop is:

tm_corpusClean <- tm_corpus 
for (i in seq_along(words)) { 
    tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]]) 
}

This does return an object named tm_corpusClean, but that object only gives me the first document rather than all three of the originals:

inspect(tm_corpusClean[[1]]) 

<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 6 

blue

Where am I going wrong?

Source

2017-04-25 Rnout

Before we move on to sequential removal, first test whether tm_map works on your example:

obj1 <- tm_map(tm_corpus, removeWords, unlist(words)) 
sapply(obj1, `[`, "content") 

$`1.content` 
[1] " blue " 

$`2.content` 
[1] "Pear five" 

$`3.content` 
[1] "Banana "

Next, use lapply to remove one word at a time, i.e. "Apple", "yellow", "two":

obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word)) 
sapply(obj2, function(x) sapply(x, `[`, "content")) 

          [,1]                [,2]             [,3]
1.content " blue two"         "Apple blue two" "Apple blue "
2.content "Pear yellow five"  "Pear five"      "Pear yellow five"
3.content "Banana yellow two" "Banana two"     "Banana yellow "

Note that the resulting corpora are in a nested list (which is why two sapply calls are needed to view the contents).
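The lapply call above removes each word from the original corpus separately, so no single result has all three words removed. If cumulative removal is what is wanted, a minimal sketch using base R's Reduce could look like the following. This is not part of the original answer; the names docs and cleaned are made up for illustration, and it assumes a SimpleCorpus as created by Corpus(VectorSource(...)) in tm 0.7 or later.

```r
library(tm)

# Same example data as in the question (docs is an illustrative name)
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- list("Apple", "yellow", "two")

# Remove the words one at a time, feeding each intermediate corpus
# into the next removal step; the result is a single corpus
cleaned <- Reduce(
  function(corpus, word) tm_map(corpus, removeWords, word),
  words,
  tm_corpus
)

length(cleaned)   # number of documents left in the corpus
content(cleaned)  # text of every document, not just the first
```

This should have the same effect as the third loop in the question. That loop appears to keep all three documents as well: inspect(tm_corpusClean[[1]]) selects only the first document, so it can only ever show one.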

Source

2017-04-25 08:36:33

Hi Adam, thanks for your answer. Your code works, but it gives me NAs instead of the output you show: 'obj1 <- tm_map(tm_corpus, removeWords, unlist(words))' 'sapply(obj1, `[`, "content")' '[1] NA NA NA' 'obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word))' 'sapply(obj2, function(x) sapply(x, `[`, "content"))' '[,1] [,2] [,3] [1,] NA NA NA [2,] NA NA NA [3,] NA NA NA' Sorry, I couldn't figure out how to add line breaks. – Rnout

For 'obj1 <- tm_map(tm_corpus, removeWords, unlist(words))', if you check 'obj1[[1]]$content', what do you get? –

'obj1[[1]]$content' does return '[1] "blue"', so the NAs only appear after running 'sapply(obj1, `[`, "content")', which gives '[1] NA NA NA'. But it does seem to work on the corpus itself. :) – Rnout
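A possible explanation for the NAs, offered as an assumption rather than something confirmed in this thread: in recent tm versions, iterating over a SimpleCorpus with sapply seems to yield plain character strings, so indexing each one by the name "content" finds nothing. A minimal sketch of two ways to view every document's text that avoid that indexing (txt_all and txt_each are illustrative names):

```r
library(tm)

docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
obj1 <- tm_map(tm_corpus, removeWords, c("Apple", "yellow", "two"))

# Option 1: content() on the corpus returns the text of all documents
txt_all <- content(obj1)

# Option 2: pull out each document with [[ and convert it to character
txt_each <- vapply(
  seq_along(obj1),
  function(i) paste(as.character(obj1[[i]]), collapse = " "),
  character(1)
)

txt_all
txt_each
```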
