How to add metadata to a tm Corpus object with tm_map

I have been reading different questions/answers (especially here and here) without managing to apply any of them to my case.

I have an 11,390-row matrix with the attributes id, author, and text, such as:

library(tm) 

m <- cbind(c("01","02","03","04","05","06"), 
      c("Author1","Author2","Author2","Author3","Author3","Author4"), 
      c("Text1","Text2","Text3","Text4","Text5","Text6"))

I would like to create a tm corpus out of it. I can quickly create my corpus with

tm_corpus <- Corpus(VectorSource(m[,3]))

which, for my 11,390-row matrix, finishes in

   user  system elapsed 
  2.383   0.175   2.557

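But when I try to add metadata to the corpus with

meta(tm_corpus, type="local", tag="Author") <- m[,2]

the execution time exceeds 15 minutes and counting (at which point I stopped the execution).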

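According to the discussion here, the time can likely be reduced significantly by processing the corpus with tm_map, something like

tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])

Still, I am not sure how to do this. It would probably be something like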

addMeta <- function(text, vector) { 
    meta(text, tag="Author") = vector[??] 
    text 
}

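For one thing, how do I pass tm_map a vector of values so that each value is assigned to the corresponding text in the corpus? Should I call the function from within a loop? Should I wrap the tm_map call in vapply?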

Source

2014-01-10 CptNemo

For me, the meta call works: meta(tm_corpus, type="corpus", tag="Author") <- m[,2] – user944351

Yes, tm_map is faster and it is the way to go. You should use it here with a global counter.

auths <- m[, 2]   # author for each document, in corpus order
i <- 0            # global counter, advanced once per document
tm_corpus <- tm_map(tm_corpus, function(x) { 
    i <<- i + 1 
    meta(x, "Author") <- auths[i] 
    x 
})
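The global counter only works if tm_map visits every document exactly once and in order. A minimal sketch of an index-based alternative follows, assuming a VCorpus and the sample matrix from the question. The names corpus and author_tag are made up for illustration, and I have not benchmarked this against the tm_map version.

```r
library(tm)

# Sample data from the question: columns are id, author, text
m <- cbind(c("01", "02", "03", "04", "05", "06"),
           c("Author1", "Author2", "Author2", "Author3", "Author3", "Author4"),
           c("Text1", "Text2", "Text3", "Text4", "Text5", "Text6"))

# VCorpus is requested explicitly so every document keeps its own metadata
corpus <- VCorpus(VectorSource(m[, 3]))

# Document-level ("local") metadata: the loop index pairs document i
# with row i of the matrix, so no shared counter is needed
for (i in seq_along(corpus)) {
  meta(corpus[[i]], "author") <- m[i, 2]
}

# Corpus-level ("indexed") metadata: one vectorised assignment into the
# data frame the corpus keeps alongside its documents
# (the tag name "author_tag" is hypothetical)
meta(corpus, "author_tag", type = "indexed") <- m[, 2]

meta(corpus[[1]], "author")  # author stored on the first document
meta(corpus)                 # data frame of indexed metadata
```

If the author is only needed for filtering or grouping documents, the indexed form may be enough on its own, since it avoids touching each document.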

Source

2014-01-10 06:33:33 agstudy

Have you already tried the excellent readTabular?

## your sample data 
matrix <- cbind(c("01","02","03","04","05","06"), 
     c("Author1","Author2","Author2","Author3","Author3","Author4"), 
     c("Text1","Text2","Text3","Text4","Text5","Text6")) 

## simple transformations 
matrix <- as.data.frame(matrix) 
names(matrix) <- c("id", "author", "content")

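Your former matrix, now a data.frame, can easily be read in as a corpus using readTabular. readTabular wants you to define a reader, which itself takes a mapping. In your mapping, "content" points to the text data, and the other names to, well, meta.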

## define myReader, which will be used in creation of Corpus 
myReader <- readTabular(mapping=list(id="id", author="author", content="content"))

Now the corpus is created the same way as always, apart from small changes:

## create the corpus 
tm_corpus <- DataframeSource(matrix) 
tm_corpus <- Corpus(tm_corpus, 
    readerControl = list(reader=myReader))

Now have a look at the content and metadata of the first items:

lapply(tm_corpus, as.character) 
lapply(tm_corpus, meta) 
## output just as expected.

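This should be fast, as it is part of the package, and it is extremely adaptable. In my own project, I am using this on a data.table with about 20 variables, and it works like a charm.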

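However, I cannot provide a benchmark against the answer you have already accepted. I am only guessing that this is faster and more efficient.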
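A note for newer tm versions: as far as I am aware, from tm 0.7 onward readTabular is no longer provided, and DataframeSource instead expects a data frame whose first two columns are named doc_id and text, with any further columns kept as document-level metadata. A minimal sketch under that assumption (the names docs and corpus are illustrative):

```r
library(tm)

# Same sample data, arranged the way newer DataframeSource expects it:
# doc_id and text first, every further column becomes metadata
docs <- data.frame(
  doc_id = c("01", "02", "03", "04", "05", "06"),
  text   = c("Text1", "Text2", "Text3", "Text4", "Text5", "Text6"),
  author = c("Author1", "Author2", "Author2", "Author3", "Author3", "Author4"),
  stringsAsFactors = FALSE
)

# No custom reader or mapping is needed in this form
corpus <- Corpus(DataframeSource(docs))

meta(corpus)  # data frame holding the author column
```

The metadata ends up in the corpus-level data frame rather than on each document, so it is read back with meta(corpus) instead of lapply(corpus, meta).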

Source

2015-09-07 20:37:22

For 'lapply(tm_corpus, as.character)' I get "corpus". Is that what it should show? I now see everything in 'lapply(tm_corpus, meta)' –

The first command shows the text (only the text), the second the "meta" information, at least in my setup. Do you get different results? –

Yes, it puts everything into the metadata –
