Combining a list of tm corpora

I have a list of URLs whose web content I fetch and load into tm corpora:

library(tm) 
library(XML) 

link <- c(
"http://www.r-statistics.com/tag/hadley-wickham/",              
"http://had.co.nz/",                      
"http://vita.had.co.nz/articles.html",                 
"http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",       
"http://www.analyticstory.com/hadley-wickham/" 
)    

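# Build one corpus per URL: each <p> element of the page becomes a document.
create.corpus <- function(url.name){ 
    doc=htmlParse(url.name)                  # download and parse the HTML page
    parag=xpathSApply(doc,'//p',xmlValue)    # text of every paragraph node
    if (length(parag)==0){ 
        parag="empty"                        # placeholder so the corpus is never empty
    } 
    cc=Corpus(VectorSource(parag)) 
    meta(cc,"link")=url.name                 # tag the corpus with its source URL
    return(cc) 
} 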

link=catch$url 
cc <- lapply(link, create.corpus)

This gives me a "big list" of corpora, one per URL. Combining them one by one works:

x=cc[[1]] 
y=cc[[2]] 
z=c(x,y,recursive=T) # preserved metadata 
x;y;z 
# A corpus with 8 text documents 
# A corpus with 2 text documents 
# A corpus with 10 text documents

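But this becomes infeasible for a list of a few thousand corpora. So how can a list of corpora be merged into a single corpus while keeping the metadata?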

Source

2014-01-07 Henk

You can use do.call to call c:

do.call(function(...) c(..., recursive = TRUE), cc) 
# A corpus with 155 text documents
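To see why this works, here is a minimal base-R sketch in which plain lists stand in for the corpora. The names `parts` and `combine` are illustrative and do not come from the question. do.call passes each element of a list as a separate argument, so the wrapper receives them through `...` and forwards them to c together with the fixed recursive = TRUE.

```r
# Plain lists stand in for the per-URL corpora (illustrative data only).
parts <- list(
  list(a = 1, b = 2),
  list(c = 3),
  list(d = 4, e = 5)
)

# Wrapper that forwards all of its arguments to c() and fixes recursive = TRUE.
combine <- function(...) c(..., recursive = TRUE)

# do.call() spreads the elements of `parts` into separate arguments, so this
# is equivalent to combine(parts[[1]], parts[[2]], parts[[3]]).
merged <- do.call(combine, parts)

# Check that the spread call and the hand-written call agree.
print(identical(merged, combine(parts[[1]], parts[[2]], parts[[3]])))
print(merged)
```

The wrapper is only needed to fix the extra recursive argument. The same call should also be expressible by appending a named element to the argument list, as in do.call(c, c(parts, list(recursive = TRUE))).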

Source

2014-01-07 12:13:41

Works! Never realized you could use (...) this way. – Henk

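I don't think tm provides any built-in function for joining/merging many corpora. But a corpus is, after all, a list of documents, so the question is how to turn a list of lists into a single list. I would create a new corpus from all the documents and then assign the meta manually: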

y = Corpus(VectorSource(unlist(cc))) 
meta(y,'link') = do.call(rbind,lapply(cc,meta))$link
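The "list of lists into a list" idea can be sketched in base R without tm. The example below is illustrative only: `pages`, the example.org URLs and the paragraph strings are made up, and each page is a plain list rather than a corpus. It flattens the paragraphs into one vector and repeats each page's link once per paragraph, so texts and links stay aligned.

```r
# Hypothetical stand-in for the per-URL corpora: one link, several paragraphs.
pages <- list(
  list(link = "http://example.org/a", text = c("p1", "p2", "p3")),
  list(link = "http://example.org/b", text = c("p4"))
)

# Flatten all paragraphs into a single character vector.
texts <- unlist(lapply(pages, `[[`, "text"))

# Repeat each link once per paragraph of its page.
n.par <- lengths(lapply(pages, `[[`, "text"))
links <- rep(vapply(pages, `[[`, character(1), "link"), times = n.par)

# One row per document, with its source link alongside.
flat <- data.frame(link = links, text = texts, stringsAsFactors = FALSE)
print(flat)
```

With tm loaded, the two columns could then presumably be passed to Corpus(VectorSource(flat$text)) and to the meta assignment in the same way as in the answer above.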

Source

2014-01-07 12:33:19 agstudy

Your code doesn't work because catch is not defined, so I don't know exactly what it is supposed to do.

But tm corpora can now simply be put into a vector to make one big corpus: https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/tm_combine

So maybe c(unlist(cc)) would work. I have no way to test whether it does, because your code doesn't run.
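As an untested-against-the-question sketch of that idea, the following assumes tm 0.7 or later, where the tm_combine page linked above documents a c() method for VCorpus objects. The toy texts, the example.org links and the name `big` are made up for illustration, and whether document-level metadata survives the merge may depend on the tm version, so it is worth inspecting meta(big) rather than assuming it.

```r
library(tm)

# Two tiny in-memory corpora standing in for the per-URL corpora.
a <- VCorpus(VectorSource(c("first text", "second text")))
b <- VCorpus(VectorSource("third text"))

# Tag every document with the (made-up) link it came from.
meta(a, "link") <- "http://example.org/a"
meta(b, "link") <- "http://example.org/b"

cc <- list(a, b)

# Spread the list over c(); this relies on the VCorpus method for c().
big <- do.call(c, cc)

print(length(big))
print(meta(big))   # check here whether the link column was carried over
```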

Source

2017-11-17 21:23:41 wordsforthewise
