2015-07-13

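Matrix control in text mining

I am working with a large dataset of conference papers, and I plan to perform text mining and topic modelling on it. The dataset contains 7 columns of information (including the abstract) for 35,451 papers.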

names(compen) 
[1] "Year.the.Paper.was.Presented" "Paper.Title"     
[3] "Paper.Abstract"    "Author.Name"     
[5] "Author.s.Organization"  "Reviewing.Committee.s.Code" 
[7] "Reviewing.Committee.s.Name" 
dim(compen) 
[1] 35451  7 

Here is my text mining code, which works fine.

library(tm) 
mydata.corpus <- Corpus(VectorSource(compen$Paper.Abstract)) 
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower)) # wrapper needed in tm >= 0.6
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE) 
my_stopwords <- c(stopwords('german'),"the", "due", "are", "not", "for", "this", "and", "that", "there", "beyond", "time", "from", "been", "both", "than", "has","now", "until", "all", "use", "two", "based", "between", "can", "different", "each", "have", "however", "its", "level", "more", "most","new", "number","one","other", "paper", "pavement", "such", "their", "these", "used", "using", "were", "when", "which", "with") 
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords) 
mydata.corpus <- tm_map(mydata.corpus, removeNumbers) 
mydata.dtm <- TermDocumentMatrix(mydata.corpus) 
mydata.dtm 
dim(mydata.dtm) 
findFreqTerms(mydata.dtm, lowfreq=5000) 

The problem starts here.

term.freq <- rowSums(as.matrix(mydata.dtm)) 
Error: cannot allocate vector of size 7.7 Gb 
In addition: Warning messages: 
1: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 
2: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 
3: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 
4: In vector(typeof(x$v), nr * nc) : 
    Reached total allocation of 8139Mb: see help(memory.size) 

This definitely looks like a memory problem. Is there a way to control the size of the matrix so that the memory problem does not arise?


Are you running 32-bit or 64-bit R? Use 'Sys.getenv("R_ARCH")' to find out. – Borealis


@Borealis Your code returns an empty string for me. – SabDeM


The cross-platform version is '.Machine$sizeof.pointer'. An output value of 8 means you are running 64-bit. – Borealis

Answers


This is not a lot of data, but it sounds like the way it is being loaded runs out of memory on an 8 GB system. Try this instead:

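require(quanteda) 
# quanteda API as of 2015; argument names differ in later releases
# keep every column except the abstract as document-level metadata
mydata.corpus <- corpus(compen$Paper.Abstract, 
         docvars = compen[-which(names(compen)=="Paper.Abstract")]) 
# build the sparse document-feature matrix, dropping the custom stopwords
mydata.dtm <- dfm(mydata.corpus, ignoredFeatures = my_stopwords) 
mydata.dtm 
topfeatures(mydata.dtm, 5000) 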

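It does not currently preserve intra-word hyphens, but we will most likely add that as an option soon. If you would like to use quanteda for your problem, I would be happy to help further. It works with document-level metadata ("docvars"), and a "dfm" can be passed directly to all the major topic modelling packages; see help(convert, package = "quanteda").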
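
Separately from switching packages, the allocation error in the question comes from as.matrix(), which expands the sparse term-document matrix into a dense one before rowSums() runs. The sketch below is an illustration rather than a tested fix for this dataset. It builds a small made-up matrix in the sparse triplet format that tm uses (from the slam package) and sums its rows without densifying. The toy terms and counts, and the helper dense_gb(), are hypothetical and not part of the original post.

```r
library(slam)  # installed as a dependency of tm

# Toy term-document matrix in the sparse triplet format used by tm:
# 3 terms (rows) x 4 documents (columns); only non-zero counts are stored.
toy.tdm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L),   # term index of each non-zero cell
  j = c(1L, 3L, 2L, 1L, 4L),   # document index of each non-zero cell
  v = c(2, 1, 5, 1, 3),        # count stored in that cell
  nrow = 3L, ncol = 4L,
  dimnames = list(Terms = c("asphalt", "bridge", "traffic"),
                  Docs = as.character(1:4))
)

# Sum each row directly on the sparse representation;
# the dense 3 x 4 matrix is never created.
term.freq <- row_sums(toy.tdm)

# Hypothetical helper: rough size in GiB of the dense copy that
# as.matrix() would have to allocate, assuming 8-byte numeric cells.
dense_gb <- function(n_terms, n_docs, bytes_per_cell = 8) {
  n_terms * n_docs * bytes_per_cell / 2^30
}
```

On the object from the question, the equivalent call would be term.freq <- slam::row_sums(mydata.dtm). The 7.7 Gb in the error message would be consistent with roughly 29,000 distinct terms across the 35,451 abstracts, assuming 8-byte cells; the actual term count is not shown in the question, so treat that figure as an estimate. If the matrix still needs shrinking, tm's removeSparseTerms() drops terms that appear in very few documents.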