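Matrix control in text mining
I am working with a large dataset of conference papers, and I plan to run text mining and topic modeling on it. The dataset holds 7 columns of information (including the abstract) for 35,451 papers.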
names(compen)
[1] "Year.the.Paper.was.Presented" "Paper.Title"
[3] "Paper.Abstract" "Author.Name"
[5] "Author.s.Organization" "Reviewing.Committee.s.Code"
[7] "Reviewing.Committee.s.Name"
dim(compen)
[1] 35451 7
Here is my text mining code, which works without problems.
library(tm)
mydata.corpus <- Corpus(VectorSource(compen$Paper.Abstract))
mydata.corpus <- tm_map(mydata.corpus, tolower)
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE)
my_stopwords <- c(stopwords('german'),"the", "due", "are", "not", "for", "this", "and", "that", "there", "beyond", "time", "from", "been", "both", "than", "has","now", "until", "all", "use", "two", "based", "between", "can", "different", "each", "have", "however", "its", "level", "more", "most","new", "number","one","other", "paper", "pavement", "such", "their", "these", "used", "using", "were", "when", "which", "with")
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)
mydata.corpus <- tm_map(mydata.corpus, removeNumbers)
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm
dim(mydata.dtm)
findFreqTerms(mydata.dtm, lowfreq=5000)
The problem starts here.
term.freq <- rowSums(as.matrix(mydata.dtm))
Error: cannot allocate vector of size 7.7 Gb
In addition: Warning messages:
1: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
2: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
3: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
4: In vector(typeof(x$v), nr * nc) :
Reached total allocation of 8139Mb: see help(memory.size)
This definitely looks like a memory problem. Is there a way to control the matrix so that the memory problem does not come up?
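One possible way around the allocation, sketched under assumptions: the warning shows as.matrix() requesting a dense vector of nr * nc cells, while a TermDocumentMatrix is stored as a sparse triplet matrix from the slam package. Summing rows on the sparse object with slam::row_sums() avoids the dense copy, and removeSparseTerms() can shrink the matrix before any conversion. The toy abstracts below are hypothetical stand-ins for compen$Paper.Abstract, and the 0.5 sparsity threshold is only illustrative.

```r
library(tm)
library(slam)  # sparse matrix helpers used internally by tm

# Hypothetical toy data standing in for compen$Paper.Abstract
abstracts <- c("asphalt pavement cracking model",
               "bridge deck cracking inspection",
               "asphalt mixture fatigue model")

corpus <- VCorpus(VectorSource(abstracts))
tdm <- TermDocumentMatrix(corpus)

# Sum term counts on the sparse representation; no dense matrix is built
term.freq <- slam::row_sums(tdm)
print(sort(term.freq, decreasing = TRUE))

# Optionally drop terms that are absent from more than half of the documents
small.tdm <- removeSparseTerms(tdm, 0.5)
print(dim(small.tdm))
```

On the real data the same two calls would be applied to mydata.dtm; whether the reduced matrix is then small enough for as.matrix() depends on the threshold chosen.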
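Are you running 32-bit or 64-bit R? Use 'Sys.getenv("R_ARCH")' to find out. – Borealis
@Borealis Your code produces an empty string for me. – SabDeM
A cross-platform version is: '.Machine$sizeof.pointer'. An output of 8 means you are running 64-bit. – Borealis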
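A minimal sketch of the check described in the comments, using only base R. The variable name bits is just for illustration.

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build
ptr.bytes <- .Machine$sizeof.pointer
bits <- 8 * ptr.bytes
cat("This R session is", bits, "bit\n")

# The architecture string reported by R itself
print(R.version$arch)
```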