Computing n-grams on a large corpus with R and quanteda

I am trying to build n-grams from a large corpus (about 1 GB object size in R) using the quanteda package. I don't have cloud resources available, so I am using my own laptop (Windows and/or Mac, 12 GB RAM) for the computation.
If I break the data down into small pieces, the code works and I get (partial) dfms of n-grams of various sizes, but when I try to run the code on the whole corpus, I unfortunately hit memory limits with a corpus of this size and get the following error (example code for unigrams, single words):
> dfm(corpus, verbose = TRUE, stem = TRUE,
ignoredFeatures = stopwords("english"),
removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features:
Error: cannot allocate vector of size 1024.0 Mb
In addition: Warning messages:
1: In unique.default(allFeatures) :
Reached total allocation of 11984Mb: see help(memory.size)
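The warning message points at help(memory.size), so before restructuring anything it may be worth confirming the R session can really use all 12 GB. A minimal check with the Windows-only memory helpers (they report in Mb; the size value below is only illustrative):

memory.limit()               # current allocation cap in Mb (~11984 Mb here, i.e. all 12 GB)
memory.size()                # Mb currently in use by this session
# memory.limit(size = 16000) # would raise the cap, but never beyond physical RAM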
It is even worse if I try to build n-grams with n > 1:
> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
ignoredFeatures = stopwords("english"),
removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage 19925140 is too close to the limit
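For reference, here is a sketch of the chunked workaround that does run, written against the same dfm() call as above. The chunk size of 100,000 documents is an arbitrary placeholder, and it assumes a quanteda version where corpus objects can be sliced with [ and where rbind() has a dfm method that aligns features across the partial results:

chunk_ids <- split(seq_len(ndoc(corpus)),
                   ceiling(seq_len(ndoc(corpus)) / 100000))
partial_dfms <- lapply(chunk_ids, function(ids) {
  dfm(corpus[ids], ngrams = 2, concatenator = " ",
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
})
big_dfm <- do.call(rbind, partial_dfms)  # pads missing features with zeros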
I found this related post, but it looks like that was an issue with dense matrix coercion, which was later solved, and it doesn't help in my case.
Is there a better way to handle this with a limited amount of memory, without having to break the corpus data into pieces?
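For completeness, later quanteda releases replaced these dfm() arguments with a tokens() pipeline, which keeps an intermediate tokens object instead of indexing all features in one pass. A minimal sketch of that upgrade path (an assumption about newer versions, not the 0.9.4 API shown in sessionInfo() below):

library(quanteda)
toks <- tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))  # replaces ignoredFeatures
toks <- tokens_wordstem(toks)                      # replaces stem = TRUE
ngram_dfm <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))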
[EDIT] As requested, the sessionInfo() data:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3 quanteda_0.9.4
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.1.2 assertthat_0.1 Matrix_1.2-3 rsconnect_0.4.2 DBI_0.3.1
[7] parallel_3.2.3 tools_3.2.3 Rcpp_0.12.3 stringi_1.0-1 grid_3.2.3 chron_2.3-47
[13] lattice_0.20-33 ca_0.64
What version of quanteda are you using? Can you send your sessionInfo() output? – Ken Benoit
@KenBenoit I tried with both the Mac and the Windows machine. – Federico
@KenBenoit the output was unreadable as a comment, so I edited the post above and added it there. Thanks! – Federico