2015-10-02 98 views
1

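R freezes, and so does my computer

I don't know whether my question is easy to answer, but here goes. I use R for corpus linguistics, and I want to match regular expressions using "exact.matches" (see St. Th. Gries). The problem is that when I run the script, R freezes for a long time, and my computer freezes with it, so I have to restart everything with the computer's power button.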

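What I am trying to analyze is a collection of 100 texts (in txt format). The whole set contains 17,254,537 tokens, but I have also tried running the code on only 20 files. Same problem: everything freezes. The code is as follows: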

rm(list=ls(all=T)) 

setwd("C:/Users/Christophe/Documents/Doctorat_ULg/Corpora/Dutch/Gutenberg_corpus_NL") 
source("C:/_qclwr/_scripts/_scripts_code-exerciseboxes_chapters_3-5/exact_matches_new.R") 

corpus.files.1<-choose.files() # to load the first 58 text files 
corpus.files.2<-choose.files() # to load the 42 other files 
whole.corpus.files<-c(corpus.files.1, corpus.files.2) # to concatenate everything into one vector 
all.matches.verbs<-vector()  

for(i in whole.corpus.files) { 
    current.corpus.file<-scan(i, what="char", sep="\n", quiet=T) 
    current.matches.verbs<-exact.matches("aan<prep>", current.corpus.file, case.sens=F, pcre=T) 
    if(length(current.matches.verbs)==0) { next } 
    all.matches.verbs<-append(all.matches.verbs, current.matches.verbs) 
} 

Is there a simple way to solve this? It looks like a memory problem. Here is some output, in case it helps:

> memory.size() 
[1] 35.02 
> memory.limit() 
[1] 3976 
> gc() 
      used (Mb) gc trigger (Mb) max used (Mb) 
Ncells 558406 29.9  818163 43.7 741108 39.6 
Vcells 1039743 8.0 1757946 13.5 1300290 10.0 

Thank you in advance for your valuable help.

Best,

CBechet.

+5

Classic mistake: you are growing an object in a loop. Read the second circle of the R Inferno (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf). – Roland

+0

Predefine the size of the object before entering the loop. –
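The advice in the two comments above can be sketched as follows. This is only a minimal illustration, not the asker's script: `find.matches` and `toy.corpus` are hypothetical stand-ins (base R `grep` replaces `exact.matches`, and a small list of character vectors replaces the files on disk). The point is that the result list gets one slot per file up front, each slot is filled by index, and everything is combined once at the end instead of calling `append` on every iteration.

```r
# Hypothetical stand-in for the per-file search: returns the lines
# that contain the pattern (case-insensitive, Perl-compatible regex).
find.matches <- function(lines, pattern) {
    grep(pattern, lines, ignore.case = TRUE, perl = TRUE, value = TRUE)
}

# Toy "corpus": one character vector per file (assumption for the demo).
toy.corpus <- list(
    c("hij denkt aan<prep> haar", "geen treffer hier"),
    c("niets"),
    c("Aan<prep> het werk", "zij werkt aan<prep> iets")
)

# Preallocate one slot per file instead of growing a vector in the loop.
results <- vector("list", length(toy.corpus))
for (i in seq_along(toy.corpus)) {
    results[[i]] <- find.matches(toy.corpus[[i]], "aan<prep>")
}

# Combine once, after the loop; files without matches contribute nothing.
all.matches <- unlist(results)
```

With real files, the loop would run over `seq_along(whole.corpus.files)` and read each file inside the loop body.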

+0

If I cheat and try using an external hard drive (1 TB), do you think it could work, even though it doesn't solve the problem of growing the object? – CBechet
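On the hard-drive idea: the limit reported by `memory.limit()` concerns the memory R may allocate, so extra disk space alone is unlikely to help. Where a disk can help is as a place to put results as soon as they are found, so that only one file's lines are held in memory at a time. A minimal sketch under stated assumptions: `out.file` is a hypothetical output path (a temporary file here), `toy.corpus` stands in for the real files, and base R `grep` stands in for `exact.matches`.

```r
# Hypothetical output path; in practice this could point to any drive.
out.file <- tempfile(fileext = ".txt")

# Toy "corpus": one character vector per file (assumption for the demo).
toy.corpus <- list(
    c("hij denkt aan<prep> haar", "geen treffer hier"),
    c("niets"),
    c("Aan<prep> het werk", "zij werkt aan<prep> iets")
)

# Open the output file once, write each file's hits immediately,
# and keep nothing from earlier files in memory.
con <- file(out.file, open = "w")
for (lines in toy.corpus) {
    hits <- grep("aan<prep>", lines, ignore.case = TRUE, perl = TRUE, value = TRUE)
    writeLines(hits, con)  # writes nothing when there are no hits
}
close(con)

# The matches can be read back later, in a fresh session if needed.
saved <- readLines(out.file)
```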

Answer

0

There is an alternative to the for loop:

rm(list=ls(all=T)) 

setwd("C:/Users/Christophe/Documents/Doctorat_ULg/Corpora/Dutch/Gutenberg_corpus_NL") 
source("C:/_qclwr/_scripts/_scripts_code-exerciseboxes_chapters_3-5/exact_matches_new.R") 

corpus.files.1<-choose.files() # loads the first set of corpus files 
corpus.files.2<-choose.files() # loads the second set of corpus files 
whole.corpus.file<-c(corpus.files.1, corpus.files.2) # concatenate all the corpus files into one vector 

whole.text <-unlist(lapply(whole.corpus.file, function(x) scan(x, what="char", sep="\n", quiet=T))) # reads the content of the files in the vector 

And the data is still too large (even though I am not using a for loop):

Error: cannot allocate vector of size 4.3 Mb 
In addition: Warning messages: 
1: In substr(lines, if (characters.around != 0) starts - characters.around else 1, : 
    Reached total allocation of 3976Mb: see help(memory.size) 
2: In substr(lines, if (characters.around != 0) starts - characters.around else 1, : 
    Reached total allocation of 3976Mb: see help(memory.size) 
3: In substr(lines, if (characters.around != 0) starts - characters.around else 1, : 
    Reached total allocation of 3976Mb: see help(memory.size) 
4: In substr(lines, if (characters.around != 0) starts - characters.around else 1, : 
    Reached total allocation of 3976Mb: see help(memory.size)
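
The warnings point at `substr` being applied to every line at once, which suggests that the whole corpus, plus the context extracted around each match, has to fit in memory together. One possible way around this, sketched here under assumptions, is to search file by file and keep only the matched strings. The sketch swaps `exact.matches` for base R `gregexpr` and `regmatches`, so it returns the bare matches without surrounding context; `matches.in.file` is a hypothetical helper, and the two small files are created only for the demonstration.

```r
# Hypothetical helper: read one file and return only the matched strings.
# The file's lines go out of scope as soon as the function returns.
matches.in.file <- function(path, pattern) {
    lines <- scan(path, what = "char", sep = "\n", quiet = TRUE)
    hits <- gregexpr(pattern, lines, perl = TRUE, ignore.case = TRUE)
    unlist(regmatches(lines, hits))
}

# Two small demo files in a temporary directory (assumption for the demo).
toy.dir <- tempfile("toy_corpus")
dir.create(toy.dir)
writeLines(c("hij denkt aan<prep> haar", "geen treffer"),
           file.path(toy.dir, "a.txt"))
writeLines("Aan<prep> het werk, aan<prep> de slag",
           file.path(toy.dir, "b.txt"))
toy.files <- list.files(toy.dir, full.names = TRUE)

# One file at a time; only the matches are kept and combined at the end.
all.matches <- unlist(lapply(toy.files, matches.in.file, pattern = "aan<prep>"))
```

With the real corpus, `toy.files` would be replaced by the vector of paths returned by `choose.files()`.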