I am reading and parsing a plain text file line by line, breaking each line into sentences, splitting each sentence into words, and storing them in per-sentence and per-document Lists. Why do I get this "GC overhead limit exceeded" error?
The input file has 5 million lines, so I set the initial capacity of the outer ArrayList to 5005000. My IntelliJ VM options are below:
# custom IntelliJ IDEA VM options
-Xms128m
-Xmx8192m
-XX:ReservedCodeCacheSize=240m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My machine has 15G of RAM. After 4,500,000 lines have been read (as shown by the progress prints), it becomes extremely slow, and after a few minutes I get:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Each line (parsed into one document) is short-lived, so my 15G of memory should be enough to hold far more. The text file itself is only 800MB. When I watch Performance Monitor on Windows 10, it shows only about 55% of memory in use, meaning plenty of memory was still available when the program died.
Note that in the code below I call sentence.toCharArray(), because the text is not English, so my implementation basically treats each character as a word.
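To make that concrete, here is a minimal illustration (the sample line and the "。" delimiter are made-up values, not my real data) of the nested structure a single line produces:
// Illustration only: one line, pretending segment() split it at "。".
// Every single character ends up as its own one-character String,
// wrapped in a per-sentence list inside a per-document list.
// Needs java.util.Arrays, java.util.ArrayList, java.util.List.
List<List<String>> doc = new ArrayList<>();
for (String sentence : Arrays.asList("你好", "。", "世界")) {
    List<String> chars = new ArrayList<>();
    for (char c : sentence.toCharArray()) {
        chars.add(Character.toString(c));
    }
    doc.add(chars);
}
System.out.println(doc); // prints [[你, 好], [。], [世, 界]]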
With only 5 million lines, why does it die?
// Requires java.io.BufferedReader, java.io.IOException,
// java.util.ArrayList, java.util.List.
List<List<List<String>>> allWords = new ArrayList<>(5005000);
System.out.println("Load text from file: ");
// try-with-resources closes the reader exactly once, even on exceptions
try (BufferedReader br = Utils.fileReader(filePath)) {
    String line;
    int lineNo = 0;
    while ((line = br.readLine()) != null) {
        List<List<String>> wordsPerDoc = new ArrayList<>();
        for (String sentence : segment(line)) {
            List<String> wordsPerSentence = new ArrayList<>();
            // non-English text: every character is treated as one word
            for (char c : sentence.toCharArray()) {
                wordsPerSentence.add(Character.toString(c));
            }
            wordsPerDoc.add(wordsPerSentence);
        }
        allWords.add(wordsPerDoc);
        lineNo++;
        if (lineNo % 500000 == 0) {
            System.out.println(lineNo);
        }
    }
    System.out.println("Loaded text from file. ");
} catch (IOException e) {
    e.printStackTrace();
}
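To check how much heap the structure actually consumes while loading, the progress print inside the loop can be extended with standard java.lang.Runtime calls (a sketch, not part of my original code):
if (lineNo % 500000 == 0) {
    // used heap = total allocated minus currently free; this includes
    // garbage not yet collected, so it is only an approximation
    Runtime rt = Runtime.getRuntime();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    System.out.println(lineNo + " lines read, ~" + usedMb + " MB heap in use");
}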
public List<String> segment(final String line) {
    List<String> sentences = new ArrayList<>();
    // returnDelims = true: delimiter characters are returned as tokens too
    StringTokenizer tokenizer = new StringTokenizer(line, OtherConstants.BASIC_TOKENIZATION_DELIMITER, true);
    while (tokenizer.hasMoreTokens()) {
        sentences.add(tokenizer.nextToken());
    }
    return sentences;
}
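For reference, this is how segment behaves on a sample line, assuming (for illustration only) that OtherConstants.BASIC_TOKENIZATION_DELIMITER contains the ideographic full stop "。":
// Because returnDelims is true, the delimiter itself comes back as a token:
List<String> parts = segment("第一句。第二句。");
System.out.println(parts); // prints [第一句, 。, 第二句, 。]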