java.lang.OutOfMemoryError when running a Hadoop job
I have an input file (~31 GB) containing consumer reviews of some products, which I am trying to lemmatize in order to count the corresponding lemma frequencies. The approach is broadly similar to the WordCount example that ships with Hadoop. I have four classes doing the processing: StanfordLemmatizer [contains the lemmatization goodies from Stanford's coreNLP package v3.3.0], WordCount [driver], WordCountMapper [mapper], and WordCountReducer [reducer].
I have tested the program on a subset of the original dataset (a few MB in size) and it runs fine. Unfortunately, when I run the job on the full ~31 GB dataset, the job fails. I checked the job's logs and they contain this:
java.lang.OutOfMemoryError: Java heap space at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:109) [...]
Any suggestions on how to handle this?
Note: I am using the Yahoo VM that comes pre-configured with hadoop-0.18.0. I have also tried the solution of allocating more heap, as mentioned in this thread: out of Memory Error in Hadoop
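For reference, this is a minimal sketch of what that "allocate more heap" attempt looks like in a driver using the old org.apache.hadoop.mapred API. The property name mapred.child.java.opts is the usual knob for this Hadoop generation; the class name HeapConfigExample and the 2048m value are only illustrative assumptions, not my actual driver code:

import org.apache.hadoop.mapred.JobConf;

public class HeapConfigExample {
    // Illustrative only: shows where the per-task child JVM heap is raised in the driver.
    public static JobConf buildConf() {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("lemma-count");
        // Each map/reduce task runs in its own child JVM; -Xmx here is what the
        // "allocate more heap" suggestion changes. 2048m is an example value and
        // may exceed what the Yahoo VM can actually provide.
        conf.set("mapred.child.java.opts", "-Xmx2048m");
        return conf;
    }
}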
WordCountMapper code:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    private final StanfordLemmatizer slem = new StanfordLemmatizer();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        if (line.matches("^review/(summary|text).*")) { // if the current line represents a summary/text of a review, process it!
            // Strip the "review/summary:" or "review/text:" prefix, lower-case, then lemmatize.
            for (String lemma : slem.lemmatize(line.replaceAll("^review/(summary|text):.", "").toLowerCase())) {
                word.set(lemma);
                output.collect(word, one);
            }
        }
    }
}
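The StanfordLemmatizer class itself is not shown above. For context, here is a minimal sketch of what such a wrapper around coreNLP v3.3.0 typically looks like; the class name matches the one used in the mapper, but the annotator choices and method body are assumptions rather than my original code:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    private final StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        Properties props = new Properties();
        // tokenize/ssplit/pos/lemma are the minimum annotators needed for lemmas;
        // the pos tagger is where ExactBestSequenceFinder (seen in the stack trace) is invoked.
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText) {
        List<String> lemmas = new ArrayList<String>();
        Annotation document = new Annotation(documentText);
        pipeline.annotate(document);
        // Collect the lemma of every token in every sentence of the input text.
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }
}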
Thank you Professor Manning for the detailed explanation and suggestions. Will try them out and see if I can manage some workaround :) – Aditya