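Calling the StanfordCoreNLP API from a MapReduce job

I want to process a large number of documents with MapReduce. The idea is to split the files into documents in the mapper and apply the Stanford CoreNLP annotators in the reducer phase. I have a fairly simple (standard) pipeline of "tokenize, ssplit, pos, lemma, ner", and the reducer just calls a function that applies these annotators to the values passed to the reducer and returns the annotations (as a list of strings). However, the output it generates is garbage.

I have observed that the job returns the expected output if I call the annotation function from within the mapper, but that defeats the whole point of parallelism. The job also returns the expected output when I ignore the values received in the reducer and apply the annotators to a dummy string instead.

This probably indicates a thread-safety issue somewhere in the process, but I cannot figure out where: my annotation function is synchronized and the pipeline is private.

Can someone provide some pointers on how to solve this?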

-Angshu

Edit:

This is what my reducer looks like; I hope this adds more clarity.

public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> { 
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { 
     while (values.hasNext()) { 
      output.collect(key, new Text(se.getExtracts(values.next().toString()).toString()));    
     } 
    } 
}

This is the code for getExtracts:

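import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.time.TimeAnnotations;
import edu.stanford.nlp.time.Timex;
import edu.stanford.nlp.util.CoreMap;

// Members of the enclosing class (its declaration was not part of the post).
private final StanfordCoreNLP pipeline;

public instantiatePipeline() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    // The posted snippet never assigned the final field; building the
    // pipeline from props here is the assumed intent.
    pipeline = new StanfordCoreNLP(props);
}

// Returns one "word,pos,ner,lemma,timexId,sentenceId" string per token.
synchronized List<String> getExtracts(String l) {
    Annotation document = new Annotation(l);
    List<String> ret = new ArrayList<String>();

    pipeline.annotate(document);

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    int sid = 0; // 1-based sentence index
    for (CoreMap sentence : sentences) {
        sid++;
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            String word = token.get(TextAnnotation.class);
            String pos = token.get(PartOfSpeechAnnotation.class);
            String ner = token.get(NamedEntityTagAnnotation.class);
            String lemma = token.get(LemmaAnnotation.class);

            Timex timex = token.get(TimeAnnotations.TimexAnnotation.class);

            String ex = word + "," + pos + "," + ner + "," + lemma;
            // Keep the column count fixed: empty timex field when absent.
            ex = ex + "," + (timex != null ? timex.tid() : "");
            ex = ex + "," + sid;
            ret.add(ex);
        }
    }
    // The posted snippet was cut off here; the return and closing brace are added.
    return ret;
}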

Source

2014-06-28 Angshu

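I think you need to provide more details about your implementation (such as your code). – Daniel

I have added my code; any pointers would be appreciated. – Angshu

Based on the implementation, I think that if there is a problem, it should be in the first piece of code (the MapReduce code). Can you call a simple function instead of the Stanford annotation? – Daniel

I solved the problem. The actual issue was the text encoding of the file I was reading from (converting it to Text caused further corruption, I guess), which made the tokenization spew garbage. I am now cleaning the input strings and enforcing strict UTF-8 encoding, and it works fine.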
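
For readers hitting the same symptom, here is a minimal sketch of what "cleaning the input and enforcing strict UTF-8" could look like. This is not the code from the original job, which was never posted: the class name Utf8Cleaner and both method names are hypothetical, and the choice of which characters to drop is an assumption. It only uses the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop or CoreNLP.
public class Utf8Cleaner {

    // Decodes raw bytes as UTF-8 and throws on malformed input
    // instead of silently substituting replacement characters.
    public static String decodeStrict(byte[] raw) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(raw)).toString();
    }

    // Drops U+FFFD replacement characters and control characters
    // other than tab, newline and carriage return (an assumed policy).
    public static String clean(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            boolean whitespaceControl = cp == '\t' || cp == '\n' || cp == '\r';
            boolean keep = whitespaceControl
                    || (!Character.isISOControl(cp) && cp != 0xFFFD);
            if (keep) {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }
}
```

One possible use is to call decodeStrict on the raw record bytes in the mapper, so that files in the wrong encoding fail loudly, and then pass the result through clean before handing the string to the annotator.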

Source

2014-06-30 06:46:17 Angshu

Select it as the answer! :) – Daniel
