2016-10-05

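BioNLP Stanford - tokenization

I am trying to tokenize biomedical text, so I decided to use http://nlp.stanford.edu/software/eventparser.shtml. I used the standalone program RunBioNLPTokenizer, which does what I want.

Now I want to write my own program using the Stanford libraries, so I read the code of RunBioNLPTokenizer, shown below.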

package edu.stanford.nlp.ie.machinereading.domains.bionlp; 

import java.io.File; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.io.PrintStream; 
import java.util.Collection; 
import java.util.List; 
import java.util.Properties; 

import edu.stanford.nlp.ie.machinereading.GenericDataSetReader; 
import edu.stanford.nlp.ie.machinereading.msteventextractor.DataSet; 
import edu.stanford.nlp.ie.machinereading.msteventextractor.EpigeneticsDataSet; 
import edu.stanford.nlp.ie.machinereading.msteventextractor.GENIA11DataSet; 
import edu.stanford.nlp.ie.machinereading.msteventextractor.InfectiousDiseasesDataSet; 
import edu.stanford.nlp.io.IOUtils; 
import edu.stanford.nlp.ling.CoreLabel; 
import edu.stanford.nlp.util.StringUtils; 

/** 
* Standalone program to run our BioNLP tokenizer and save its output 
*/ 
public class RunBioNLPTokenizer extends GenericDataSetReader { 

    public static void main(String[] args) throws IOException {
        Properties props = StringUtils.argsToProperties(args);
        String basePath = props.getProperty("base.directory", "/u/nlp/data/bioNLP/2011/originals/");

        DataSet dataset = new GENIA11DataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "genia/training");
        runTokenizerForDirectory(dataset, basePath + "genia/development");
        runTokenizerForDirectory(dataset, basePath + "genia/testing");

        dataset = new EpigeneticsDataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "epi/training");
        runTokenizerForDirectory(dataset, basePath + "epi/development");
        runTokenizerForDirectory(dataset, basePath + "epi/testing");

        dataset = new InfectiousDiseasesDataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "infect/training");
        runTokenizerForDirectory(dataset, basePath + "infect/development");
        runTokenizerForDirectory(dataset, basePath + "infect/testing");
    }

    private static void runTokenizerForDirectory(DataSet dataset, String path) throws IOException {
        System.out.println("Input directory: " + path);
        BioNLPFormatReader reader = new BioNLPFormatReader();
        for (File rawFile : reader.getRawFiles(path)) {
            System.out.println("Input filename: " + rawFile.getName());
            String rawText = IOUtils.slurpFile(rawFile);

            String docId = rawFile.getName().replace("." + BioNLPFormatReader.TEXT_EXTENSION, "");
            String parentPath = rawFile.getParent();

            runTokenizer(dataset.getFilesystemInformation().getTokenizedFilename(parentPath, docId), rawText);
        }
    }

    private static void runTokenizer(String tokenizedFilename, String text) {
        System.out.println("Tokenized filename: " + tokenizedFilename);
        Collection<String> sentences = BioNLPFormatReader.splitSentences(text);

        PrintStream os = null;
        try {
            os = new PrintStream(new FileOutputStream(tokenizedFilename));
        } catch (IOException e) {
            System.err.println("ERROR: cannot save online tokenization to " + tokenizedFilename);
            e.printStackTrace();
            System.exit(1);
        }

        for (String sentence : sentences) {
            BioNLPFormatReader.BioNLPTokenizer tokenizer = new BioNLPFormatReader.BioNLPTokenizer(sentence);
            List<CoreLabel> tokens = tokenizer.tokenize();
            for (CoreLabel l : tokens) {
                os.print(l.word() + " ");
            }
            os.println();
        }
        os.close();
    }
} 

I wrote the code below. I managed to split the text into sentences, but I cannot use BioNLPTokenizer the way it is used in RunBioNLPTokenizer.

public static void main(String[] args) throws Exception {
    // TODO code application logic here
    Collection<String> c = BioNLPFormatReader.splitSentences("..");
    for (String sentence : c) {
        System.out.println(sentence);
        // This is the line that fails: BioNLPTokenizer is a protected nested class.
        BioNLPFormatReader.BioNLPTokenizer x = new BioNLPFormatReader.BioNLPTokenizer(sentence);
    }
}

I get this error:

Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader.BioNLPTokenizer has protected access in edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader

My question is: how can I tokenize biomedical sentences with the Stanford libraries without using RunBioNLPTokenizer?

Answers

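Unfortunately, we made BioNLPTokenizer a protected inner class, so you will need to edit the source code and change its access to public.

Note that BioNLPTokenizer may not be the most general-purpose biomedical sentence tokenizer, so I would check its output to make sure it is reasonable. We developed it heavily against the BioNLP 2009/2011 shared tasks.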
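
One cheap way to check tokenizer output, sketched here as an illustration only: test whether the tokens, joined back together, reproduce the sentence with its whitespace removed. The class and method names below (TokenizationSanityCheck, preservesCharacters) are made up for this example and are not part of any Stanford library. A tokenizer that normalizes characters, for instance by rewriting brackets as -LRB- and -RRB-, will fail this check without necessarily being wrong, so treat a failure as a prompt to look at the sentence, not as proof of a bug.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Stanford libraries.
class TokenizationSanityCheck {

    // Returns true if the tokens, concatenated, equal the sentence with all
    // whitespace removed, i.e. the tokenizer neither dropped nor rewrote characters.
    static boolean preservesCharacters(String sentence, List<String> tokens) {
        String expected = sentence.replaceAll("\\s+", "");
        String actual = String.join("", tokens).replaceAll("\\s+", "");
        return expected.equals(actual);
    }

    public static void main(String[] args) {
        // Example tokens typed by hand; in practice they would come from the tokenizer.
        String sentence = "IL-2 activates NF-kappaB.";
        List<String> tokens = Arrays.asList("IL-2", "activates", "NF-kappaB", ".");
        System.out.println("characters preserved: " + preservesCharacters(sentence, tokens));
    }
}
```

Sentences that fail the check are the ones worth reading by eye first.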

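Thank you for your answer. I solved the problem (I think). I made my class extend BioNLPFormatReader, and that works for me. I have read that this is a beta version. Is there a tokenizer for biomedical text in the library?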
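
The subclassing workaround mentioned in this comment can be sketched roughly as follows. To keep the example self-contained, FormatReaderStandIn and SimpleTokenizer are hypothetical stand-ins for BioNLPFormatReader and its protected nested BioNLPTokenizer, and the whitespace splitting is a placeholder, not the real tokenization logic. The sketch assumes the nested class has an accessible (public) constructor. Note that in a single file every class shares one package, so the protected modifier would not block access here anyway; the subclass only becomes necessary when the library class lives in a different package, as in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Workaround: extend the reader so the protected nested class is in scope,
// then expose a public wrapper method for the rest of the program.
class SubclassTokenizerDemo extends FormatReaderStandIn {

    public static List<String> tokenize(String sentence) {
        SimpleTokenizer tokenizer = new SimpleTokenizer(sentence);
        return tokenizer.tokenize();
    }

    public static void main(String[] args) {
        for (String token : tokenize("IL-2 activates NF-kappaB .")) {
            System.out.println(token);
        }
    }
}

// Hypothetical stand-in for a library class such as BioNLPFormatReader.
class FormatReaderStandIn {

    // Protected nested class: reachable from subclasses and from the same
    // package, but not from unrelated classes in other packages.
    protected static class SimpleTokenizer {
        private final String text;

        public SimpleTokenizer(String text) {
            this.text = text;
        }

        // Placeholder logic: split on runs of whitespace and skip empty pieces.
        public List<String> tokenize() {
            List<String> tokens = new ArrayList<>();
            for (String piece : text.trim().split("\\s+")) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
            return tokens;
        }
    }
}
```

Compared with editing the library source to make the class public, this leaves the library jar untouched, at the cost of tying your class to the reader's inheritance hierarchy.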

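Glad to hear you found a workaround. I would probably say "mostly unmaintained" rather than "beta", since Mihai and I are no longer at Stanford :) Which library are you referring to? – dmcc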

Well, I meant the Stanford CoreNLP library. But if you know anything about tokenization for biomedical text, I would appreciate it. Thanks in advance :)