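BioNLP Stanford - tokenization
I am trying to tokenize biomedical text, so I decided to use http://nlp.stanford.edu/software/eventparser.shtml. I used the standalone program RunBioNLPTokenizer, and it does what I want.
Now I want to write my own program that uses the Stanford libraries, so I read the code of RunBioNLPTokenizer, shown below.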
package edu.stanford.nlp.ie.machinereading.domains.bionlp;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.GenericDataSetReader;
import edu.stanford.nlp.ie.machinereading.msteventextractor.DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.EpigeneticsDataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.GENIA11DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.InfectiousDiseasesDataSet;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

/**
 * Standalone program to run our BioNLP tokenizer and save its output
 */
public class RunBioNLPTokenizer extends GenericDataSetReader {
    public static void main(String[] args) throws IOException {
        Properties props = StringUtils.argsToProperties(args);
        String basePath = props.getProperty("base.directory", "/u/nlp/data/bioNLP/2011/originals/");

        DataSet dataset = new GENIA11DataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "genia/training");
        runTokenizerForDirectory(dataset, basePath + "genia/development");
        runTokenizerForDirectory(dataset, basePath + "genia/testing");

        dataset = new EpigeneticsDataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "epi/training");
        runTokenizerForDirectory(dataset, basePath + "epi/development");
        runTokenizerForDirectory(dataset, basePath + "epi/testing");

        dataset = new InfectiousDiseasesDataSet();
        dataset.getFilesystemInformation().setTokenizer("stanford");
        runTokenizerForDirectory(dataset, basePath + "infect/training");
        runTokenizerForDirectory(dataset, basePath + "infect/development");
        runTokenizerForDirectory(dataset, basePath + "infect/testing");
    }

    private static void runTokenizerForDirectory(DataSet dataset, String path) throws IOException {
        System.out.println("Input directory: " + path);
        BioNLPFormatReader reader = new BioNLPFormatReader();
        for (File rawFile : reader.getRawFiles(path)) {
            System.out.println("Input filename: " + rawFile.getName());
            String rawText = IOUtils.slurpFile(rawFile);
            String docId = rawFile.getName().replace("." + BioNLPFormatReader.TEXT_EXTENSION, "");
            String parentPath = rawFile.getParent();
            runTokenizer(dataset.getFilesystemInformation().getTokenizedFilename(parentPath, docId), rawText);
        }
    }

    private static void runTokenizer(String tokenizedFilename, String text) {
        System.out.println("Tokenized filename: " + tokenizedFilename);
        Collection<String> sentences = BioNLPFormatReader.splitSentences(text);
        PrintStream os = null;
        try {
            os = new PrintStream(new FileOutputStream(tokenizedFilename));
        } catch (IOException e) {
            System.err.println("ERROR: cannot save online tokenization to " + tokenizedFilename);
            e.printStackTrace();
            System.exit(1);
        }
        for (String sentence : sentences) {
            BioNLPFormatReader.BioNLPTokenizer tokenizer = new BioNLPFormatReader.BioNLPTokenizer(sentence);
            List<CoreLabel> tokens = tokenizer.tokenize();
            for (CoreLabel l : tokens) {
                os.print(l.word() + " ");
            }
            os.println();
        }
        os.close();
    }
}
I wrote the code below. I managed to split the text into sentences, but I cannot use BioNLPTokenizer the way it is used in RunBioNLPTokenizer.
public static void main(String[] args) throws Exception {
    Collection<String> c = BioNLPFormatReader.splitSentences("..");
    for (String sentence : c) {
        System.out.println(sentence);
        // This is the line that fails: BioNLPTokenizer has protected access in BioNLPFormatReader
        BioNLPFormatReader.BioNLPTokenizer x = new BioNLPFormatReader.BioNLPTokenizer(sentence);
    }
}
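I get this error:
Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader.BioNLPTokenizer has protected access in edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader
My question is: how can I tokenize biomedical sentences with the Stanford libraries without going through RunBioNLPTokenizer?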
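Thanks for your answer. I solved the problem (I think). I made my class extend BioNLPFormatReader, and that works for me. I have read that this is a beta version. Is there a tokenizer for biomedical text in the library? –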
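The workaround described in this comment can be sketched as follows. RunBioNLPTokenizer can use BioNLPTokenizer because it is declared in the same package as BioNLPFormatReader; code in any other package has to subclass BioNLPFormatReader to bring the protected nested class into scope. The sketch below uses hypothetical stand-in classes (ReaderWithProtectedTokenizer, WhitespaceTokenizer, MyTokenizerAccess) so that it compiles without the Stanford jars. Since a single file shares one package, the access restriction itself is not reproduced here; only the shape of the subclass is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical user class: extending the reader puts its protected nested
// class in scope, which is what "class MyReader extends BioNLPFormatReader"
// would do with the real library.
class MyTokenizerAccess extends ReaderWithProtectedTokenizer {

    public static List<String> tokenize(String sentence) {
        // Inside a subclass the nested type can be named and instantiated,
        // provided its constructor is public.
        return new WhitespaceTokenizer(sentence).tokenize();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("p53 binds  MDM2"));
    }
}

// Stand-in for BioNLPFormatReader (an assumption for illustration only).
class ReaderWithProtectedTokenizer {

    // Stand-in for the protected nested BioNLPTokenizer; it only splits on
    // whitespace and does not imitate the real tokenizer's rules.
    protected static class WhitespaceTokenizer {
        private final String text;

        public WhitespaceTokenizer(String text) {
            this.text = text;
        }

        public List<String> tokenize() {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        }
    }
}
```

With the real library, the body of tokenize would presumably mirror the loop in RunBioNLPTokenizer: call BioNLPFormatReader.splitSentences(text), then new BioNLPTokenizer(sentence).tokenize() for each sentence.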
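Glad to hear you found a workaround. I would probably say "mostly unmaintained" rather than "beta", since Mihai and I are no longer at Stanford :) Which library do you mean? – dmcc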
Well, I meant the Stanford CoreNLP library. But if you know anything about tokenization for biomedical text, I would appreciate it. Thanks in advance :) –
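As a rough illustration of what tokenization tuned for biomedical text can involve, here is a minimal sketch that splits on hyphens and slashes so that parts of names such as "IL-2" become separate tokens. It is a naive heuristic written for this thread, not the behavior of BioNLPTokenizer or of any Stanford CoreNLP class; the class name NaiveBioTokenizer and the regular expression are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, deliberately simple tokenizer (not part of any Stanford library).
class NaiveBioTokenizer {

    // A token is either a run of letters/digits or one non-space symbol,
    // so "-", "/" and "." always become tokens of their own.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print tokens separated by spaces, as RunBioNLPTokenizer does.
        System.out.println(String.join(" ", tokenize("IL-2/IL-4 signaling activates NF-kappaB.")));
    }
}
```

One known limitation of this rule is that decimal numbers such as "0.5" are also split into three tokens, so a real tokenizer would need extra cases for numbers and abbreviations.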