有沒有一種算法可以用來從段落中提取簡單的句子?從複雜(混合)句子中提取簡單句子的算法?
我的最終目標是稍後在結果的簡單句子上運行另一個算法來確定作者的情緒。
我已經從Chae-Deug Park等來源研究過這些,但沒有討論過將簡單句子作爲訓練數據進行討論。
在此先感謝
有沒有一種算法可以用來從段落中提取簡單的句子?從複雜(混合)句子中提取簡單句子的算法?
我的最終目標是稍後在結果的簡單句子上運行另一個算法來確定作者的情緒。
我已經從Chae-Deug Park等來源研究過這些,但沒有討論過將簡單句子作爲訓練數據進行討論。
在此先感謝
我剛剛使用openNLP。
public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws FileNotFoundException, IOException,
InvalidFormatException {
InputStream is = new FileInputStream("resources/models/en-sent.bin");
SentenceModel model = new SentenceModel(is);
SentenceDetectorME sdetector = new SentenceDetectorME(model);
String[] sentDetect = sdetector.sentDetect(paragraph);
is.close();
return Arrays.asList(sentDetect);
}
例
//Failed at Hi.
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//not able to break on noone
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
它失敗時,纔會有人類的錯誤。例如。 「博士」縮寫應該有大寫字母D,並且在2個句子之間至少有1個空格。
您也可以通過以下方式使用RE來實現它;
public static List<String> breakIntoSentencesCustomRESplitter(String paragraph){
List<String> sentences = new ArrayList<String>();
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(paragraph);
while (reMatcher.find()) {
sentences.add(reMatcher.group());
}
return sentences;
}
例
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Mr., mrs.
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at U.S.
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
但錯誤是競爭力高。另一種方法是使用BreakIterator;
public static List<String> breakIntoSentencesBreakIterator(String paragraph){
List<String> sentences = new ArrayList<String>();
BreakIterator sentenceIterator =
BreakIterator.getSentenceInstance(Locale.ENGLISH);
BreakIterator sentenceInstance = sentenceIterator.getSentenceInstance();
sentenceInstance.setText(paragraph);
int end = sentenceInstance.last();
for (int start = sentenceInstance.previous();
start != BreakIterator.DONE;
end = start, start = sentenceInstance.previous()) {
sentences.add(paragraph.substring(start,end));
}
return sentences;
}
實施例:
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Mr.
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
標杆:
看看Apache OpenNLP,它有一個句子的檢測器模塊。該文檔提供瞭如何從命令行和API使用它的示例。
你說的「簡單的句子」究竟是什麼意思?只是一個句子而不是一個段落 - 在這種情況下,您的問題是關於句子邊界檢測。或者只包含一個主謂語的句子(而不是一個複雜的句子,其中有從句等)?或者完全不同的東西? – jogojapan 2012-04-11 03:19:49
嗨jogojapan,是的,這是正確的,我的意思只是一個句子,而不是一個段落... – 2012-04-14 22:39:42
你沒有正確定義你的意思是一個簡單的句子,所以它很難讓任何人回答你的問題。也許你想用斯坦福分析器這樣的東西來得到每個句子的解析樹,並去除所有不屬於「NP VP」類型的句子,即構成名詞短語後跟動詞短語的句子(例如'[約翰] [坐在長凳上]','[瑪麗和吉爾] [吃了他們的三明治]等等) – 2012-04-17 07:21:00