Splitting paragraphs into sentences - a special case

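I am new to Java programming. I want to split the paragraphs in a file into sentences and write them to a different file. There should also be a mechanism to identify which sentence comes from which paragraph. The code I have used so far is given below. But this code breaks: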

Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.

into

Former Secretary of Finance Dr. 
P.B. 
Jayasundera is being questioned by the police Financial Crime Investigation Division.

How can I correct this? Thanks in advance.

import java.io.*;

class trial4 {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             OutputStream out = new FileOutputStream("output10.txt")) {
            String s;
            while ((s = br.readLine()) != null) {
                // Split after '.', '!' or '?' followed by whitespace.
                // This is the step that also splits after "Dr." and "P.B."
                String[] tokens = s.split("(?<=[.!?])\\s* ");
                for (String token : tokens) {
                    if (!token.isEmpty()) {
                        out.write(token.getBytes());
                        out.write('\n');   // one sentence per output line
                    }
                }
            }
        }
    }
}

I have gone through all the similar questions posted on StackOverflow, but those answers could not help me solve this problem.

Source

2015-11-08 sugz

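This is going to be reasonably hard to do unless you can formalize some notion of "this period marks an abbreviation" vs. "this period marks the end of a sentence".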

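How about using a negative lookbehind combined with a replace? Put simply: replace every end-of-sentence mark that has nothing "special" in front of it with that mark followed by a newline.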

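A list of "known abbreviations" will be necessary. There is no guarantee how long those can be, nor how short a word at the end of a line might be. (See? "be" is pretty short already!)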

import java.io.*;

class trial4 {
    public static void main(String[] args) throws IOException {
        // try-with-resources flushes and closes the files when done
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             PrintStream out = new PrintStream(new FileOutputStream("output10.txt"))) {
            String s = br.readLine();
            while (s != null) {
                out.print(                               // Writes the line; a newline is added only after each sentence end found
                    s.replaceAll("(?i)"                  // Make the match case insensitive
                        + "(?<!"                         // Negative lookbehind
                        + "(\\W\\w)|"                    // Single non-word followed by word character (P.B.)
                        + "(\\W\\d{1,2})|"               // One or two digits (dates!)
                        + "(\\W(dr|mr|mrs|ms))"          // List of known abbreviations
                        + ")"                            // End of lookbehind
                        + "([!?\\.])"                    // Match end of sentence (capture group 5)
                        , "$5"                           // Replace with the end-of-sentence mark found
                        + System.lineSeparator()));      // Add a newline after it
                s = br.readLine();
            }
        }
    }
}

Source

2015-11-08 10:38:27 Jan

It works perfectly! Thank you very much! :) – sugz

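Yes! :) I have one more question. What if the paragraphs are in an Excel sheet? Suppose each cell contains one paragraph. After splitting, the sentences can go into either a text file or an Excel sheet. How can that be achieved? – sugz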

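Hi, I'm sorry to bother you again. But when I give it a value like 3.2, it now gets split into different sentences. I didn't have this problem before. – sugz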
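One possible way around the decimal-number problem from the last comment is to treat a mark as a sentence end only when whitespace follows it, so the period in 3.2 is never a candidate. The sketch below is not the answerer's code: the class name `SentenceSplitSketch`, the `split` helper, and the abbreviation list are assumptions made for illustration, and the list would need extending for real text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SentenceSplitSketch {
    // Assumed abbreviation list (hypothetical); extend as needed.
    private static final List<String> ABBREVIATIONS = Arrays.asList("dr", "mr", "mrs", "ms");

    // Split on whitespace that follows '.', '!' or '?', unless the mark ends
    // a single letter (P.B.) or one of the known abbreviations.
    private static final Pattern BOUNDARY = Pattern.compile(
            "(?i)(?<=[.!?])"
            + "(?<!\\b(?:\\w|" + String.join("|", ABBREVIATIONS) + ")[.])"
            + "\\s+");

    static List<String> split(String paragraph) {
        return Arrays.asList(BOUNDARY.split(paragraph.trim()));
    }

    public static void main(String[] args) {
        String text = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned. "
                + "The value is 3.2 today. Done!";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Because the period in 3.2 is followed by a digit rather than whitespace, it is never considered. The trade-off is that a sentence ending in a single-letter word or in a listed abbreviation will not be split.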

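As the comment says, "this is going to be reasonably hard": there is no formal rule for breaking a paragraph of text into sentences. Take a look at BreakIterator, in particular the SentenceInstance. You could roll your own BreakIterator, since it amounts to the same thing as breaking with regular expressions, only more abstract. Or try to find a third-party solution like http://deeplearning4j.org/sentenceiterator.html, which can be trained to tokenize your input.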

Example with BreakIterator:

import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorExample {
    public static void main(String[] args) {
        String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";

        BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
        bilus.setText(str);

        int last = bilus.first();
        int count = 0;

        // Walk the boundaries; each [first, last) range is one sentence
        while (BreakIterator.DONE != last) {
            int first = last;
            last = bilus.next();

            if (BreakIterator.DONE != last) {
                String sentence = str.substring(first, last);
                System.out.println("Sentence:" + sentence);
                count++;
            }
        }
        System.out.println(count + " sentences found.");
    }
}
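The question also asks for a way to tell which paragraph each sentence came from. A minimal sketch of one way to do that with BreakIterator follows, assuming each input line is one paragraph, as in the question's code. The class name `ParagraphSentenceSketch`, the `tagSentences` helper, and the tab-separated output format are assumptions for illustration. How BreakIterator treats abbreviations such as "Dr." depends on its built-in rules, so its output should be checked on real data.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ParagraphSentenceSketch {
    // Returns one entry per sentence, prefixed with its 1-based paragraph number and a tab.
    static List<String> tagSentences(List<String> paragraphs) {
        List<String> tagged = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        int paragraphNo = 0;
        for (String paragraph : paragraphs) {
            paragraphNo++;
            it.setText(paragraph);
            int start = it.first();
            // Each [start, end) range between two boundaries is one sentence
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = paragraph.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    tagged.add(paragraphNo + "\t" + sentence);
                }
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("Hello there. How are you?", "Fine!");
        for (String line : tagSentences(paragraphs)) {
            System.out.println(line);
        }
    }
}
```

Writing the paragraph number in front of each sentence keeps the mapping in the output file itself, so no second index file is needed.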

Source

2015-11-08 10:39:28 Willmore
