Splitting a sentence into words and punctuation

I need to parse a Sentence into words and punctuation marks (whitespace counts as punctuation) and then add all of them to a general ArrayList<Sentence>.

Example sentence:

A man, a plan, a canal - Panama!
A => word
whitespace => punctuation
man => word
, + space => punctuation
a => word
[...]

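I tried to read the whole sentence one character at a time, collecting consecutive characters of the same kind and creating a new Word or a new Punctuation from each collected run.

Here is my code: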

public class Sentence { 

    private String sentence; 
    private LinkedList<SentenceElement> elements; 

    /** 
    * Constructs a sentence. 
    * @param aText a string containing all characters of the sentence 
    */ 
    public Sentence(String aText) { 
     sentence = aText.trim(); 
     splitSentence(); 
    } 

    public String getSentence() { 
     return sentence; 
    } 

    public LinkedList<SentenceElement> getElements() { 
     return elements; 
    } 

    /** 
    * Splits the sentence into words and punctuation.
    */ 
    private void splitSentence() { 
     if (sentence == "" || sentence == null || sentence == "\n") { 
      return; 
     } 

     StringBuilder builder = new StringBuilder(); 

     int j = 0; 
     boolean mark = false; 
     while (j < sentence.length()) { 
      //char current = sentence.charAt(j); 

      while (Character.isLetter(sentence.charAt(j))) { 
       if (mark) { 
        elements.add(new Punctuation(builder.toString())); 
        builder.setLength(0); 
        mark = false; 
       } 
       builder.append(sentence.charAt(j)); 
       j++; 
      } 
      mark = true; 

      while (!Character.isLetter(sentence.charAt(j))) { 
       if (mark) { 
        elements.add(new Word(builder.toString())); 
        builder.setLength(0); 
        mark = false; 
       } 
       builder.append(sentence.charAt(j)); 
       j++; 
      } 
      mark = true; 
     } 
    }
}

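But the logic of splitSentence() does not work properly, and I can't find the right solution for it.

What I want to implement: read the first character => add it to the builder => as long as the following characters are of the same type (letter or punctuation), keep appending them to the builder => when a character differs in type from the builder's content => create a new Word or Punctuation from the builder and reset the builder.

Then repeat the same logic for the rest of the sentence.

How do I implement this check correctly?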
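The logic described above can be sketched as a single pass that flushes the builder whenever the character type changes, plus one final flush after the loop. This is only a minimal sketch under some assumptions: the class name RunSplitterSketch and the method splitIntoRuns are hypothetical, and it returns plain strings because the Word, Punctuation and SentenceElement classes are not shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: each run is returned as a String; in the question's design a
// letter run would become a Word and any other run a Punctuation.
final class RunSplitterSketch {

    /** Splits text into maximal runs of letters and maximal runs of non-letters. */
    static List<String> splitIntoRuns(String text) {
        List<String> runs = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return runs;
        }
        StringBuilder builder = new StringBuilder();
        boolean inWord = Character.isLetter(text.charAt(0));
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isLetter = Character.isLetter(c);
            if (isLetter != inWord) {
                // The character type changed: flush the finished run, start a new one.
                runs.add(builder.toString());
                builder.setLength(0);
                inWord = isLetter;
            }
            builder.append(c);
        }
        // Flush the last run, which no type change ever closes.
        runs.add(builder.toString());
        return runs;
    }

    public static void main(String[] args) {
        for (String run : splitIntoRuns("A man, a plan, a canal - Panama!")) {
            String kind = Character.isLetter(run.charAt(0)) ? "word" : "punctuation";
            System.out.println('"' + run + "\" (" + kind + ")");
        }
    }
}
```

A single loop bounded by text.length() avoids the nested while loops in the posted code, which call charAt(j) without checking j against the length. The posted code also never initializes the elements list, so that would need a new LinkedList<>() before the first add.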

Source

2014-01-05 nazar_art

'if (sentence == "" || sentence == null || sentence == "\n") {' Compare strings with 'equals'.

'sentence == ""'. Don't use '==' to compare string values. Use 'equals()'.
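To illustrate the point made in these comments, here is a small sketch of a content-based guard. The class name StringCompareSketch and the helper isBlank are hypothetical names chosen for this example; the null check has to come first so that no method is called on a null reference.

```java
final class StringCompareSketch {

    /** Returns true when the text is null or contains only whitespace. */
    static boolean isBlank(String text) {
        // trim() strips leading and trailing whitespace, including "\n",
        // so no separate comparison against "\n" is needed.
        return text == null || text.trim().isEmpty();
    }

    public static void main(String[] args) {
        String empty = new String("");         // a distinct object with the same content
        System.out.println(empty == "");       // false: '==' compares references
        System.out.println(empty.equals(""));  // true: equals() compares contents
        System.out.println(isBlank("\n"));     // true
    }
}
```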

Have you considered BreakIterator? http://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html – caprica
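For reference, a minimal sketch of what the BreakIterator suggestion could look like. The class name BreakIteratorSketch and the method segments are hypothetical, and Locale.ENGLISH is an assumption about the input. How punctuation and whitespace are grouped depends on the iterator's own rules, so the output may not match the grouping the question asks for.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSketch {

    /** Returns the pieces of text between consecutive word boundaries. */
    static List<String> segments(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        // next() returns the following boundary, or DONE when the text is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String segment : segments("A man, a plan")) {
            System.out.println('"' + segment + '"');
        }
    }
}
```

Each segment could then be classified, for example by checking Character.isLetter on its first character.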

Split the string at word boundaries (except the one at the start of the input):

String[] parts = sentence.split("(?<!^)\\b");

The array will contain alternating word/punctuation/word/punctuation/word etc.

Here's some test code:

String sentence = "A man, a plan, a canal — Panama!"; 
String[] parts = sentence.split("(?<!^)\\b"); 
for (String part : parts) 
    System.out.println('"' + part + "\" (" + (part.matches("\\w+") ? "word" : "punctuation") + ")");

Output:

"A" (word) 
" " (punctuation) 
"man" (word) 
", " (punctuation) 
"a" (word) 
" " (punctuation) 
"plan" (word) 
", " (punctuation) 
"a" (word) 
" " (punctuation) 
"canal" (word) 
" — " (punctuation) 
"Panama" (word) 
"!" (punctuation)

Source

2014-01-05 11:55:42 Bohemian

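It produces something like: '[•, A, man, , a, plan, , a, , canal, - , Panama]'. The whitespace stays attached next to the words. Is it possible to separate them?

It works for me. Remember that some of the commas are literals – Bohemian

It works - see the edited answer with test code and output – Bohemian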
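On the question of separating whitespace from the other punctuation: one possible approach, which is not part of the answer above, is to match tokens instead of splitting. The sketch below assumes three token kinds (word characters, whitespace, anything else); the class name TokenMatcherSketch and the method tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenMatcherSketch {

    // A run of word characters, a run of whitespace, or a run of anything else.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\s+|[^\\w\\s]+");

    /** Returns the tokens of text in order; together they cover the whole input. */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokens("A man, a plan, a canal - Panama!")) {
            System.out.println('"' + token + '"');
        }
    }
}
```

With this pattern a comma followed by a space yields two tokens, "," and " ", rather than the single ", " produced by splitting on \b. Note that \w matches only ASCII letters, digits and underscore by default.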
