2014-01-05 73 views

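Split a sentence into words and punctuation

I need to parse a Sentence into words and punctuation marks (whitespace counts as punctuation), and then add all of them to a general ArrayList<Sentence>.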

Example sentence:

A man, a plan, a canal - Panama!
A => word
whitespace => punctuation
man => word
, + space => punctuation
a => word
[...]

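I tried reading the whole sentence one character at a time, collecting consecutive characters of the same kind, and creating a new Word or a new Punctuation from each collected run.

Here is my code: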

public class Sentence { 

    private String sentence; 
    private LinkedList<SentenceElement> elements; 

    /** 
    * Constructs a sentence. 
    * @param aText a string containing all characters of the sentence 
    */ 
    public Sentence(String aText) { 
     sentence = aText.trim(); 
     splitSentence(); 
    } 

    public String getSentence() { 
     return sentence; 
    } 

    public LinkedList<SentenceElement> getElements() { 
     return elements; 
    } 

    /** 
    * Split the sentence into words and punctuation marks
    */ 
    private void splitSentence() { 
     if (sentence == "" || sentence == null || sentence == "\n") { 
      return; 
     } 

     StringBuilder builder = new StringBuilder(); 

     int j = 0; 
     boolean mark = false; 
     while (j < sentence.length()) { 
      //char current = sentence.charAt(j); 

      while (Character.isLetter(sentence.charAt(j))) { 
       if (mark) { 
        elements.add(new Punctuation(builder.toString())); 
        builder.setLength(0); 
        mark = false; 
       } 
       builder.append(sentence.charAt(j)); 
       j++; 
      } 
      mark = true; 

      while (!Character.isLetter(sentence.charAt(j))) { 
       if (mark) { 
        elements.add(new Word(builder.toString())); 
        builder.setLength(0); 
        mark = false; 
       } 
       builder.append(sentence.charAt(j)); 
       j++; 
      } 
      mark = true; 
     } 
    } 

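But the logic of splitSentence() does not work correctly, and I can't find the right solution for it.

What I want to implement: read the first character => add it to the builder => as long as the following characters are of the same type (letter or punctuation), keep adding them to the builder => when a character differs in type from the builder's contents => create a new Word or Punctuation and reset the builder.

Then repeat the same logic for the rest of the sentence.

How do I implement this check correctly?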
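For reference, the logic described above could be sketched roughly like this. It is only a minimal sketch: `RunSplitter` and `splitRuns` are hypothetical names, not part of the original Sentence class, and the runs are returned as plain strings instead of Word/Punctuation objects. A single loop with one flag replaces the two nested loops, and the last run is flushed after the loop ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original Sentence class.
class RunSplitter {

    /** Splits text into maximal runs of letters and maximal runs of non-letters. */
    static List<String> splitRuns(String text) {
        List<String> runs = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return runs;
        }
        StringBuilder builder = new StringBuilder();
        // Type of the run currently being collected (true = letters).
        boolean inWord = Character.isLetter(text.charAt(0));
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean letter = Character.isLetter(c);
            if (letter != inWord) {
                // The type changed: store the finished run and start a new one.
                runs.add(builder.toString());
                builder.setLength(0);
                inWord = letter;
            }
            builder.append(c);
        }
        // Store the run that was still being collected when the text ended.
        runs.add(builder.toString());
        return runs;
    }

    public static void main(String[] args) {
        for (String run : splitRuns("A man, a plan")) {
            System.out.println("[" + run + "]");
        }
    }
}
```

Each run could then be wrapped in a Word or a Punctuation, depending on whether its first character is a letter.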

'if (sentence == "" || sentence == null || sentence == "\n") {' Compare strings with 'equals'.

'sentence == ""'. Don't use '==' to compare string values. Use 'equals()'.
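A small illustration of the point made in these comments. `EmptyCheck` and `isBlank` are hypothetical names used only for this sketch. Note that the null check has to come before any method call on the string, which also applies to the `trim()` call in the constructor.

```java
// Hypothetical helper for illustration only.
class EmptyCheck {

    /** Returns true when the text is null or contains only whitespace. */
    static boolean isBlank(String text) {
        // The null check comes first; calling a method on null throws NullPointerException.
        return text == null || text.trim().isEmpty();
    }

    public static void main(String[] args) {
        String a = new String("");          // a distinct object with empty content
        System.out.println(a == "");        // false: compares references
        System.out.println(a.equals(""));   // true: compares contents
        System.out.println(isBlank("\n"));  // true: trim() removes the newline
    }
}
```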

Have you considered BreakIterator? http://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html – caprica
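A minimal sketch of what the BreakIterator approach might look like. `BreakIteratorSplit` and `split` are hypothetical names, and `Locale.ENGLISH` is an assumption about the input. Unlike the grouping asked for in the question, a word-instance BreakIterator generally reports punctuation and whitespace as separate pieces, so ", " would need to be merged afterwards if a single Punctuation element is wanted.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only.
class BreakIteratorSplit {

    /** Cuts the text at every boundary reported by a word-instance BreakIterator. */
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            // Everything between two consecutive boundaries is one piece.
            parts.add(text.substring(start, end));
            start = end;
            end = it.next();
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("A man, a plan")) {
            System.out.println("[" + part + "]");
        }
    }
}
```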

Answer


Split the string on word boundaries (except the one at the very start):

String[] parts = sentence.split("(?<!^)\\b"); 

The array will contain alternating word/punctuation/word/punctuation/word, etc.
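To connect this with the question, the alternating parts could be turned into elements roughly as follows. This is a sketch under assumptions: the question does not show SentenceElement, Word or Punctuation, so the versions below are hypothetical stand-ins, and `ElementBuilder.toElements` is an invented name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins: the original post does not show these classes.
abstract class SentenceElement {
    final String text;
    SentenceElement(String text) { this.text = text; }
}

class Word extends SentenceElement {
    Word(String text) { super(text); }
}

class Punctuation extends SentenceElement {
    Punctuation(String text) { super(text); }
}

class ElementBuilder {

    /** Splits on word boundaries and wraps each part in a Word or a Punctuation. */
    static List<SentenceElement> toElements(String sentence) {
        List<SentenceElement> elements = new ArrayList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements;
        }
        for (String part : sentence.split("(?<!^)\\b")) {
            // A part made only of word characters is a word; anything else is punctuation.
            elements.add(part.matches("\\w+") ? new Word(part) : new Punctuation(part));
        }
        return elements;
    }

    public static void main(String[] args) {
        for (SentenceElement e : toElements("A man, a plan!")) {
            System.out.println(e.getClass().getSimpleName() + ": [" + e.text + "]");
        }
    }
}
```

Note that `\w` also matches digits and underscores, so a part like "42" counts as a word here.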


Here's some test code:

String sentence = "A man, a plan, a canal — Panama!"; 
String[] parts = sentence.split("(?<!^)\\b"); 
for (String part : parts) 
    System.out.println('"' + part + "\" (" + (part.matches("\\w+") ? "word" : "punctuation") + ")"); 

Output:

"A" (word) 
" " (punctuation) 
"man" (word) 
", " (punctuation) 
"a" (word) 
" " (punctuation) 
"plan" (word) 
", " (punctuation) 
"a" (word) 
" " (punctuation) 
"canal" (word) 
" — " (punctuation) 
"Panama" (word) 
"!" (punctuation) 
It produces something like: '[•,A,man,,a,plan,,a, ,canal, - ,Panama]'. It keeps the whitespace next to the words. Is it possible to separate them?

It works for me. Keep in mind that some of those commas are literal – Bohemian

It works - see the edited answer with test code and output – Bohemian
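If whitespace really should be separated from other punctuation, as asked in the comment above, one possible variation is to match tokens instead of splitting. This is a different technique from the `split` call in the answer, shown only as a sketch: `ThreeWayTokenizer` and `tokenize` are hypothetical names, and the pattern distinguishes runs of letters, runs of whitespace, and runs of everything else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ThreeWayTokenizer {

    // Alternatives: a run of letters | a run of whitespace | a run of anything else.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\s+|[^\\p{L}\\s]+");

    /** Returns the tokens in order; every character ends up in exactly one token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("A man, a plan")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With this pattern, "man, a" yields "man", ",", " " and "a" as four separate tokens.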
