2013-08-20 54 views
0

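Capturing multiple text blocks with a regular expression in Java

What regular expression should be used to extract multiple text blocks that are delimited by headers, where the headers should also be parsed? For example: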

some text info before message sequence 
============ 
first message header that should be parsed (may contain = character) 
============ 
first multiline 
message body that 
should also be parsed 
(may contain = character) 
============ 
second message header that should be parsed 
============ 
second multiline 
message body that 
should also be parsed 
... and so on 

I tried to use:

String regex = "^=+$\n"+ 
     "^(.+)$\n"+ 
     "^=+$\n"+ 
     "((?s:(?!(^=.+)).+))"; 
Pattern p = Pattern.compile(regex, Pattern.MULTILINE); 

The part ((?s:(?!(^=.+)).+)) eats the second message as well. Here is a test that shows the problem:

import java.util.regex.Matcher; 
import java.util.regex.Pattern; 
import org.junit.Assert; 
import org.junit.Test; 
public class ParsingTest { 
@Test 
public void test() { 
    String fstMsgHeader = "first message header that should be parsed (may contain = character)"; 
    String fstMsgBody = "first multiline\n"+ 
         "message body that\n"+ 
         "should also be parsed\n"+ 
         "(may contain = character)"; 
    String sndMsgHeader = "second message header that should be parsed"; 
    String sndMsgBody = "second multiline\n"+ 
      "message body that\n"+ 
      "should also be parsed\n"+ 
      "... and so on"; 
    String sample = "some text info before message sequence\n"+ 
        "============\n"+ 
        fstMsgHeader+"\n"+ 
        "============\n"+ 
        fstMsgBody+"\n"+ 
        "============\n"+ 
        sndMsgHeader+"\n"+ 
        "============\n"+ 
        sndMsgBody +"\n"; 
    System.out.println(sample); 
    String regex = "^=+$\n"+ 
        "^(.+)$\n"+ 
        "^=+$\n"+ 
        "((?s:(?!(^=.+)).+))"; 
    Pattern p = Pattern.compile(regex, Pattern.MULTILINE); 
    Matcher matcher = p.matcher(sample); 
    int blockNumber = 1; 
    while (matcher.find()) { 
     System.out.println("Block "+blockNumber+": "+matcher.group(0)+"\n_________________"); 
     if (blockNumber == 1) { 
      Assert.assertEquals(fstMsgHeader, matcher.group(1)); 
      Assert.assertEquals(fstMsgBody, matcher.group(2)); 
     } else { 
      Assert.assertEquals(sndMsgHeader, matcher.group(1)); 
      Assert.assertEquals(sndMsgBody, matcher.group(2)); 
     } 
     blockNumber++; 
    } 
} 

}

+4

Why not use sample.split("============")? – Marc

+1

What output do you expect, and what output do you actually get? – sp00m

+0

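Regarding split: I have already done it with split, but it seems that capturing the messages and their headers with a single regular expression makes the code cleaner (one while loop with group accessors). So I am considering this variant. –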
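For comparison, the split-based variant discussed in these comments could look roughly like the sketch below. It is only an illustration: the class name SplitBlocks, the helper chomp, and the choice to split on whole lines made only of '=' characters are assumptions, not code from the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBlocks {

    // Splits the text on lines that consist only of '=' characters and
    // pairs up the pieces as {header, body}. The first piece is the text
    // before the first delimiter and is skipped.
    static List<String[]> parse(String text) {
        String[] parts = text.split("(?m)^=+$\\n?");
        List<String[]> blocks = new ArrayList<>();
        for (int i = 1; i + 1 < parts.length; i += 2) {
            blocks.add(new String[] { chomp(parts[i]), chomp(parts[i + 1]) });
        }
        return blocks;
    }

    // Removes a single trailing newline, if there is one.
    static String chomp(String s) {
        return s.endsWith("\n") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        String text = "intro\n====\nheader = 1\n====\nbody line 1\nbody line 2\n"
                + "====\nheader 2\n====\nbody 2\n";
        for (String[] block : parse(text)) {
            System.out.println("Header: " + block[0]);
            System.out.println("Body: " + block[1]);
        }
    }
}
```

Compared with a single regular expression, this needs index arithmetic to pair headers with bodies, which is the readability trade-off mentioned above.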

Answers

1

I don't know if this is what you are looking for, but maybe this regex will help:

String regex = 
     "={12}\n" + // twelve '=' marks and new line mark 
     "(.+?)" +  // minimal match that has 
     "\n={12}\n" + // new line mark with twelve '=' marks after it 
     "(.+?)(?=\n={12}|$)"; // minimal match that will have new line 
           // character and twelve `=` marks after 
           // it or end of data $ 

To make it work, you should also make the dot match newline characters by using the Pattern.DOTALL flag:

Pattern p = Pattern.compile(regex, Pattern.DOTALL); 
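
Put together, the regex and flag above can be used in a loop like the following. This is a minimal sketch: the class name MessageBlocks, the DELIM constant, and the parse helper are made up for illustration, and it assumes the delimiter is always exactly twelve '=' characters and that lines end with "\n".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MessageBlocks {

    // Twelve '=' characters, built programmatically to avoid miscounting.
    static final String DELIM = twelveEquals();

    // The pattern from the answer: group 1 is the header, group 2 the body.
    static final Pattern BLOCK = Pattern.compile(
            "={12}\n" + "(.+?)" + "\n={12}\n" + "(.+?)(?=\n={12}|$)",
            Pattern.DOTALL);

    static String twelveEquals() {
        char[] chars = new char[12];
        Arrays.fill(chars, '=');
        return new String(chars);
    }

    // Returns one {header, body} pair per message block found in the text.
    static List<String[]> parse(String text) {
        List<String[]> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(new String[] { m.group(1), m.group(2) });
        }
        return blocks;
    }

    public static void main(String[] args) {
        String sample = "some text info before message sequence\n"
                + DELIM + "\nfirst header (may contain = character)\n"
                + DELIM + "\nfirst multiline\nmessage body\n"
                + DELIM + "\nsecond header\n"
                + DELIM + "\nsecond multiline\nmessage body\n";
        for (String[] block : parse(sample)) {
            System.out.println("Header: " + block[0]);
            System.out.println("Body: " + block[1]);
        }
    }
}
```

Without Pattern.MULTILINE, the final `$` matches only at the end of the input (or just before a trailing line terminator), so the last body runs to the end of the text.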
+0

Pshemo, thank you, it works. Could you describe what (.+?) means? –

+0

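@MikhailTsaplin Normally '.+' is greedy, so it will try to find the longest possible match. If you add '?' after it, as in '(.+?)', the '+' quantifier becomes reluctant, so it will try to find the shortest possible match. For more info visit http://docs.oracle.com/javase/tutorial/essential/regex/quant.html. – Pshemo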
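
The difference between the greedy and the reluctant quantifier can be seen in a small experiment. The class name GreedyVsReluctant, the firstGroup helper, and the input string are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<a><b><c>";
        // Greedy: .+ grabs as much as it can, then backs off to the last '>'.
        System.out.println(firstGroup("<(.+)>", input));  // prints a><b><c
        // Reluctant: .+? stops at the first '>' that lets the match succeed.
        System.out.println(firstGroup("<(.+?)>", input)); // prints a
    }
}
```

In the regex from the answer, the reluctant form is what keeps each body from running past the next line of '=' characters.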