2016-02-02 72 views
2

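Nested/repeated groups in a regular expression

I have to parse a multi-line string and retrieve the email addresses that appear at specific positions in it.

I got it working with the following code: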

String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n" 
      + "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <[email protected]>\r\n" 
      + "To: DDDDD dd <[email protected]>\r\n" + "CC: Rrrrr rrede <[email protected]>, Dsssssf V R\r\n" 
      + " <[email protected]>, Psssss A <[email protected]>, Logistics\r\n" 
      + " <[email protected]>, Gssss Bsss P <[email protected]>\r\n" 
      + "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n" 
      + " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n" 
      + " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n" 
      + "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n" 
      + "Message-ID: <[email protected]>\r\n" 
      + "References: <[email protected]>\r\n" 
      + " <[email protected]>\r\n" 
      + "In-Reply-To: <[email protected]>\r\n" 
      + "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n" 
      + "X-MS-Exchange-Organization-SCL: -1\r\n" 
      + "X-MS-TNEF-Correlator: <[email protected]>\r\n" 
      + "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n" 
      + "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n" 
      + "X-Originating-IP: [1.1.1.7]"; 

// Outer pattern: capture everything from "To:" up to the last <...> before "Message-ID".
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
// Inner pattern: pull each <...> address out of the captured block (compiled once).
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
    while (innerMatcher.find()) {
        System.out.println("-->:" + innerMatcher.group(1));
    }
}

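This works fine. The first pattern captures the part from To: up to Message-ID, which is the part I need. A second pattern then extracts the email IDs from that capture. Is there a better way to do this? Can it be done with a single pattern and matcher?

Update: this is the expected output: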

-->:[email protected] 
-->:[email protected] 
-->:[email protected] 
-->:[email protected] 
-->:[email protected] 
-->:[email protected] 

Can you show what you expect to retrieve? – Cyrbil

Answers

1

I think you are looking for all emails inside <...> that come after To: and before Message-ID. If so, you can do it in one pass with a \G-based regex:

Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL); 
Matcher m = pt.matcher(input); 
while (m.find()) { 
    System.out.println(m.group(1)); 
} 


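The regex matches:

  • (?:\\bTo:|(?!^)\\G) - the leading boundary: either To: as a whole word, or the position right after the previous successful match
  • .*? - any characters, as few as possible, up to the first occurrence of what follows
  • <([^>]*)> - a substring that starts with <, followed by zero or more characters other than > (Group 1), and then the closing >
  • (?=.*Message-ID) - a positive lookahead that makes sure Message-ID still occurs somewhere after the current match.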
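
To make the breakdown above easier to try out, here is a minimal, self-contained sketch that wraps the same pattern in a helper. The class name SinglePassHeaderEmails, the method extract, and the example.com sample headers are made up for illustration; only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex is taken from the answer.
class SinglePassHeaderEmails {
    // DOTALL lets '.' cross the \r\n of folded header lines.
    private static final Pattern ADDRESSES = Pattern.compile(
            "(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);

    // Collects every <...> value found after "To:" while "Message-ID" is still ahead.
    static List<String> extract(String headers) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESSES.matcher(headers);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample headers, much shorter than the input in the question.
        String sample = "From: Sender <sender@example.com>\r\n"
                + "To: First <first@example.com>\r\n"
                + "CC: Second <second@example.com>,\r\n"
                + " Third <third@example.com>\r\n"
                + "Message-ID: <id-1@example.com>\r\n"
                + "In-Reply-To: <id-0@example.com>\r\n";
        System.out.println(extract(sample));
    }
}
```

With this sample the From: address is skipped because it precedes To:, and the Message-ID and In-Reply-To values are skipped because no Message-ID text follows them, so the list holds the three To/CC addresses.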

Along with this answer, [this](http://stackoverflow.com/a/35154460/2270563) answer was also helpful! – Ram

2

Ideally, you could also use lookarounds:

(?<=To:.*)<([^>]+)>(?=.*Message-ID) 

Unfortunately, Java doesn't support unbounded-length lookbehinds such as (?<=To:.*). A workaround is to give the quantifier an upper bound:

(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID) 
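
A minimal sketch of how the bounded lookbehind might be used from Java follows. The answer shows only the bare regex, so the Pattern.DOTALL flag is an assumption added here: without it, . would not cross the line breaks between To: and the folded CC: lines. The class name BoundedLookbehindEmails, the method extract, and the sample headers are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the regex comes from the answer, DOTALL is an added assumption.
class BoundedLookbehindEmails {
    // The {0,1000} bound keeps the lookbehind finite, which Java accepts.
    private static final Pattern ADDRESSES = Pattern.compile(
            "(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)", Pattern.DOTALL);

    // Collects each <...> value that has "To:" at most 1000 chars behind it
    // and "Message-ID" somewhere ahead of it.
    static List<String> extract(String headers) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESSES.matcher(headers);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample headers for illustration.
        String sample = "From: Sender <sender@example.com>\r\n"
                + "To: First <first@example.com>\r\n"
                + "CC: Second <second@example.com>,\r\n"
                + " Third <third@example.com>\r\n"
                + "Message-ID: <id-1@example.com>\r\n";
        System.out.println(extract(sample));
    }
}
```

One trade-off of this variant: any address that sits more than 1000 characters after To: is silently missed, so the bound has to be chosen with the longest expected recipient list in mind.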

Java does support the [*constrained-width lookbehind*](http://www.rexegg.com/regex-lookarounds.html#width) that you show in your answer. –