Text mining of emails

I have a set of emails in a text file, and I want to extract the body from each of them. A sample document is shown below.

Email: 1 
=============== 


    MIME-Version: 1.0 
    Received: by 10.68.8.6 with HTTP; Sat, 7 Apr 2012 01:04:45 -0700 (PDT) 
    Date: Sat, 7 Apr 2012 13:34:45 +0530 
    Delivered-To: [email protected] 
    Message-ID: <[email protected]om> 
    Subject: hello 
    From: twisty princess <[email protected]> 
    To: twisty princess <[email protected]> 
    Content-Type: multipart/alternative; boundary=047d7b33d826e6762004bd1239b5 
    --047d7b33d826e6762004bd1239b5    
    Content-Type: text/plain; charset=ISO-8859-1 

    hey How are you doing? 

    --047d7b33d826e6762004bd1239b5  
    Content-Type: text/html; charset=ISO-8859-1 

    <br><br>hey How are you doing?<br> 

    --047d7b33d826e6762004bd1239b5--

So from this message, I only want "hey How are you doing?". I want to do this with regular expressions and C#.


2012-04-09 Cyang

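Is it one text file with many such sections? Are all the emails symmetric, i.e. do they follow the same format? Are "Email: 1" and the double-line separator actually in the text file, or did you insert them for SO? – 2012-04-09 05:52:00

Yes, all the emails are in the same format – Cyang 2012-04-09 06:01:39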

Use the regex boundary=([^\s]+) to find the boundary name:

Regex _boundaryRegex = new Regex(@"boundary=([^\s]+)"); // group 1 captures the boundary name
var bname = _boundaryRegex.Match(text).Groups[1].Value;

Then build the text-capturing regex by formatting bname into the pattern:

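// Singleline lets '.' match newlines; Regex.Escape guards against metacharacters in the boundary.
var textCapturer = new Regex(
    string.Format("--{0}(?<text>.*?)(?=--)", Regex.Escape(bname)), RegexOptions.Singleline);
foreach (Match match in textCapturer.Matches(text))
{
    Console.WriteLine(match.Groups["text"].Value);
}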

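It finds the value of the boundary parameter, then tries to match the text between the --boundary lines.

That said, I would not recommend using regular expressions for this kind of parsing.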
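Putting the two steps together, a minimal self-contained sketch might look like the following. The class and method names (EmailBodyExtractor, ExtractPlainText) are hypothetical and not part of the original answer. The sketch assumes that each part's headers are separated from its body by a blank line and that the boundary is declared on a single header line. It also goes one step further than the snippets above by keeping only the text/plain part, since that is what the question asks for.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailBodyExtractor
{
    // Finds the boundary declared in the Content-Type header.
    // Assumption: an optional double quote around the value is tolerated.
    private static readonly Regex BoundaryRegex =
        new Regex(@"boundary=""?([^\s"";]+)");

    // Returns the text/plain body of the first multipart message in 'text',
    // or null when no boundary or no text/plain part is found.
    public static string ExtractPlainText(string text)
    {
        Match boundary = BoundaryRegex.Match(text);
        if (!boundary.Success) return null;

        string bname = Regex.Escape(boundary.Groups[1].Value);

        // Each part runs from one "--boundary" line up to the next "--boundary".
        var partRegex = new Regex(
            "--" + bname + @"[ \t]*\r?\n(?<part>.*?)(?=--" + bname + ")",
            RegexOptions.Singleline);

        foreach (Match m in partRegex.Matches(text))
        {
            string part = m.Groups["part"].Value;

            // Part headers end at the first blank line; the body follows it.
            Match split = Regex.Match(part, @"\r?\n[ \t]*\r?\n");
            if (!split.Success) continue;

            string headers = part.Substring(0, split.Index);
            if (headers.IndexOf("text/plain", StringComparison.OrdinalIgnoreCase) < 0)
                continue;

            return part.Substring(split.Index + split.Length).Trim();
        }
        return null;
    }
}
```

Calling EmailBodyExtractor.ExtractPlainText on the contents of the file would return the plain-text body of the first email only; a file with several emails would need to be split per email first, for example on the "Email: N" header lines.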
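As one possible alternative, here is a rough line-by-line sketch that avoids regular expressions entirely. It is not from the original answer: PlainTextPartReader and ReadPlainTextBodies are hypothetical names, and the code assumes the boundary is declared on the same line as its Content-Type header, that a blank line ends each part's headers, and that leading and trailing whitespace on each line can be discarded. A dedicated MIME parsing library would likely be more robust for real mailboxes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PlainTextPartReader
{
    // Hypothetical helper: returns the text/plain body of every email in 'text'.
    public static List<string> ReadPlainTextBodies(string text)
    {
        var bodies = new List<string>();
        var current = new List<string>();
        string boundary = null;   // boundary of the message being read, if any
        bool inPlainPart = false; // current part declared text/plain
        bool inBody = false;      // past the blank line that ends the headers

        using (var reader = new StringReader(text))
        {
            string raw;
            while ((raw = reader.ReadLine()) != null)
            {
                string line = raw.Trim();

                // A delimiter line closes whatever part was open.
                if (boundary != null &&
                    line.StartsWith("--" + boundary, StringComparison.Ordinal))
                {
                    if (inPlainPart)
                        bodies.Add(string.Join("\n", current).Trim());
                    current.Clear();
                    inPlainPart = false;
                    inBody = false;
                    // "--boundary--" is the closing delimiter of the message.
                    if (line == "--" + boundary + "--") boundary = null;
                    continue;
                }

                if (inBody)
                {
                    if (inPlainPart) current.Add(line);
                    continue;
                }

                if (line.StartsWith("Content-Type:", StringComparison.OrdinalIgnoreCase))
                {
                    int idx = line.IndexOf("boundary=", StringComparison.OrdinalIgnoreCase);
                    if (idx >= 0)
                    {
                        // Take the parameter value up to the next ';', minus quotes.
                        string value = line.Substring(idx + "boundary=".Length);
                        int semi = value.IndexOf(';');
                        if (semi >= 0) value = value.Substring(0, semi);
                        boundary = value.Trim().Trim('"');
                    }
                    else if (boundary != null &&
                             line.IndexOf("text/plain", StringComparison.OrdinalIgnoreCase) >= 0)
                    {
                        inPlainPart = true;
                    }
                }
                else if (line.Length == 0 && boundary != null)
                {
                    inBody = true; // blank line: headers are over
                }
            }
        }
        return bodies;
    }
}
```

Because the state resets at every closing delimiter, the same pass collects one body per email when the file contains several messages in the same format.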


2012-04-09 05:53:12

Can you show me how to use it? – Cyang 2012-04-09 08:45:47

Check the edit – 2012-04-09 08:57:37
