2013-02-25

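Lua or XSL - Pulling words out of a string from Lync

I have a document that I am processing with Lua and/or XSL, since the solution I am using allows both. The data being processed is a compilation of instant messaging conversations from Lync 2013. I have been able to write some pattern matching scripts that pull some of the values out of the data below, but because users can configure how they want data displayed in their IM client, each user's data is stored differently.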

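What I need is a script that extracts the recipient (To), the sender (From), the date/time and the content of every message. I have noticed that when a message is wrapped in RTF markup, each word is followed by the string '\embo0'.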
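The '\embo0' observation can be sketched in Lua as follows. This is only a minimal illustration, not one of my existing scripts: the helper name `embo_words` is hypothetical, and it assumes that every word is immediately followed by '\embo0', as in message 2 of the sample data.

```lua
-- Hypothetical helper: collect every run of characters that is immediately
-- followed by the \embo0 control word.
local function embo_words(rtf_line)
    local words = {}
    -- [^\\%s{}]+ means: one or more characters that are not a backslash,
    -- whitespace or a brace.
    for word in rtf_line:gmatch('([^\\%s{}]+)\\embo0') do
        words[#words + 1] = word
    end
    return table.concat(words, ' ')
end

-- RTF line copied from message 2 of the sample data.
local sample = [[\pard\cf1\embo\f0\fs20 Test\embo0 \embo from\embo0 \embo Lync\embo0 \embo 2013\embo0 \f1\par]]
print(embo_words(sample))  -- Test from Lync 2013
```

On its own this misses messages such as message 4, where the text is a hyperlink field and no '\embo0' follows it.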

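Is there a way to process an entire data set like the sample below and produce the desired output shown after it? The scripts I have only pull out the parts of the conversation that match the patterns I defined, and strip out everything else.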

Data:

<?xml version="1.0" encoding="utf-8"?> 
<session Type="Conversation" SessionIdTime="2013-01-18 17:18:01Z" SessionIdSeq="1"> 
<Reference>OCSSession-Conversation_2013-01-18 17:18:01Z_1</Reference> 
<participants> 
    <participant> 
     <name>[email protected]</name> 
    </participant> 
    <participant> 
     <name>[email protected]</name> 
    </participant> 
</participants> 
<conversation InviteTime="2013-01-18 17:18:01Z" InitiatedBy="[email protected]" /> 
<messages> 
    <message Id="1" Time="2013-01-18 17:18:01Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/html">&lt;span style="font-family:Segoe UI;  color:#000000; font-size:10pt;"&gt;Test from Lync 2013&lt;/span&gt;</content> 
    </message> 
    <message Id="2" Time="2013-01-18 17:18:02Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/rtf">{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} 
{\colortbl ;\red0\green0\blue0;} 
{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 
\pard\cf1\embo\f0\fs20 Test\embo0 \embo from\embo0 \embo Lync\embo0 \embo 2013\embo0 \f1\par 
{\*\lyncflags rtf=1}} 
     </content> 
    </message> 
    <message Id="3" Time="2013-01-18 17:18:07Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/html">&lt;DIV style="font-size: 9pt;font-family: MS Shell Dlg 2;color: #000000;direction: ltr"&gt;got it&lt;/DIV&gt;</content> 
    </message> 
    <message Id="4" Time="2013-01-18 17:20:05Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/rtf">{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil Segoe UI;}} 
{\colortbl ;\red0\green0\blue0;\red0\green0\blue255;} 
{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 
\pard {\cf1\outl\f0\fs20{\field{\*\fldinst{HYPERLINK http://jefferytay.wordpress.com /2010/12/09/converting-a-pfx-file-to-pem-and-key-via-openssl/ }}{\fldrslt{http://jefferytay.wordpress.com/2010/12/09/converting-a-pfx-file-to-pem-and-key-via-openssl/\ul0\cf0}}}}\f0\fs20\par 
{\*\lyncflags rtf=1}} 
     </content> 
    </message> 
    <message Id="5" Time="2013-01-18 17:20:19Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/rtf">{\rtf1\fbidis\ansi\ansicpg1252\deff0 \nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} 
{\colortbl ;\red0\green0\blue0;} 
{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 
\pard\cf1\embo\f0\fs20 How\embo0 \embo does\embo0 \embo the\embo0 \embo URL\embo0  \embo look\embo0 \embo on\embo0 \embo your\embo0 \embo end?\embo0\f1\par 
{\*\lyncflags rtf=1}} 
     </content> 
    </message> 
    <message Id="6" Time="2013-01-18 17:20:25Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/html">&lt;DIV style="font-size: 9pt;font-family: MS Shell Dlg 2;color: #000000;direction: ltr"&gt;its plain text&lt;/DIV&gt;</content> 
    </message> 
    <message Id="7" Time="2013-01-18 17:20:34Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/html">&lt;DIV style="font-size: 9pt;font-family: MS Shell Dlg 2;color: #000000;direction: ltr"&gt;not clickable&lt;/DIV&gt;</content> 
    </message> 
    <message Id="8" Time="2013-01-18 17:20:50Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/html">&lt;DIV style="font-size: 9pt;font-family: MS Shell Dlg 2;color: #000000;direction: ltr"&gt;how does this look?&amp;nbsp; _http://www.cnn.com&lt;/DIV&gt;</content> 
    </message> 
    <message Id="9" Time="2013-01-18 17:21:07Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/html">&lt;DIV style="font-size: 9pt;font-family: MS Shell Dlg 2;color: #000000;direction: ltr"&gt;_http://powertoe.wordpress.com/2009/12/14/powershell-part-4-arrays-and-for-loops/&lt;/DIV&gt;</content> 
    </message> 
    <message Id="10" Time="2013-01-18 17:21:38Z"> 
     <from>[email protected]</from> 
     <to>[email protected]</to> 
     <content Type="text/rtf">{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} 
{\colortbl ;\red0\green0\blue0;\red0\green0\blue255;} 
{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 
\pard\cf1\embo\f0\fs20 Please\embo0 \embo go\embo0 \embo ahead\embo0 \embo and\embo0 \embo install\embo0 \embo the\embo0 \embo new\embo0 \embo client\embo0   {\embo{\field{\*\fldinst{HYPERLINK "n:\\\\apps\\\\microsoft\\\\lync2013\\\\client\\\\setup.exe"}}{\fldrslt{n:\\apps\\microsoft\\lync2013\\client\\setup.exe\ul0\cf0}}}}\f0\fs20 \embo Once\embo0 \embo you\embo0 \embo install\embo0 \embo it,\embo0 \embo it\embo0 \embo will\embo0 \embo force\embo0 \embo a\embo0 \embo reboot.\embo0 \embo After\embo0 \embo it\embo0 \embo reboots,\embo0 \embo you\embo0 \embo have\embo0 \embo to\embo0 \embo close\embo0 \embo out\embo0 \embo of\embo0 \embo communicator.exe\embo0 \embo completely.\embo0\f1\par 
{\*\lyncflags rtf=1}} 
     </content> 
    </message> 
</messages> 

Desired output:

From: [email protected]</name> 
To: [email protected]</name> 

2013-01-18 17:18:02Z 
[email protected]: Test from Lync 2013 

2013-01-18 17:18:07Z 
[email protected]: got it 

2013-01-18 17:20:05Z 
[email protected]: http://jefferytay.wordpress.com/2010/12/09/converting-a-pfx-file-to-pem-and-key-via-openssl/ 

2013-01-18 17:20:19Z: How does the URL look on your end? 

2013-01-18 17:20:25Z 
[email protected]: its plain text 

2013-01-18 17:20:34Z 
[email protected]: not clickable 

2013-01-18 17:20:50Z 
[email protected]: how does this look? _http://www.cnn.com 

2013-01-18 17:21:07Z 
[email protected]: _http://powertoe.wordpress.com/2009/12/14/powershell-part-4-arrays-and-for-loops/ 

2013-01-18 17:21:38Z: 
[email protected]: Please go ahead and install the new client 
Once you install it, it will force a reboot. After it reboots, you have to close out of communicator.exe completely. 

Answer

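-- Convert the payload of one <content> element to plain text.
local function convert_string(content_type, str)
    if content_type == 'text/html' then
        -- The HTML is XML-escaped: drop the escaped tags, then entities such as &amp;nbsp;
        str = str:gsub('&lt;.-&gt;', ''):gsub('&amp;.-;', '')
    elseif content_type == 'text/rtf' then
        local links = {}
        -- Work on what is inside the outermost RTF group.
        local body = str:match('^%s*{(.*)}%s*$') or str
        -- Drop nested groups (font table, colour table, ...). HYPERLINK targets are
        -- parked behind a \1N\1 placeholder so the later steps leave them alone.
        body = body:gsub('%b{}', function(group)
            local link = group:match('HYPERLINK%s+(.-)%s*}')
            if not link then return '' end
            -- Strip surrounding quotes and collapse runs of escaped backslashes.
            links[#links + 1] = link:gsub('^"(.*)"$', '%1'):gsub('\\+', '\\')
            return ' \1' .. #links .. '\1 '
        end)
        body = body
            :gsub('\\%a+%-?%d*', '')         -- remove control words such as \embo0
            :gsub('%s+', ' ')                -- collapse the whitespace left behind
            :gsub('^ ', ''):gsub(' $', '')   -- trim both ends
        -- Put the hyperlink targets back.
        str = body:gsub('\1(%d+)\1', function(i) return links[tonumber(i)] end)
    end
    return str
end

-- Build "time / sender: text" blocks for every <message> in the document.
local function extract_text(input_text)
    local out = {}
    -- '<message ' (with the space) avoids matching the enclosing <messages> tag.
    local pattern = '<message .-Time="(.-)".-<from>(.-)</.-<to>(.-)</.-'
        .. '<content Type="(.-)".->(.-)</content'
    for time, from, _to, content_type, content in input_text:gmatch(pattern) do
        out[#out + 1] = time .. '\n' .. from .. ': ' .. convert_string(content_type, content)
    end
    return table.concat(out, '\n\n')
end

-- 'your_file_name' is a placeholder for the path of the exported conversation.
local file = assert(io.open('your_file_name'))
print(extract_text(file:read('*a')))
file:close()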
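
The code above does not print the From:/To: header that appears at the top of the desired output. A minimal sketch of one way to build it from the <participants> block follows. The helper name `participant_header` and the example addresses are hypothetical, and treating the first participant as the sender is an assumption, not something the data format guarantees.

```lua
-- Hypothetical helper: build the header from the <participant> names.
local function participant_header(input_text)
    local names = {}
    -- '<participant>' does not match the enclosing '<participants>' tag.
    for name in input_text:gmatch('<participant>%s*<name>(.-)</name>') do
        names[#names + 1] = name
    end
    -- Assumption: first participant is the sender, second is the recipient.
    return 'From: ' .. (names[1] or '?') .. '\nTo: ' .. (names[2] or '?')
end

-- Made-up addresses, used only for illustration.
local sample = [[
<participants>
    <participant>
        <name>alice@example.com</name>
    </participant>
    <participant>
        <name>bob@example.com</name>
    </participant>
</participants>]]
print(participant_header(sample))
```

If the initiator should always be listed first, the InitiatedBy attribute of the <conversation> element may be a more reliable source than participant order.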