2010-06-30 45 views

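What is the best way to parse Gmail chat logs from the web page that displays them? As far as I know, this is still the only way to access server-hosted Gmail chat logs (through either desktop Gmail or mobile Gmail). How can I parse Gmail chat logs out of the web page?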

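Looking at the generated source of a conversation, the markup appears to be nested divs and spans (divs elsewhere on the page have randomized two-character ids and classes with no apparent pattern). Here is an excerpt of one line that has a timestamp on the left: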

<div> 
<span style="display:block;float:left;color:#888"> 
2:56 PM&nbsp; 
</span> 

<span style="display:block;padding-left:6em"> 
<span> 

<span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs 

</span> 
</span> 
</div> 

Not every line has a timestamp, though; lines without one appear to have non-breaking spaces in its place:

<div> 
<span style="display:block;float:left;color:#888"> 
&nbsp;&nbsp; 
</span> 

<span style="display:block;padding-left:6em"> 

<span> 
and reformat that into something like an xml format 
</span> 

</span> 
</div> 

Should I use XPath? Is there something more efficient?

Edit:

Reduced to just the data, this is what it looks like:

12:43 AM John: Something something something. 
     Something something something. 
     me: Something something something? 
12:44 AM Also, something something something. 
12:47 AM Something something something. 
12:48 AM Something something something 
     with something something something. 
12:49 AM John: Something. 

You forgot to mention which nodes you want to select. – 2010-06-30 19:26:32


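I want to get the names, the conversation lines, and the timestamps. So each line would be [time] [name] [what was said], where the time is optional and the name is filled in wherever it is not written explicitly. – chimerical 2010-06-30 20:11:54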

Answers


Should I use XPath? Is there something more efficient?

I would use Ruby with the Nokogiri library, which gives you more flexibility than XPath/XSLT alone:

#!/usr/bin/ruby 
require 'rubygems' 
require 'nokogiri' 

src = <<EOS 
<div> 
    <span style="display:block;float:left;color:#888"> 
     2:56 PM&nbsp; 
    </span> 
    <span style="display:block;padding-left:6em"> 
     <span> 
      <span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs 
     </span> 
    </span> 
    <span style="display:block;float:left;color:#888"> 
     &nbsp;&nbsp; 
    </span> 
    <span style="display:block;padding-left:6em"> 
     <span> 
      and reformat that into something like an xml format 
     </span> 
    </span> 
</div> 
EOS 

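chatlog = [] 
last_timestamp = nil 
doc = Nokogiri::HTML(src) 

# Walk the spans in document order: a span styled with color: holds the 
# timestamp, and the span styled with padding-left: after it holds the message. 
doc.xpath('//div/span').each do |span| 
    style = span['style'].to_s # empty string when the span has no style attribute 

    if style.include?('color:') 
     last_timestamp = span.content.strip 
    elsif style.include?('padding-left:') 
     chatlog << {:timestamp => last_timestamp, :message => span.content.strip} 
    end 
end 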

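# Build the XML; the block parameter is named xml so it is not confused with doc above. 
builder = Nokogiri::XML::Builder.new do |xml| 
    xml.chatlog { 
     chatlog.each do |line| 
      xml.line { 
       xml.time line[:timestamp] 
       xml.message line[:message] 
      } 
     end 
    } 
end 

puts builder.to_xml 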

This returns:

<?xml version="1.0" encoding="UTF-8"?> 
<chatlog> 
    <line> 
    <time>2:56 PM </time> 
    <message>me: i'm trying to think of a good way to parse gmail chat logs</message> 
    </line> 
    <line> 
    <time>  </time> 
    <message>and reformat that into something like an xml format</message> 
    </line> 
</chatlog>
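
The second `<time>` above comes out as non-breaking spaces, because Ruby's `String#strip` does not treat U+00A0 as whitespace. The comments on the question ask for [time] [name] [what was said], with missing values filled in, so a post-processing pass over the `chatlog` records might look like the sketch below. It uses plain Ruby only. `normalize_chatlog` and `NBSP` are hypothetical names introduced here, and the rule for spotting a speaker (a leading token without spaces or colons, followed by `: `) is an assumption that could misread a message such as "note: something".

```ruby
# Hypothetical helper, not part of Nokogiri: fills in missing timestamps and
# speaker names on records shaped like the chatlog array built above.
NBSP = "\u00A0"

def normalize_chatlog(records)
  last_time = nil
  last_name = nil

  records.map do |rec|
    # Turn non-breaking spaces into ordinary spaces so strip can remove them.
    time = rec[:timestamp].to_s.gsub(NBSP, ' ').strip
    last_time = time unless time.empty?

    text = rec[:message].to_s.gsub(NBSP, ' ').strip
    # Assumption: a speaker prefix is one token with no spaces or colons, then ": ".
    if (m = text.match(/\A([^\s:]+):\s+(.*)\z/m))
      last_name = m[1]
      text = m[2]
    end

    { time: last_time, name: last_name, text: text }
  end
end

sample = [
  { timestamp: "2:56 PM\u00A0",
    message: "me: i'm trying to think of a good way to parse gmail chat logs" },
  { timestamp: "\u00A0\u00A0",
    message: 'and reformat that into something like an xml format' }
]

normalize_chatlog(sample).each do |line|
  puts "[#{line[:time]}] #{line[:name]}: #{line[:text]}"
end
```

Each record keeps the most recent timestamp and speaker seen so far, so a line with neither inherits both from the line before it.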