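R XML to DataFrame subsetting
I have a bunch of chat logs in XML format. A sample record is listed below. I don't need the whole record, only three things. The first is the @realTimeID attribute. The second and third are the values of varValue where source == "PostChat". One of these holds a numeric value from 1 to 10. There may be a second value holding a text entry. Only a small number of records contain these "PostChat" values.
What I want is a data frame with a column for realTimeID, followed by two columns for the numeric value and the optional text value. Even a data frame with just a realTimeID column and a second column holding the values would be enough for me to work with.
Here is some sample data: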
<Report account="12345" end_time="2016-07-01T00:00:59+00:00" limit="10000" more_sessions="true" start_time="2016-06-11T00:00:00+00:00" user="smith">
<Session id="ID1536678170" realTimeID="4768543970">
<Visitor id="1131902386012684">
<ip>123.456.789</ip>
<agent>Chrome 51.0.2704.63</agent>
<host/>
<chatReferer> foo </chatReferer>
<GeoInfo>
<geoCity/>
<geoConType/>
<geoCountry>USA</geoCountry>
<geoIP>123.456.789</geoIP>
<geoISP>USA ISP</geoISP>
<geoOrg>NA</geoOrg>
<geoPost/>
<geoReg/>
</GeoInfo>
</Visitor>
<Chat end_time="2016-06-11T21:46:14+00:00" start_time="2016-06-11T21:25:59+00:00">
<line by="info" time="2016-06-11T21:25:59+00:00">
<Text>Please do not post credit card or other sensitive data in this window. </Text>
</line>
<line by="info" time="2016-06-11T21:26:03+00:00">
<Text>You are now chatting with John.</Text>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:04+00:00">
<HTML><span dir="ltr">Hi sir</span></HTML>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:15+00:00">
<HTML><span dir="ltr">How may i help you ?</span></HTML>
</line>
<line by="you" time="2016-06-11T21:28:16+00:00">
<Text>Hi John. Im Bob. I have a technical question.</Text>
</line>
</Chat>
<VarValues>
<varValue id="ID917165" name="DisconnectedBy" source="Internal" sourceName="null" time="2016-06-11T21:46:14+00:00">RepStoppedChat</varValue>
<varValue id="ID922205" name="language" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">English</varValue>
<varValue id="ID1317606" name="pageLoadTime" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">88 sec</varValue>
<varValue id="ID1323660" name="survey90990357" source="Operator" sourceName="null" time="2016-06-11T21:32:38+00:00">Incomplete (INC) - customer abandons</varValue>
<varValue id="ID1372749" name="LP_Visitor_Category" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">0</varValue>
<varValue id="ID1617100" name="live_engage_control_group" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">false</varValue>
<varValue id="ID3647561" name="rerouteFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3665417" name="operatorName" source="Rule Engine" sourceName="null" time="2016-06-11T21:26:03+00:00">John Doe</varValue>
<varValue id="ID3730453" name="RenameFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3742796" name="PT-EligibleInSession" source="Rule Engine" sourceName="null" time="2016-06-11T21:24:43+00:00">Yes</varValue>
<varValue id="ID3834774" name="survey88140234" source="PostChat" sourceName="null" time="2016-06-13T04:44:54+00:00">10</varValue>
<varValue id="ID3834774" name="survey88140234" source="PostChat" sourceName="null" time="2016-06-13T04:44:54+00:00">Great Experience. Thanks for the help!</varValue>
</VarValues>
<Reps>
<Rep endTime="2016-06-11T21:46:14+00:00" id="ID2447" order="1" repName="John Doe" startTime="2016-06-11T21:26:03+00:00">John</Rep>
</Reps>
</Session>
<Session id="ID1536678170" realTimeID="123456789">
<Visitor id="1131902386012684">
<ip>123.456.789</ip>
<agent>Chrome 51.0.2704.63</agent>
<host/>
<chatReferer> foo </chatReferer>
<GeoInfo>
<geoCity/>
<geoConType/>
<geoCountry>USA</geoCountry>
<geoIP>123.456.789</geoIP>
<geoISP>USA ISP</geoISP>
<geoOrg>NA</geoOrg>
<geoPost/>
<geoReg/>
</GeoInfo>
</Visitor>
<Chat end_time="2016-06-11T21:46:14+00:00" start_time="2016-06-11T21:25:59+00:00">
<line by="info" time="2016-06-11T21:25:59+00:00">
<Text>Please do not post credit card or other sensitive data in this window. </Text>
</line>
<line by="info" time="2016-06-11T21:26:03+00:00">
<Text>You are now chatting with John.</Text>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:04+00:00">
<HTML><span dir="ltr">Hi sir</span></HTML>
</line>
<line by="John" repId="ID2447" time="2016-06-11T21:28:15+00:00">
<HTML><span dir="ltr">How may i help you ?</span></HTML>
</line>
<line by="you" time="2016-06-11T21:28:16+00:00">
<Text>Hi John. Im Bob. I have a technical question.</Text>
</line>
</Chat>
<VarValues>
<varValue id="ID917165" name="DisconnectedBy" source="Internal" sourceName="null" time="2016-06-11T21:46:14+00:00">RepStoppedChat</varValue>
<varValue id="ID922205" name="language" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">English</varValue>
<varValue id="ID1317606" name="pageLoadTime" source="MonitorTag" sourceName="null" time="2016-06-11T21:23:46+00:00">88 sec</varValue>
<varValue id="ID1323660" name="survey90990357" source="Operator" sourceName="null" time="2016-06-11T21:32:38+00:00">Incomplete (INC) - customer abandons</varValue>
<varValue id="ID1372749" name="LP_Visitor_Category" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">0</varValue>
<varValue id="ID1617100" name="live_engage_control_group" source="Internal" sourceName="null" time="2016-06-11T21:23:43+00:00">false</varValue>
<varValue id="ID3647561" name="rerouteFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3665417" name="operatorName" source="Rule Engine" sourceName="null" time="2016-06-11T21:26:03+00:00">John Doe</varValue>
<varValue id="ID3730453" name="RenameFlag" source="Rule Engine" sourceName="null" time="2016-06-11T21:23:46+00:00">true</varValue>
<varValue id="ID3742796" name="PT-EligibleInSession" source="Rule Engine" sourceName="null" time="2016-06-11T21:24:43+00:00">Yes</varValue>
</VarValues>
<Reps>
<Rep endTime="2016-06-11T21:46:14+00:00" id="ID2447" order="1" repName="John Doe" startTime="2016-06-11T21:26:03+00:00">John</Rep>
</Reps>
</Session>
</Report>
I can read the data in with:
library(XML)
dat <- xmlInternalTreeParse("data/sessions6.xml", useInternalNodes = TRUE)
datRoot <- xmlRoot(dat)  # root node used by the XPath calls below
I can extract the realTimeID values with:
foo <- xpathApply(datRoot, "//Session", xmlGetAttr, "realTimeID")
and the varValues I need with:
tmp <- xpathApply(datRoot, "//varValue[@source='PostChat']", xmlValue)
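But I don't know how to join the two and get the realTimeID value associated with each PostChat varValue. I also thought about building a data frame with realTimeID and all of the varValues. I can obviously get a list of all the IDs, but I don't know how to pull the varValues into a data frame. Any help would be greatly appreciated.
Edit: I updated my XML sample to make it more complete. It now includes one session with PostChat values and one without. Thanks!
Will each XML carry only one or two 'varValues' with source 'PostChat'? So a df with one row of 3 variables? – Parfait
Each "record" should have no more than 2 "PostChat" varValues. However, most records won't have any PostChat varValues. I posted the XML for a single "record". Each of my XML files actually contains about 10k of these records. –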
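One possible way to tie the two together, offered as a rough sketch rather than a tested solution for the real files: select only the Session nodes that contain a PostChat varValue, then read the attribute and the values from each of those nodes. The inline XML is a trimmed, made-up stand-in for the sample above. The helper name `post_chat_row` and the column names `score` and `comment` are hypothetical. The sketch also assumes the rating is the only PostChat value that parses as a number, so a text comment consisting only of digits would be misclassified.

```r
library(XML)

# Trimmed, hypothetical stand-in for the sample data above.
xml_text <- '<Report>
  <Session id="ID1" realTimeID="4768543970">
    <VarValues>
      <varValue name="language" source="MonitorTag">English</varValue>
      <varValue name="survey88140234" source="PostChat">10</varValue>
      <varValue name="survey88140234" source="PostChat">Great Experience. Thanks for the help!</varValue>
    </VarValues>
  </Session>
  <Session id="ID2" realTimeID="123456789">
    <VarValues>
      <varValue name="language" source="MonitorTag">English</varValue>
    </VarValues>
  </Session>
</Report>'

doc <- xmlParse(xml_text, asText = TRUE)

# Keep only sessions that have at least one PostChat varValue.
sessions <- getNodeSet(doc, "//Session[VarValues/varValue[@source='PostChat']]")

# Hypothetical helper: build one data frame row for one Session node.
post_chat_row <- function(session) {
  # Relative XPath, so only this session's varValues are read.
  vals <- xpathSApply(session, "./VarValues/varValue[@source='PostChat']", xmlValue)
  # Assumption: the value that parses as a number is the 1-10 rating.
  num <- suppressWarnings(as.numeric(vals))
  data.frame(
    realTimeID = xmlGetAttr(session, "realTimeID"),  # kept as character
    score      = num[!is.na(num)][1],                # NA if no numeric value
    comment    = vals[is.na(num)][1],                # NA if no text value
    stringsAsFactors = FALSE
  )
}

# One row per matching session; NULL if no session has PostChat values.
post_chat <- do.call(rbind, lapply(sessions, post_chat_row))
print(post_chat)
```

For the real data, the inline string would presumably be replaced by the parsed file, e.g. `doc <- xmlParse("data/sessions6.xml")`. Filtering in the XPath keeps the loop limited to the few sessions that actually carry PostChat values, which should matter with around 10k records per file.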