2012-11-20 157 views
5

我想抓取和解析以下RSS提要http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml我已經查看了有關R和XML的其他查詢,並且無法對我的問題取得任何進展。每個條目使用XML packagin解析RSS提要R

 <item> 
    <title><![CDATA[Five Rockets Intercepted By Iron Drone Systems Over Be'er Sheva]]></title> 
    <link>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</link> 
    <description><![CDATA[<a href="http://www.haaretz.com/news/diplomacy-defense/live-blog-rockets-strike-tel-aviv-area-three-israelis-killed-in-attack-on-south-1.477960" target="_hplink">Haaretz reports</a> that five more rockets intercepted by Iron Dome systems over Be'er Sheva. In total, there have been 274 rockets fired and 105 intercepted. The IDF has attacked 250 targets in Gaza.]]></description> 
    <guid>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</guid> 
    <pubDate>2012-11-15T12:56:09-05:00</pubDate> 
    <source url="http://huffingtonpost.com/rss/liveblog/liveblog-1213.xml">Huffingtonpost.com</source> 
    </item> 

對於每個條目/文章,我想記錄「日期」(pubdate的),「標題」(標題),「說明」(清洗全文)XML代碼。我曾嘗試在R中使用xml包,但承認我有點新手(使用XML很少或沒有經驗,但有一些R經驗)。我工作過,並與越來越行不通的代碼是:

library(XML) 

xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml" 

# Use the xmlTreePares-function to parse xml file directly from the web 

xmlfile <- xmlTreeParse(xml.url) 

# Use the xmlRoot-function to access the top node 

xmltop = xmlRoot(xmlfile) 

xmlName(xmltop) 

names(xmltop[[ 1 ]]) 

    title   link description  language  copyright 
    "title"  "link" "description" "language" "copyright" 
category  generator   docs   item   item 
    "category" "generator"  "docs"  "item"  "item" 

但是,每當我試圖操縱和試圖操縱「標題」或「說明」的信息,我不斷收到錯誤。任何幫助解決這個代碼的幫助,將不勝感激。

感謝, 托馬斯

回答

10

我使用的是優秀的Rcurl庫和xpathSApply

這是腳本給你3名列表(標題,pubdates和說明)

library(RCurl) 
library(XML) 
xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml" 
script <- getURL(xml.url) 
doc  <- xmlParse(script) 
titles <- xpathSApply(doc,'//item/title',xmlValue) 
descriptions <- xpathSApply(doc,'//item/description',xmlValue) 
pubdates <- xpathSApply(doc,'//item/pubDate',xmlValue) 
+0

瞭解更多信息,xpathSApply在XML庫中 –