2016-11-21 132 views

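Unable to scrape a news website

I am building a dataset from the following news RSS feed: http://indianexpress.com/section/india/feed/

From this XML I read the following fields:

  • Title
  • Title URL
  • Publication date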

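I now use the title URL to fetch the description (the synopsis shown below the main headline) by visiting each URL and scraping the page.

However, the description vector ends up with length 197, which does not match the length of the other vectors (200). Because of this I cannot create my data frame.

Can someone help me scrape this data reliably?

The code below is reproducible: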

library("httr") 
library("RCurl") 
library("jsonlite") 
library("lubridate") 
library("rvest") 
library("XML") 
library("stringr") 

url = "http://indianexpress.com/section/india/feed/" 

newstopics = getURL(url) 

newsxml = xmlParse(newstopics) 

title <- xpathApply(newsxml, "//item/title", xmlValue) 
title <- unlist(title) 

titleurl <- xpathSApply(newsxml, '//item/link', xmlValue) 
pubdate <- xpathSApply(newsxml, '//item/pubDate', xmlValue) 

t1 = Sys.time() 
desc <- NULL 

for (i in 1:length(titleurl)){ 

    page = read_html(titleurl[i]) 
    temp = html_text(html_nodes(page,'.synopsis')) 
    desc = c(desc,temp) 

} 

print(difftime(Sys.time(), t1, units = 'sec')) 

desc = gsub("\n",' ',desc) 

newsdata = data.frame(title,titleurl,desc,pubdate) 

I get the following error:

Error in data.frame(title, titleurl, desc, pubdate) : 
arguments imply differing number of rows: 200, 197 

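I think the problem is that 'temp' does not return a value on every iteration of the 'for' loop. Try replacing the 'desc' line with 'desc = c(desc, paste0("", temp))' - although more elegant error handling would be preferable. – JasonAizkalns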


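I checked that titleurl is not empty anywhere. I assumed that, since every URL is a link to a news article, each page would certainly have a subheading. –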
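
The mismatch described in the comments can be reproduced without touching the network. The sketch below uses a hypothetical list, `page_results`, to stand in for what `html_text(html_nodes(page, '.synopsis'))` might return for three pages, assuming the second page has no `.synopsis` element. It contrasts the appending pattern from the question with a preallocated vector that keeps one slot per URL.

```r
# Hypothetical stand-ins for the per-page scraping results.
# The second page has no match, so its result is a zero-length character vector.
page_results <- list("First synopsis", character(0), "Third synopsis")

# Pattern from the question: c() silently drops zero-length vectors,
# so the result ends up shorter than the number of pages.
desc_bad <- NULL
for (temp in page_results) {
  desc_bad <- c(desc_bad, temp)
}
length(desc_bad)

# Length-preserving pattern: preallocate with NA and fill a slot
# only when something was actually found on that page.
desc_ok <- rep(NA_character_, length(page_results))
for (i in seq_along(page_results)) {
  temp <- page_results[[i]]
  if (length(temp) > 0) {
    desc_ok[i] <- temp[1]
  }
}
length(desc_ok)
```

With the second pattern, rows whose page has no synopsis carry `NA` instead of disappearing, so `data.frame()` receives vectors of equal length.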

Answer


You can do the following:

library(tidyverse) 
library(xml2) 
library(rvest) 

feed <- read_xml("http://indianexpress.com/section/india/feed/") 

# helper function to extract information from the item node 
item2vec <- function(item){ 
    tibble(title = xml_text(xml_find_first(item, "./title")), 
           link = xml_text(xml_find_first(item, "./link")), 
           pubDate = xml_text(xml_find_first(item, "./pubDate"))) 
} 

dat <- feed %>% 
    xml_find_all("//item") %>% 
    map_df(item2vec) 

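# The following takes a while: it makes one HTTP request per article.
# html_node() returns exactly one result per page (NA if .synopsis is missing),
# so desc always has one entry per row.
dat <- dat %>% 
    mutate(desc = map_chr(link, ~ read_html(.x) %>% html_node(".synopsis") %>% html_text()))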

This gives you a data.frame/tibble with 4 columns:

> glimpse(dat) 
Observations: 200 
Variables: 4 
$ title <chr> "Common man has no problem with note ban, says Santosh Gangwar", "Bombay High Court comes... 
$ link <chr> "http://indianexpress.com/article/india/india-news-india/demonetisation-note-ban-cash-cru... 
$ pubDate <chr> "Mon, 21 Nov 2016 20:04:21 +0000", "Mon, 21 Nov 2016 20:01:43 +0000", "Mon, 21 Nov 2016 1... 
$ desc <chr> "MoS for Finance speaks to Indian Express in Bareilly, his Lok Sabha constituency.", "The... 
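
The reason this version keeps all 200 rows appears to be the switch from `html_nodes()` to `html_node()`. A minimal offline sketch of the difference follows; the two HTML fragments are made up for illustration and are not taken from the actual site.

```r
library(rvest)

# Hypothetical page fragments: only the first one contains a .synopsis element.
with_synopsis <- read_html(
  '<html><body><h2 class="synopsis">A short summary</h2></body></html>'
)
without_synopsis <- read_html(
  '<html><body><h1>Headline only</h1></body></html>'
)

# html_nodes() returns every match, so a page without a match
# yields a zero-length result.
n_all <- length(html_text(html_nodes(without_synopsis, ".synopsis")))

# html_node() returns exactly one result per page;
# a missing match becomes NA after html_text().
first_hit <- html_text(html_node(with_synopsis, ".synopsis"))
first_miss <- html_text(html_node(without_synopsis, ".synopsis"))
```

Because `html_node()` always yields a single value, `map_chr()` can place it directly into the matching row, and articles without a synopsis show up as `NA` in `desc`.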

PS: To get all the information for each item, you can use:

dat <- feed %>% 
    xml_find_all("//item") %>% 
    map_df(~xml_children(.) %>% {set_names(xml_text(.), xml_name(.))} %>% t %>% as_tibble)
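
The same idea can be tried offline on a small inline feed. This is only a sketch: the XML below is invented, `item_to_row` is a hypothetical helper name, and it assumes tag names are unique within each item (repeated tags, such as several `category` elements, would need extra handling).

```r
library(xml2)
library(purrr)
library(tibble)
library(dplyr)

# Invented two-item feed; the second item deliberately lacks pubDate.
feed <- read_xml('<rss><channel>
  <item><title>A</title><link>http://example.com/a</link><pubDate>Mon, 21 Nov 2016 20:04:21 +0000</pubDate></item>
  <item><title>B</title><link>http://example.com/b</link></item>
</channel></rss>')

# Hypothetical helper: turn every child tag of an item into a column.
item_to_row <- function(item) {
  children <- xml_children(item)
  # Named list: names are the tag names, values are the tag text.
  as_tibble(set_names(as.list(xml_text(children)), xml_name(children)))
}

# bind_rows() fills columns that an item does not have with NA.
all_fields <- bind_rows(map(xml_find_all(feed, "//item"), item_to_row))
```

Building one single-row tibble per item and binding them keeps the row count equal to the number of items even when some tags are absent.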