2016-01-21

Scraping information in R

I need to get some data from a web page, and I am trying to extract it with R.

Because the information is spread over several pages, I first wrote this code:

require(XML)
contador <- 1:200
for (i in contador) {
  # note: each iteration overwrites myURL, so only the last URL (p=200) is kept
  myURL <- paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=", i, sep = "")
}

Second, I read the page into web_url with the following code:

require(RCurl)  # getURL() comes from the RCurl package
web_url <- getURL(myURL)
tc <- textConnection(web_url)
web_url <- readLines(tc)
close(tc)
webtree <- htmlTreeParse(web_url, error = function(...) {})
body <- webtree$children$html$children$body
body

However, when I run the following command I get an error:

precio<-xpathSApply(body,"//li[@class='label label-secondary text-bold']",xmlValue) 

Input is not proper UTF-8, indicate encoding ! 
Bytes: 0xC2 0x3C 0x2F 0x64 
Sequence ']]>' not allowed in content 
Sequence ']]>' not allowed in content 
internal error: detected an error in element content 

I have tried different options, but I cannot get this information out.

Thanks for your comments!

Answer


I guess your XPath is broken: it selects li elements. Assuming you want to read the spans with class='label label-secondary text-bold', you can use //span[contains(concat(" ", @class, " "), concat(" ", "text-bold", " "))] as the XPath.
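To see why the class value is padded with spaces before matching, here is a minimal sketch run against a small inline HTML fragment. The markup is invented for illustration and is not copied from the site; the only assumption is that the prices sit in span elements carrying the text-bold class.

```r
library(rvest)  # also provides the %>% pipe

# Hypothetical markup, made up for this example.
html <- '<html><body>
  <ul>
    <li class="label label-secondary text-bold">not a span</li>
  </ul>
  <span class="label label-secondary text-bold">51.000 EUR</span>
  <span class="text-bolder">no match</span>
  <span class="text-bold">11.000 EUR</span>
</body></html>'

doc <- read_html(html)

# Padding @class with spaces makes " text-bold " match only as a whole
# class token, so "text-bolder" is not picked up by accident.
xp <- '//span[contains(concat(" ", @class, " "), " text-bold ")]'

prices <- doc %>% html_nodes(xpath = xp) %>% html_text()
print(prices)
```

A plain contains(@class, "text-bold") would also match the text-bolder span, which is the reason for the concat padding.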

Using rvest:

require(rvest)
i <- 1  # page number
myURL <- paste("http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=", i, sep = "")
doc <- read_html(myURL)
# select every span whose class list contains the token "text-bold"
doc %>% html_nodes(xpath='//span[contains(concat(" ", @class, " "), concat(" ", "text-bold", " "))]') %>% html_text()

Running it gives:

[1] "51.000 €" "11.000 €" "50.000 €" "25.900 €" "48.000 €" "100.000 €" "60.000 €" "25.000 €" "20.888 €" 
[10] "29.999 €" "26.000 €" "11.000 €" "42.500 €" "12.000 €" "41.000 €" "30.500 €" "40.000 €" 

You can do this for several pages with lapply, as follows:

doc <- lapply(1:10, function(x, base_url){ 
    read_html(paste0(base_url,x)) 
}, "http://www.europa-mop.com/excavadoras-usadas/2-1/anuncios-excavadoras.html?p=") 

lapply(doc, . %>% html_nodes(xpath='//span[contains(concat(" ", @class, " "), concat(" ", "text-bold", " "))]') %>% html_text()) 

which gives you a list of character vectors, one per page.
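If a single vector is more convenient than a list, the per-page results can be flattened with unlist. The sketch below uses a hand-written stand-in for the list (values taken from the sample output above) so that it runs without network access. The numeric conversion is an assumption: it treats "." as a thousands separator, which may not hold for every listing.

```r
# Stand-in for the list returned by the lapply() call: one character
# vector per page. The name per_page is hypothetical.
per_page <- list(
  c("51.000 €", "11.000 €", "50.000 €"),
  c("25.900 €", "48.000 €")
)

# Collapse the per-page vectors into one character vector.
all_prices <- unlist(per_page)

# Assumption: "." is a thousands separator, so dropping every non-digit
# character leaves the amount in whole euros.
amounts <- as.numeric(gsub("[^0-9]", "", all_prices))
print(amounts)
```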