2017-10-04

Rvest: scraping data when an element does not exist

I am having a hard time extracting values because some pages are missing a tag: result-cats.

I have already looked at this question here, but I am still not able to scrape the data.

HTML

<div class="result "> 
    <span class="result-txt"> 

     <span class="result-name"> 
      <a href="/some/value/">COMPANY_NAME</a> 
      <a class="result-icons" href="/some/value/COMPANY_NAME_/"> 
       <span title="Info" class="sprite sprite-info">Info</span> 
       <span title="Phone" class="sprite sprite-phone">Phone</span> 
      </a> 
     </span> 

     <em> 
      <a href="/some/value/">LOCATION</a> 
      <span> ADDRESS </span> 
     </em> 

     <span class="result-cats"> 
      <a href="/some/value/" title="CAT1">CAT1</a> 
      <a href="/some/value/" title="CAT2">CAT2</a> 
     </span> 

    </span> 
</div> 

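I tried the code below, but it throws an error because some results have no result-cats tag. As a result, the vectors passed to the data frame have mismatched lengths.

Code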

library(rvest) 
library(XML) 
library(stringi) 

df <- data.frame(CompanyName = NULL, CompanyLink = NULL, Address = NULL, cats = NULL) 

for(i in 1:100){ 

    print(paste("Page: ", i, sep = "")) 

    url <- "url.com" 
    page <- read_html(url) 

    CompanyNameNode <- html_nodes(page,'.result-name a:nth-child(1)') 
    CompanyName <- html_text(CompanyNameNode) 
    CompanyLink <- html_attr(CompanyNameNode, 'href') 

    Address <- html_text(html_nodes(page,'.result-txt em')) 
    Address <- gsub("[\r\n]", "", Address) 

    cats <- html_text(html_nodes(page,'.result-cats')) 
    cats <- stri_trim(cats) 
    cats <- gsub("[\r\n]", ", ", cats) 

    df <- rbind(df, data.frame(CompanyName = CompanyName, 
          CompanyLink = CompanyLink, 
          Address = Address, 
          cats = cats)) 

} 
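To see where the mismatch comes from, here is a minimal sketch on a made-up three-result page. The HTML, company names, and links are hypothetical and only mimic the structure shown above. A page-wide html_nodes() call returns only the nodes that exist, so the category vector ends up shorter than the name vector.

```r
library(rvest)

# Hypothetical page: three results, the second one has no .result-cats block
html <- '
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/a/">Company A</a></span>
  <span class="result-cats"><a href="/c/">CAT1</a></span>
</span></div>
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/b/">Company B</a></span>
</span></div>
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/c/">Company C</a></span>
  <span class="result-cats"><a href="/c/">CAT2</a></span>
</span></div>'

page <- read_html(html)

# Page-wide selectors simply skip results where the element is absent
company_names <- html_text(html_nodes(page, ".result-name a:nth-child(1)"))
cats <- html_text(html_nodes(page, ".result-cats"))

n_names <- length(company_names)
n_cats <- length(cats)

# data.frame() rejects columns whose lengths cannot be recycled to match;
# capture the error message instead of stopping the script
result <- tryCatch(
  data.frame(CompanyName = company_names, cats = cats),
  error = function(e) conditionMessage(e)
)
```

Here result holds the error message rather than a data frame, because three names cannot be paired with two category strings. Worse, even if the lengths happened to line up, the categories would no longer be attached to the right company.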

UPDATE: Using the following code solved the problem:

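# One node per search result, so every result yields exactly one value
pg <- html_nodes(page, '.result-txt')
cats <- rep(NA_character_, length(pg))

for (j in seq_along(pg)) {
    # Look for the category block inside this result only
    cat_text <- html_text(html_nodes(pg[j], '.result-cats'))
    # Leave NA in place when the result has no .result-cats element
    if (length(cat_text) > 0) cats[j] <- cat_text[1]
}

cats <- stri_trim(cats)
cats <- gsub("[\r\n]", ", ", cats)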

Answer

Use the following code:

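# One node per search result, so every result yields exactly one value
pg <- html_nodes(page, '.result-txt')
cats <- rep(NA_character_, length(pg))

for (j in seq_along(pg)) {
    # Look for the category block inside this result only
    cat_text <- html_text(html_nodes(pg[j], '.result-cats'))
    # Leave NA in place when the result has no .result-cats element
    if (length(cat_text) > 0) cats[j] <- cat_text[1]
}

cats <- stri_trim(cats)
cats <- gsub("[\r\n]", ", ", cats)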

This solved the problem.
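
A possibly shorter variant, offered as a sketch rather than a tested replacement for the scraper above: when html_node() (singular) is applied to a set of parent nodes, it returns one entry per parent and uses a missing node where nothing matches, so html_text() and html_attr() give NA for that result. Newer rvest releases call the same functions html_element() and html_elements(). The HTML below is hypothetical and only imitates the structure from the question.

```r
library(rvest)

# Hypothetical page: the second result has no .result-cats block
html <- '
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/a/">Company A</a></span>
  <span class="result-cats"><a href="/c/">CAT1</a></span>
</span></div>
<div class="result"><span class="result-txt">
  <span class="result-name"><a href="/b/">Company B</a></span>
</span></div>'

page <- read_html(html)

# Select the per-result containers first
results <- html_nodes(page, ".result-txt")

# html_node() keeps one entry per container, missing where absent
name_nodes <- html_node(results, ".result-name a")
cat_nodes <- html_node(results, ".result-cats")

# All columns now have one value per result, so the lengths always agree
df <- data.frame(
  CompanyName = html_text(name_nodes),
  CompanyLink = html_attr(name_nodes, "href"),
  cats = trimws(html_text(cat_nodes)),
  stringsAsFactors = FALSE
)
```

Because every column is derived from the same results node set, each row stays aligned with its company, and results without categories simply get NA in the cats column.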