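Rvest: scraping data when an element does not exist
I am having a hard time extracting values because some pages are missing a tag: result-cats.
I have already looked at this question here, but I still cannot scrape the data.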
HTML:
<div class="result ">
<span class="result-txt">
<span class="result-name">
<a href="/some/value/">COMPANY_NAME</a>
<a class="result-icons" href="/some/value/COMPANY_NAME_/">
<span title="Info" class="sprite sprite-info">Info</span>
<span title="Phone" class="sprite sprite-phone">Phone</span>
</a>
</span>
<em>
<a href="/some/value/">LOCATION</a>
<span> ADDRESS </span>
</em>
<span class="result-cats">
<a href="/some/value/" title="CAT1">CAT1</a>
<a href="/some/value/" title="CAT2">CAT2</a>
</span>
</span>
</div>
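I tried the code below, but it gives me an error because some results do not have the result-cats tag. As a result, the vectors passed to the data frame have mismatched lengths.
Code: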
library(rvest)
library(XML)
library(stringi)
df <- data.frame(CompanyName = NULL, CompanyLink = NULL, Address = NULL, cats = NULL)
for(i in 1:100){
print(paste("Page: ", i, sep = ""))
url <- "url.com"
page <- read_html(url)
CompanyNameNode <- html_nodes(page,'.result-name a:nth-child(1)')
CompanyName <- html_text(CompanyNameNode)
CompanyLink <- html_attr(CompanyNameNode, 'href')
Address <- html_text(html_nodes(page,'.result-txt em'))
Address <- gsub("[\r\n]", "", Address)
cats <- html_text(html_nodes(page,'.result-cats'))
cats <- stri_trim(cats)
cats <- gsub("[\r\n]", ", ", cats)
df <- rbind(df, data.frame(CompanyName = CompanyName,
CompanyLink = CompanyLink,
Address = Address,
cats = cats))
}
UPDATE: Using the following code:
pg <- html_nodes(page,'.result-txt')
cats <- NULL
for(j in seq_along(pg)){  # seq_along() avoids iterating over 1:0 when pg is empty
cats[j] <- ifelse(length(html_text(html_nodes(pg[j],'.result-cats')))==0,
NA,
html_text(html_nodes(pg[j],'.result-cats')))
}
cats <- stri_trim(cats)
cats <- gsub("[\r\n]", ", ", cats)
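The same idea can probably be written without the explicit loop. Below is a minimal sketch, assuming the page structure shown above: it selects each `.result-txt` block first and then calls `html_node()` (singular) on that node set, which returns one entry per block and a missing node where nothing matches, so `html_text()` yields `NA` for results without `.result-cats`. The inline HTML string and its company names are made-up sample data for illustration, not the real page.

```r
library(rvest)

# Hypothetical sample page: the second result has no .result-cats span.
html <- '
<div class="result">
  <span class="result-txt">
    <span class="result-name"><a href="/a/">Company A</a></span>
    <em><a href="/loc/">City A</a><span> Street 1 </span></em>
    <span class="result-cats"><a href="/c1/">CAT1</a> <a href="/c2/">CAT2</a></span>
  </span>
</div>
<div class="result">
  <span class="result-txt">
    <span class="result-name"><a href="/b/">Company B</a></span>
    <em><a href="/loc/">City B</a><span> Street 2 </span></em>
  </span>
</div>'
page <- read_html(html)

# One node per result block; everything else is looked up relative to it.
results <- html_nodes(page, ".result-txt")

# html_node() keeps the length of `results`, with a missing node
# wherever the selector does not match inside a block.
name_nodes <- html_node(results, ".result-name a")
cat_nodes  <- html_node(results, ".result-cats")
addr_nodes <- html_node(results, "em")

# Collapse runs of whitespace and trim; NA values pass through unchanged.
squish <- function(x) trimws(gsub("\\s+", " ", x))

df <- data.frame(
  CompanyName = html_text(name_nodes),
  CompanyLink = html_attr(name_nodes, "href"),
  Address     = squish(html_text(addr_nodes)),
  cats        = squish(html_text(cat_nodes)),
  stringsAsFactors = FALSE
)
print(df)
```

Because every column is derived from the same `results` node set, all vectors have the same length and `data.frame()` no longer complains. In rvest 1.0.0 and later, `html_element()` and `html_elements()` are the preferred names for `html_node()` and `html_nodes()`; the older names used here should still work.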