I am working with the following website: http://www.crowdrise.com/skollsechallenge. R: attempting to web-scrape it with unlist(xpathSApply()) returns NULL.
Specifically, there are 57 crowdfunding campaigns on this page. Each campaign has text explaining why it is raising money, the total raised so far, and the team members. Some campaigns also specify a fundraising goal. I want to write R code that grabs this information from each of the 57 sites.
To build a table containing all of this information for each of the 57 campaigns, I first wrote code that extracts the name of each campaign:
#import packages
library("RCurl")
library("XML")
library("stringr")
url <- "http://www.crowdrise.com/skollSEchallenge"
url.data <- readLines(url)
#the resulting url.data is a character string
#remove carriage returns, tabs, and newlines
url.data <- gsub('[\r\t\n]', '', url.data)
index.list <- grep("username:",url.data)
#index.list is an integer vector giving the positions in url.data that contain
#the name of each of the 57 campaigns
length.index.list<-length(index.list)
length.index.list
vec <-vector()
#store the 57 usernames in one vector
for(i in 1:length.index.list){
  username <- url.data[index.list[i]]
  real.username <- gsub("username:", "", username)
  vec[i] <- real.username
}
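As an aside, because gsub is vectorized over its input, the loop above can be written without explicit indexing. A minimal sketch using toy stand-in lines (the real values would come from url.data[index.list]; the quote/comma stripping mirrors the clean-up done later in the scraping loop):

```r
# toy stand-ins for the matched lines of url.data (hypothetical values)
matched.lines <- c("username: 'team-one',", "username: 'team-two',")

# strip the "username:" label, then quotes/commas, then surrounding whitespace
usernames <- gsub("username:", "", matched.lines)
usernames <- gsub("[',]", "", usernames)
usernames <- trimws(usernames)
usernames
# "team-one" "team-two"
```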
Then I tried to write a loop that has R visit each of the 57 campaign pages and scrape it.
# Extract all necessary paragraphs. Unlist flattens the list to
#create a character vector.
data <- list()   #initialize the list before filling it inside the loop
for(i in 1:length(vec)){
end.name<-gsub('\'','',vec[i])
end.name<-gsub(',','',end.name)
end.name<-gsub(' ','',end.name)
user.address <- paste0("http://www.crowdrise.com/skollSEchallenge/", end.name)
user.url<-getURL(user.address)
html <- htmlTreeParse(user.url, useInternalNodes = TRUE)
website.donor<-unlist(xpathSApply(html,'//div[@class="grid1-4 "]//h4', xmlValue))
website.title<-unlist(xpathSApply(html,'//div[@class="project_info"]',xmlValue))
website.story<-unlist(xpathSApply(html,'//div[@id="thestory"]',xmlValue))
website.fund<-unlist(xpathSApply(html,'//div[@class="clearfix"]',xmlValue))
#(NOTE: doc.text<- readHTMLTable(webpage1) doesn't work
#due to the poor html structure of the website)
# Replace all \n by spaces, and eliminate all \t
website.donor <- gsub('\\n', ' ', website.donor)
website.donor <- gsub('\\t','',website.donor)
website.title <- gsub('\\n', ' ', website.title)
website.title <- gsub('\\t','',website.title)
website.story <- gsub('\\n', ' ', website.story)
website.story <- gsub('\\t','',website.story)
website.fund <- gsub('\\n', ' ', website.fund)
website.fund <- gsub('\\t','',website.fund)
## all those tabs and spaces are just white spaces that we can trim
website.title <- str_trim(website.title)
website.fund <- str_trim(website.fund)
website.data<- cbind(website.title, website.story, website.fund, website.donor)
data[[i]]<- website.data
Sys.sleep(1)
}
data <- data.frame(do.call(rbind, data), stringsAsFactors = FALSE)
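One way to diagnose the NULLs: before blaming the XPath expressions, check whether the class names and ids you query for even appear in the raw HTML that getURL returned. If they are absent from the raw string, the content is most likely injected client-side by JavaScript and never reaches R. A minimal, self-contained sketch on an in-memory document (the div content and the missing id are made up for illustration):

```r
library(XML)

# hypothetical server response: note there is no element with id "thestory"
raw.html <- "<html><body><div class='project_info'>About us</div></body></html>"

# check the raw string first, before parsing
grepl('thestory', raw.html, fixed = TRUE)       # FALSE: never sent by the server
grepl('project_info', raw.html, fixed = TRUE)   # TRUE

doc <- htmlTreeParse(raw.html, useInternalNodes = TRUE, asText = TRUE)
xpathSApply(doc, '//div[@id="thestory"]', xmlValue)         # zero-length: no match
xpathSApply(doc, '//div[@class="project_info"]', xmlValue)  # "About us"
```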
The commands
unlist(xpathSApply(html,'//div[@class="grid1-4 "]//h4', xmlValue))
unlist(xpathSApply(html,'//div[@class="project_info"]',xmlValue))
unlist(xpathSApply(html,'//div[@id="thestory"]',xmlValue))
unlist(xpathSApply(html,'//div[@class="clearfix"]',xmlValue))
are giving me NULL values. Why do they come back NULL, and how can I fix it?
Thank you,
At a first guess, if the XPath query does not match anything, xpathSApply (which probably shouldn't be wrapped in unlist, but that's not the problem here) will return an empty list. More broadly, though, I'd say you should restructure this into a specific R/XML-related question; right now it is very narrow and tied to the particular way you wrote your scraper rather than to a general problem. –
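To illustrate the commenter's point, plus one likely culprit in the queries above: xpathSApply returns a zero-length result when nothing matches, and an exact @class comparison must match the attribute string byte-for-byte, trailing spaces and extra class names included (note the trailing space in '//div[@class="grid1-4 "]'). An XPath contains() test is more forgiving, though it would also match longer names such as "grid1-40". A minimal sketch on an in-memory snippet with a made-up second class:

```r
library(XML)

# the class attribute carries a second class, so an exact comparison fails
doc <- htmlTreeParse("<div class='grid1-4 extra'><h4>Jane</h4></div>",
                     useInternalNodes = TRUE, asText = TRUE)

xpathSApply(doc, '//div[@class="grid1-4"]//h4', xmlValue)
# zero-length: "grid1-4" is not equal to "grid1-4 extra"

xpathSApply(doc, '//div[contains(@class,"grid1-4")]//h4', xmlValue)
# "Jane"
```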