2013-11-26 31 views
3

What is the right XPath to scrape this web page?

I am trying to get the list of selectors on this page:

$("#Lastname"),$(".intro"),.... 

Here is what I tried using xpathSApply:

library(XML) 
library(RCurl) 
a <- getURL('http://www.w3schools.com/jquery/trysel.asp') 
doc <- htmlParse(a) 
xpathSApply(doc,'//*[@id="selectorOptions"]') ## I can't get the right xpath 

I also tried this, without success:

xpathSApply(doc,'//*[@id="selectorOptions"]/div[i]') 

EDIT: I added the python tag because I also accept Python solutions.

+0

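JavaScript is running on this page to create the content you are looking for. For example `var w3SelDescriptions = []; w3SelDescriptions.push('The element with id="Lastname"');` You would need to get the page after the JavaScript has run, from a browser or something similar. – jdharrison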

+0

@jdharrison I'm afraid I don't understand your point. Do you mean the selectors are created by this call: `onload="w3jQuerySelectorLoad()"`? – agstudy

+0

The list of selectors is created by a piece of JavaScript code. – jdharrison
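The point made in these comments can be illustrated with a small sketch. It uses a hypothetical miniature page (not the real w3schools markup) in which the `selectorOptions` container is empty in the served HTML and would only be filled by a script in a browser. A static parser, here Python's standard `html.parser`, therefore finds no text inside the container, which is consistent with the XPath queries above returning nothing useful. The class name `ContainerText` and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical miniature page: the container is empty in the served HTML and
# would only be filled in by the script once a browser runs it.
STATIC_HTML = """
<html><body onload="fill()">
<div id="selectorOptions"></div>
<script>
var w3Sels = [];
w3Sels.push("#Lastname");
w3Sels.push(".intro");
function fill() { /* would write w3Sels into #selectorOptions */ }
</script>
</body></html>
"""


class ContainerText(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = ContainerText("selectorOptions")
parser.feed(STATIC_HTML)
static_text = "".join(parser.chunks).strip()
print(repr(static_text))  # the container holds no text before JavaScript runs
```

The selectors do appear in the static source, but only as string literals inside the `<script>` element, not as child nodes of the container.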

Answers

4

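Here is an R way to scrape JavaScript-driven pages like this one. You need to use a browser, as @Peyton pointed out. A Selenium server is a good way to control a browser. I have written some R bindings for the Selenium server at https://github.com/johndharrison/RSelenium

The following will allow one to access the post-JavaScript source: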

# install the RSelenium bindings from GitHub (requires devtools)
library(devtools)
devtools::install_github("johndharrison/RSelenium")
library(RSelenium)
library(RJSONIO)

# one needs to have an active server running
# the following commented out lines source the latest java binary
# RSelenium::checkForServer()
# RSelenium::startServer()
# a selenium server is assumed to be running now

remDR <- remoteDriver$new()
remDR$open() # opens a browser, usually firefox, with default settings
remDR$navigate('http://www.w3schools.com/jquery/trysel.asp') # navigate to your page
webElem <- remDR$findElements(value = "//*[@id='selectorOptions']") # find your elements

# display the appropriate quantities 
cat(fromJSON(webElem[[1]]$getElementText())$value) 
> cat(fromJSON(webElem[[1]]$getElementText())$value) 
$("#Lastname") 
$(".intro") 
$(".intro, #Lastname") 
$("h1") 
$("h1, p") 
$("p:first") 
$("p:last") 
$("tr:even") 
$("tr:odd") 
$("p:first-child") 
$("p:first-of-type") 
$("p:last-child") 
$("p:last-of-typ 
..................... 

UPDATE:

A simpler way to access the information in this case is to use the executeScript method:

library(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver$new()
remDr$open()
remDr$navigate('http://www.w3schools.com/jquery/trysel.asp')
remDr$executeScript("return w3Sels;")[[1]]

> remDr$executeScript("return w3Sels;")[[1]] 
[1] "#Lastname"    ".intro"     
[3] ".intro, #Lastname"  "h1"      
[5] "h1, p"     "p:first"    
[7] "p:last"     "tr:even"    
[9] "tr:odd"     "p:first-child"   
[11] "p:first-of-type"  "p:last-child"   
[13] "p:last-of-type"   "li:nth-child(1)"  
[15] "li:nth-last-child(1)" "li:nth-of-type(2)"  
[17] "li:nth-last-of-type(2)" "b:only-child"   
[19] "h3:only-of-type"  "div > p"    
[21] "div p"     "ul + h3"    
[23] "ul ~ table"    "ul li:eq(0)"   
[25] "ul li:gt(0)"   "ul li:lt(2)"   
[27] ":header"    ":header:not(h1)"  
[29] ":animated"    ":focus"     
[31] ":contains(Duck)"  "div:has(p)"    
[33] ":empty"     ":parent"    
[35] "p:hidden"    "table:visible"   
[37] ":root"     "p:lang(it)"    
[39] "[id]"     "[id=my-Address]"  
[41] "p[id!=my-Address]"  "[id$=ess]"    
[43] "[id|=my]"    "[id^=L]"    
[45] "[title~=beautiful]"  "[id*=s]"    
[47] ":input"     ":text"     
[49] ":password"    ":radio"     
[51] ":checkbox"    ":submit"    
[53] ":reset"     ":button"    
[55] ":image"     ":file"     
[57] ":enabled"    ":disabled"    
[59] ":selected"    ":checked"    
[61] "*" 
+0

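Thanks! I had not heard of Selenium before! But I get an error: `Error in function (type, msg, asError = TRUE) : couldn't connect to host`. Maybe it is because Firefox is not my default browser? – agstudy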

+0

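Are you running the server? You need to run `RSelenium::checkForServer()` and `RSelenium::startServer()`. I commented these lines out because many people, myself included, are not comfortable downloading and running external binaries from R. checkForServer downloads the binary from http://code.google.com/p/selenium/ and startServer runs it. If you do not want to use the package's built-in commands, you can go to that page yourself, download the server, and make sure it is running. – jdharrison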

+0

Python can run Selenium as well, and I believe it is officially supported, so if you are comfortable with Python that would be a good option. – jdharrison
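A Python version of the executeScript approach might look roughly like the sketch below. It assumes the third-party `selenium` package and a local Firefox installation, and that the page's `w3Sels` array is populated by the time the page has loaded. The helper names `fetch_selectors` and `main` are made up for this illustration. `driver.get` and `driver.execute_script` are the standard WebDriver calls.

```python
def fetch_selectors(driver, url="http://www.w3schools.com/jquery/trysel.asp"):
    """Load the page in a browser session and return the page's w3Sels array."""
    driver.get(url)  # the browser loads the page and runs its JavaScript
    return driver.execute_script("return w3Sels;")


def main():
    # Imported here so fetch_selectors can be reused with any WebDriver-like
    # object; needs the third-party `selenium` package and Firefox installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        for selector in fetch_selectors(driver):
            print(selector)
    finally:
        driver.quit()  # always close the browser session
```

Calling `main()` opens a Firefox window, so it is left for the reader to invoke. Unlike the RSelenium setup above, the Python bindings drive a local browser directly and do not need a separately started Selenium server for this use.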

0

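Thanks to jdharrison's comment, I parsed the JavaScript code to extract all the selectors. As Peyton mentioned, this works in this particular case only because all the selectors are in the code.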

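# dump the 6th <script> node of the parsed page to a file
capture.output(xpathSApply(doc,'//*/script')[[6]],
       file='test.js')
ll <- readLines('test.js')
# keep only the lines that push a selector onto w3Sels
ll <- ll[grepl('w3Sels.push',ll)]
# extract the double-quoted argument between the parentheses; matching the
# quoted string keeps selectors such as li:nth-child(1) from being cut at the first ')'
ll <- unlist(regmatches(ll, gregexpr("(?<=\\()\"[^\"]*\"(?=\\))", ll, perl=TRUE)))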

cat(head(ll)) 
"#Lastname" ".intro" ".intro, #Lastname" "h1" "h1, p" "p:first" 
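Since Python solutions are also welcome, the same idea can be sketched in Python with the standard `re` module. The `JS_SOURCE` string below is a hypothetical excerpt standing in for the page's inline script, and `extract_selectors` is a name chosen for this example. The pattern assumes each selector is passed to `w3Sels.push` as a double-quoted string that contains no double quotes itself.

```python
import re

# Hypothetical excerpt of the page's inline script, used only for illustration.
JS_SOURCE = '''
var w3Sels = [];
w3Sels.push("#Lastname");
w3Sels.push(".intro, #Lastname");
w3Sels.push("li:nth-child(1)");
w3SelDescriptions.push('The element with id="Lastname"');
'''

# Capture the double-quoted argument of each w3Sels.push(...) call.
PUSH_RE = re.compile(r'w3Sels\.push\(\s*"([^"]*)"\s*\)')


def extract_selectors(js_source):
    """Return every selector string pushed onto w3Sels, in source order."""
    return PUSH_RE.findall(js_source)


selectors = extract_selectors(JS_SOURCE)
print(selectors)
# ['#Lastname', '.intro, #Lastname', 'li:nth-child(1)']
```

Anchoring the pattern on `w3Sels.push` skips the `w3SelDescriptions.push` calls, and capturing the quoted string rather than everything up to the first `)` keeps selectors that contain parentheses intact.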