RSelenium - how to scrape data from a web page that has no id or name attributes

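I am currently trying to scrape biodiversity data from a particular website (http://www.faunaeur.org/?no_redirect=1). I managed to get some results, but not as automatically as I would like... The first part is done, which is navigating through the website: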

Setting up RSelenium:

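library(RSelenium)
# Fetch geckodriver (v0.11.1, Windows 64-bit) and unpack it into the working directory
download.file("https://github.com/mozilla/geckodriver/releases/download/v0.11.1/geckodriver-v0.11.1-win64.zip", destfile = "./gecko.zip")
unzip("./gecko.zip", exdir = ".", overwrite = TRUE)
# Note: checkForServer() and startServer() are defunct in later RSelenium releases
checkForServer(update = TRUE)
selfserv <- startServer()
# Connect to the Selenium server and open a Firefox session through geckodriver
mybrowser1 <- remoteDriver(browserName = "firefox", extraCapabilities = list(marionette = TRUE))
mybrowser1$open()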

Then I start navigating (using the Balearic Islands as an example):

mybrowser1$navigate("http://www.faunaeur.org/distribution.php?current_form=species_list")
mybrowser1$findElement(using = "xpath", "//select[@name='taxon_rank']/option[@value='7']")$clickElement() # rank: Class
mybrowser1$findElement(using = "xpath", "//input[@name='taxon_name']")$sendKeysToElement(list('Oligochaeta')) # taxon: Oligochaeta
mybrowser1$findElement(using = "xpath", "//select[@name='region']/option[@value='15']")$clickElement() # region used in this example
mybrowser1$findElement(using = "xpath", "//input[@name='include_doubtful_presence']")$clickElement()
mybrowser1$findElement(using = "xpath", "//input[@name='submit2']")$clickElement() # submit the form

From this point I can download the XLS file of the 20 subspecies with:

mybrowser1$findElement(using = "xpath", "//a[@href='JavaScript:document.export_species_list.submit()']")$clickElement()

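But that is not what I want: I do not want to use a "click". Is it possible to download the file directly into my R environment from this JavaScript link, or to scrape the table of 20 subspecies straight from the page source with RSelenium?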

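I tried both approaches, but hit a dead end... The biggest problem is that the page is a temporary page or "results page", and I cannot find any @value, @id, @name or @class that corresponds to the table I need.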

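Any hint at a solution that would let me automate this through R? I need it in this form because the script has to be run by people who need to generate the results themselves. Thanks in advance!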

Source

2017-01-05 Gin Ette

Yes, you will need to set the appropriate Firefox options; see http://stackoverflow.com/questions/36574012/rselenium-setting-makefirefoxprofile-for-mac-os-x-to-download-files-without-ask. The xls file will then be downloaded to the directory you specify. – jdharrison

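I had indeed already looked at that. I was just wondering whether there was any other workable solution... Since you are the developer of RSelenium, jdharrison, I don't think I will get a better answer! Thanks –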
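As a rough illustration of the Firefox options mentioned in the comments above, the sketch below builds a preference list that tells Firefox to save matching downloads into a chosen folder without showing a dialog, and passes it to `makeFirefoxProfile()`. The helper names `download_prefs` and `open_download_browser` are hypothetical, and the MIME type `application/vnd.ms-excel` is an assumption about what the site sends for its XLS export; check the actual response header before relying on it. A running Selenium server is also assumed.

```r
# Hypothetical helper: Firefox preferences for dialog-free downloads.
# The default MIME type is an assumption about the site's XLS export.
download_prefs <- function(download_dir,
                           mime_types = "application/vnd.ms-excel") {
  list(
    browser.download.folderList = 2L,  # 2 = use the custom directory below
    browser.download.dir = normalizePath(download_dir, mustWork = FALSE),
    browser.download.manager.showWhenStarting = FALSE,
    # Comma-separated MIME types that are saved without asking
    browser.helperApps.neverAsk.saveToDisk = paste(mime_types, collapse = ",")
  )
}

# Hypothetical helper: open a Firefox session that uses those preferences.
# Requires the RSelenium package and a running Selenium server.
open_download_browser <- function(download_dir) {
  fprof <- RSelenium::makeFirefoxProfile(download_prefs(download_dir))
  drv <- RSelenium::remoteDriver(browserName = "firefox",
                                 extraCapabilities = fprof)
  drv$open()
  drv
}
```

With a browser opened this way, clicking the export link should write the file into `download_dir`, where the script can pick it up. This still relies on a click, so it only removes the save dialog, not the browser.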

If you just want the table that the website displays, this can be done without RSelenium, using httr as shown below:

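library(rvest)  # read_html(), html_table() and the %>% pipe
library(httr)   # POST()
# Submit the search form directly, with the same fields the browser would send
res <- POST("http://www.faunaeur.org/species_list.php",
            encode = "form",
            body = list(selected_regions = "15",
                        show_what = "species list",
                        referring_page = "distribution",
                        taxon_rank = "7",
                        taxon_name = "Oligochaeta",
                        region = "15",
                        include_doubtful_presence = "yes",
                        submit2 = "Display Species",
                        species_or_higher_taxa = "species"))
doc <- res %>% read_html
# The species table is the 9th <table> on the results page
dat <- doc %>% html_table(fill = TRUE) %>% .[[9]]
# Promote the first row to column names, then drop it
colnames(dat) <- unlist(dat[1, ])
dat <- dat[-1, ]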

This gives you:

            Family                    Species/subspecies
2  Acanthodrilidae   Microscolex dubius (Fletscher 1887)
3    Enchytraeidae  Enchytraeus buchholzi Vejdovsky 1878
4    Enchytraeidae Fridericia berninii Dozsa-Farkas 1988
5    Enchytraeidae        Fridericia caprensis Bell 1947
...
21        Naididae       Aulophorus furcatus (Oken 1815)
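Since the results table has no id, name or class, picking it by position (`.[[9]]`) can break if the page layout changes. One possible alternative, sketched here on a small made-up HTML snippet rather than the real page, is to select the table by the label in its first row. The snippet, the label "Family" as a selection key, and the variable names are assumptions for illustration only.

```r
library(rvest)

# Made-up stand-in for the results page: two tables without id, name or class
html <- '<html><body>
<table><tr><td>Navigation</td><td>Help</td></tr></table>
<table>
<tr><td>Family</td><td>Species/subspecies</td></tr>
<tr><td>Naididae</td><td>Aulophorus furcatus (Oken 1815)</td></tr>
<tr><td>Enchytraeidae</td><td>Fridericia caprensis Bell 1947</td></tr>
</table>
</body></html>'

doc <- read_html(html)
# Parse every table, keeping the first row as data so it can be inspected
tables <- html_table(doc, header = FALSE)

# Flag tables whose first row contains the expected column label
has_label <- vapply(
  tables,
  function(tb) nrow(tb) > 0 && "Family" %in% unlist(tb[1, ]),
  logical(1)
)

# Take the first match, promote its first row to column names, then drop it
dat <- as.data.frame(tables[[which(has_label)[1]]])
colnames(dat) <- as.character(unlist(dat[1, ], use.names = FALSE))
dat <- dat[-1, , drop = FALSE]
print(dat)
```

On the real page the same selection would be applied to the document returned by the POST request; whether the first-row label is unique among the page's nested layout tables would need to be checked there.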

Source

2017-01-05 17:25:22 Rentrop

This is great! My mistake was focusing on the RSelenium package; I did not realise this could be done with rvest & httr... Thanks FlooO! –
