Web scraping in R

I am practicing my web scraping code in R, and whatever website I try, I cannot get past one stage.

For example,

https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music

My goal is to extract the names of all 77 schools (from Oxford to London Metropolitan).

So, I tried...

library(rvest) 
url_college <- "https://www.thecompleteuniversityguide.co.uk/league-tables/rankings?s=Music" 
college <- read_html(url_college) 
info <- html_nodes(college, css = '.league-table-institution-name') 
info %>% html_nodes('.league-table-institution-name') %>% html_text()

Using F12, I can see that all the school names are in the class '.league-table-institution-name'... that is why I passed that name to html_nodes...

What am I doing wrong?

Source

2017-03-28 wjang4

While you're waiting for answers, you should probably read https://www.thecompleteuniversityguide.co.uk/terms-and-conditions/ – hrbrmstr

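You are currently running html_nodes() twice: first on college, an xml_document (which is correct), and then on info, which is already the set of nodes matched by the first call rather than a document (which is not correct).

Try this: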
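A quick way to see what each object is would be to check its class. The sketch below is only an illustration: it parses a small made-up HTML fragment (sample_html is hypothetical and stands in for the real page) so that it runs without touching the live site.

```r
library(rvest)

# Hypothetical stand-in for the real page; the markup is an assumption.
sample_html <- '<html><body><table>
  <tr><td class="league-table-institution-name">Oxford</td></tr>
  <tr><td class="league-table-institution-name">London Metropolitan</td></tr>
</table></body></html>'

college <- read_html(sample_html)
info <- html_nodes(college, css = ".league-table-institution-name")

print(class(college))   # the parsed document
print(class(info))      # the nodes matched by the selector
print(html_text(info))  # text can be pulled from the matched nodes directly
```

Since info already holds the matched nodes, html_text() can be applied to it without a second html_nodes() call.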

url_college %>% 
    read_html() %>% 
    html_nodes('.league-table-institution-name') %>% 
    html_text()

You will then need an extra step to clean up the school names; here is one suggestion:

# appended to the pipeline above; str_replace_all() needs library(stringr)
%>% 
    str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")
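Put together, the pipeline and the cleaning step might look like the sketch below. It runs on a small made-up HTML fragment (messy_html is hypothetical; the rank numbers and padding are assumptions about what the raw text could look like), so it does not depend on the live page.

```r
library(rvest)
library(stringr)

# Hypothetical fragment with padding and rank numbers around each name.
messy_html <- '<html><body>
  <a class="league-table-institution-name"> 1 Oxford </a>
  <a class="league-table-institution-name"> 77 London Metropolitan </a>
</body></html>'

school_names <- messy_html %>%
    read_html() %>%
    html_nodes(".league-table-institution-name") %>%
    html_text() %>%
    # drop any run of non-letters at the start or at the end of each string
    str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")

print(school_names)
```

Note that this pattern only keeps ASCII letters at the ends, so a name that legitimately starts or ends with a digit, bracket, or accented letter would be trimmed as well.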

Source

2017-03-28 22:57:56 neilfws

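I... but could you please explain why we need to use the OR operator | in str_replace_all("(^[^a-zA-Z]+)|([^a-zA-Z]+$)", "")? Doesn't it have to be an AND operator, since we are replacing both patterns with ""? – wjang4

Perhaps whoever suggested the edit can explain :) – neilfws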
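One way to look at the | question: in a regular expression, | means that a single match may be either alternative, and str_replace_all() replaces every match it finds. The leading run and the trailing run are two separate matches, so both are removed. The sketch below (raw_name, leading and trailing are hypothetical names used only for illustration) compares one pass using | with two separate passes.

```r
library(stringr)

raw_name <- " 77 London Metropolitan  "  # made-up input
leading  <- "^[^a-zA-Z]+"                # non-letters at the start
trailing <- "[^a-zA-Z]+$"                # non-letters at the end

# One pass: each match is either a leading run or a trailing run.
one_pass <- str_replace_all(
    raw_name, paste0("(", leading, ")|(", trailing, ")"), ""
)

# Two passes: remove the leading run, then the trailing run.
two_pass <- str_replace_all(str_replace_all(raw_name, leading, ""), trailing, "")

print(one_pass)
print(identical(one_pass, two_pass))
```

Both approaches give the same cleaned string here; the version with | simply does it in a single call.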
