Rvest scraping error

Here is the rvest code I am running:

library(rvest) 

rootUri <- "https://github.com/rails/rails/pull/" 
PR <- as.list(c(100, 200, 300)) 
list <- paste0(rootUri, PR) 
messages <- lapply(list, function(l) { 
    html(l) 
})

Up to this point the code seems to work fine, but when I try to extract the text:

html_text(messages)

I get:

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
    Unknown input of class: list

Trying to extract one specific element:

html_text(messages[1])

does not work either...

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
    Unknown input of class: list

So I tried a different way:

html_text(messages[[1]])

This at least seems to get at the data, but it still fails:

Error in UseMethod("xmlValue") : 
    no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument',  'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"

How can I extract the text from each element of the list?

Source

2014-12-05 histelheim

Why don't you use the GitHub API? It has verbs for [pull requests](https://developer.github.com/v3/pulls/). – hrbrmstr 2014-12-05 18:38:07

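The GitHub API splits comments into several categories (issues, pull requests, commits), which means I would have to write a relatively complex query. On the web, all of them are collected on a single page. – histelheim 2014-12-05 19:21:08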

There are two problems with your code. Have a look at the package's examples to see how it is meant to be used.

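1. You cannot just use every function on everything.

html() downloads the content
html_node() selects node(s) from the downloaded page content
html_text() extracts the text from a previously selected node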

So, to download one of the pages and extract the text of an HTML node, use the following code:

library(rvest)

Old-school style:

url   <- "https://github.com/rails/rails/pull/100" 
url_content <- html(url) 
url_mainnode <- html_node(url_content, "*") 
url_mainnode_text <- html_text(url_mainnode) 
url_mainnode_text

...or this...

Hard-to-read old-school style:

url_mainnode_text <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*")) 
url_mainnode_text

...or this...

magrittr piping style:

url_mainnode_text <- 
    html("https://github.com/rails/rails/pull/100") %>% 
    html_node("*") %>% 
    html_text() 
url_mainnode_text

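2. When working with lists, you have to apply functions to each element, e.g. with lapply().

If you want to batch-process several URLs, you could try something like this: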

url_list <- c("https://github.com/rails/rails/pull/100",
              "https://github.com/rails/rails/pull/200",
              "https://github.com/rails/rails/pull/300")

# Download one page and return the text of the first node matching the selector
get_html_text <- function(url, css_or_xpath = "*") {
  html_text(
    html_node(
      html(url), css_or_xpath  # use the url argument rather than a fixed address
    )
  )
}

lapply(url_list, get_html_text, css_or_xpath = "a[class=message]")
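The two errors in the question come down to how R indexes lists, so here is a minimal base-R sketch of that behaviour. It assumes nothing about rvest: the `pages` list of plain strings is a made-up stand-in for the list of parsed documents.

```r
# Hypothetical stand-in for the list of parsed pages: plain strings.
pages <- list("page one", "page two", "page three")

# Single brackets keep the list wrapper, so the result is still a list.
single_bracket_class <- class(pages[1])

# Double brackets return the element itself.
double_bracket_class <- class(pages[[1]])

# A function written for one element has to be mapped over the list.
chars_per_page <- lapply(pages, nchar)    # returns a list
upper_pages    <- sapply(pages, toupper)  # simplifies to a character vector

print(single_bracket_class)
print(double_bracket_class)
print(upper_pages)
```

This is why `html_text(messages)` and `html_text(messages[1])` both complain about an input of class list: in each case the function receives a list rather than a single document.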

Source

2014-12-05 19:17:50 petermeissner

Can you help me? I am unable to extract these values. http://stackoverflow.com/questions/31423931/extract-data-from-raw-html-in-r – 2015-08-04 07:01:31

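You need to use html_nodes() and work out which CSS selectors correspond to the data you are interested in. For example, if we want to extract the usernames of the people discussing pull request 200: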

rootUri <- "https://github.com/rails/rails/pull/200" 
page <- html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text() 

[1] "jaw6"  "jaw6"  "josevalim"
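To try the difference between html_node() and html_nodes() without hitting the network, something like the following sketch may help. The inline HTML, the `#discussion` id and the user names are invented for illustration, and the sketch assumes a newer rvest release where read_html() is the parsing function (the code above uses the older html()).

```r
library(rvest)

# Hypothetical inline HTML standing in for a downloaded page.
snippet <- '<div id="discussion"><strong><a>alice</a></strong><strong><a>bob</a></strong></div>'

# Assumption: read_html() is available (newer rvest / xml2).
page <- read_html(snippet)

# html_node() returns only the first match ...
first_author <- page %>% html_node("#discussion strong a") %>% html_text()

# ... while html_nodes() returns every match.
all_authors <- page %>% html_nodes("#discussion strong a") %>% html_text()

print(first_author)
print(all_authors)
```

Whichever selector you settle on, checking it against a small snippet first makes it easier to see whether it matches one node or several.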

Source

2014-12-05 19:07:04 keegan
