Web scraping: replacing tags manually

I'm experimenting with "rvest". Fetching the data with "read_html" works fine.

library(rvest) 
# suppressMessages(library(dplyr)) 
library(stringr) 
library(XML) 

# get house data 
houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/" 
house <- read_html(houseurl) 
house

I'm running into some problems when manipulating the data. The comments in the code describe them.

## eliminating <br> tags in the address 
# running the following commands causes an error in "html_nodes" later on 
str_extract_all(house,"<br>") ## show all line breaks 
# replace each <br> with a whitespace " " 
house <- str_replace_all(house,"<br>", " ")

Now I read out the details, but that does not seem to work:

houseattribut <- house %>% 
html_nodes(css = "div.col-2 li p.data-left") %>% 
html_text(trim=TRUE) 
# shows "Error in UseMethod("xml_find_all") : ... " 
# but all attributes are shown on screen 
houseattribut

Without replacing the <br> tags manually it works, but "html_text" then runs the strings together:

housedetails <- house %>% 
html_nodes(css = "div.col-2 li p.data-right") %>% 
html_text() 
housedetails 
# the same error shows "Error in UseMethod("xml_find_all") : ... " 
# but all details are shown on screen 

housedetails[4] 
# in the source there is: "Ellwürder Straße 17<br>26954 Nordenham" 
# the <br> tag should become a whitespace

Any hints as to what I'm doing wrong?

2017-01-23 wattnwurm

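The problem is that after read_html, house is an xml_document. Once you run str_replace_all on it, it becomes a character vector (chr), so when you try to select nodes again it is no longer an xml_document and html_nodes gives you the error.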

You need to convert it back into an xml_document, or apply the replacement to the nodes instead.

Something like:

house <- read_html(str_replace_all(house,"<br>", " "))
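The class change described above can be checked on a small inline document. This is only a sketch: the HTML snippet and the address in it are made-up stand-ins for the real page, and the explicit as.character() call is an assumption added to make the conversion visible.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the real page, so the example runs offline
html <- "<html><body><p class='data-right'>Example Street 17<br>12345 Sampletown</p></body></html>"

doc <- read_html(html)
class(doc)   # includes "xml_document"

# Serialise the document to a string and replace the tags in that string
txt <- str_replace_all(as.character(doc), "<br>", " ")
class(txt)   # "character": html_nodes() can no longer be used on this

# Parsing the string again gives back an xml_document
doc2 <- read_html(txt)
address <- html_text(html_nodes(doc2, css = "p.data-right"))
address
```

Checking class() after each step is a quick way to see at which point the object stops being something html_nodes can work with.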

Full code:

library(rvest) 
#> Loading required package: xml2 
library(stringr) 

houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/" 
house <- read_html(houseurl) 

house <- read_html(str_replace_all(house,"<br>", " ")) 

housedetails <- house %>% 
    html_nodes(css = "div.col-2 li p.data-right") %>% 
    html_text() 

housedetails[4] 
#> [1] "Ellwürder Straße 17 26954 Nordenham"
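The other option mentioned above, working on the nodes instead of on the string, could look roughly like the following. It is a sketch of one possible approach, not the code from this answer: it collects the text nodes of each paragraph and joins them with a space, so the document never has to be converted to a string and parsed again. The HTML snippet is again a made-up stand-in that only mimics the structure the CSS selector expects.

```r
library(rvest)

# Hypothetical stand-in for the real page, so the example runs offline
html <- '<div class="col-2"><ul><li>
  <p class="data-left">Address</p>
  <p class="data-right">Example Street 17<br>12345 Sampletown</p>
</li></ul></div>'
doc <- read_html(html)

nodes <- html_nodes(doc, css = "div.col-2 li p.data-right")

# For each <p>, take its text nodes one by one and join them with a space,
# so the text before and after a <br> ends up separated by a blank
details <- vapply(nodes, function(p) {
  parts <- html_text(html_nodes(p, xpath = ".//text()"), trim = TRUE)
  paste(parts[nzchar(parts)], collapse = " ")
}, character(1))

details
```

Because the document stays an xml_document throughout, further html_nodes calls on doc keep working afterwards.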

2017-01-23 19:25:44

Thank you, that's what I was looking for. – wattnwurm
