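I finally found a way to solve this problem. Here is my use case and what I tried.
These strings all come from Wikipedia, scraped with rvest, so the data itself should be fine. Every one of them contains %, but not all of them cause problems.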
#problem strings
problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree",
"J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan",
"Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova",
"Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique",
"Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien",
"Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor",
"Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd",
"Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal",
"Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale",
"No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz"
)
First, try the base R solution. For some reason it is not vectorized, so we use purrr:
#utils::URLdecode
problem_strs %>% purrr::map_chr(utils::URLdecode)
[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker" "Vicco_von_Bülow"
[5] "Bülent_Ceylan" "Seán_Cullen" "Chris_D'Elia" "Uğur_Rıfat_Karlova"
[9] "Mike_Krüger" "Andrés_López_Forero" "Mo'Nique" "José_Sánchez_Mota"
[13] "Dara_Ã“_Briain" "Conan_O'Brien" "Mike_O'Brien_(actor)" "Carroll_O'Connor"
[17] "Donald_O'Connor" "Rosie_O'Donnell" "Michael_O'Donoghue" "Chris_O'Dowd"
[21] "Ardal_O'Hanlon" "Catherine_O'Hara" "Patrice_O'Neal" "Barunka_O'Shaughnessy"
[25] "Raven-Symoné" "Charles_\"Chic\"_Sale" "Noël_Wells" "\"Weird_Al\"_Yankovic"
[29] "Cem_Yılmaz"
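If you would rather not pull in purrr just for this step, the same element-wise call can be sketched in base R with vapply. The helper name decode_each is made up for illustration, and the sample strings are limited to ASCII-only escapes (%27) so that the locale does not influence the result:

```r
# Hypothetical helper: call utils::URLdecode once per element and
# collect the results into a plain (unnamed) character vector.
decode_each <- function(x) {
  vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

# Only ASCII escapes here, so no encoding issue can show up yet.
ascii_only <- c("Chris_D%27Elia", "Mo%27Nique", "Conan_O%27Brien")
decoded <- decode_each(ascii_only)
print(decoded)
```

This only replaces the purrr::map_chr call; it does nothing about the encoding problem itself.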
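Comparing these with the original strings, a pattern shows up: the ones with two consecutive % escapes (such as %C3%BC) are the ones that cause problems.
Next attempt, same idea as before: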
#urltools::url_decode
urltools::url_decode(problem_strs)
Same result. So I read through all the questions about URL decoding in R and found these suggested solutions.
What is the encoding? Try setting it to UTF-8:
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> #try to set
> Encoding(problem_strs) = "UTF-8"
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(problem_strs) = "utf8"
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
> urltools::url_decode(problem_strs)
Same output as before.
Someone suggested another way to check and set it:
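A likely reason Encoding() keeps reporting "unknown" here: the still-escaped strings are pure ASCII, and R does not mark ASCII strings with a declared encoding. A minimal sketch (the variable names are just for illustration):

```r
# The percent-escaped form contains only ASCII characters.
escaped <- "J%C3%BCrgen_Becker"

# Declaring an encoding on an ASCII-only string has no visible effect.
Encoding(escaped) <- "UTF-8"
enc_after <- Encoding(escaped)
print(enc_after)

# The string itself is unchanged by the assignment.
unchanged <- identical(escaped, "J%C3%BCrgen_Becker")
print(unchanged)
```

So setting the encoding before decoding cannot help; the mark only becomes meaningful once the string holds non-ASCII bytes.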
> problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8")
> Encoding(problem_strs)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[25] "unknown" "unknown" "unknown" "unknown" "unknown"
I also came across another package:
> #Ruchardet to detect?
> Ruchardet::detectEncoding(problem_strs)
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
#Is it simpler than we thought?
urltools::url_decode(problem_strs) %>% urltools::url_decode()
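Same output.
So I searched for one specific pattern that causes problems, such as %C3%BC, and found a half-supplied answer for PHP:
First you need to urldecode it; what that gives you is the UTF-8 encoded representation of ü, so you should be all good.
OK, let's try that in R: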
#url decode, then set utf
halfway = urltools::url_decode(problem_strs)
Encoding(halfway) = "UTF-8"
halfway
[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker" "Vicco_von_Bülow"
[5] "Bülent_Ceylan" "Seán_Cullen" "Chris_D'Elia" "Uğur_Rıfat_Karlova"
[9] "Mike_Krüger" "Andrés_López_Forero" "Mo'Nique" "José_Sánchez_Mota"
[13] "Dara_Ó_Briain" "Conan_O'Brien" "Mike_O'Brien_(actor)" "Carroll_O'Connor"
[17] "Donald_O'Connor" "Rosie_O'Donnell" "Michael_O'Donoghue" "Chris_O'Dowd"
[21] "Ardal_O'Hanlon" "Catherine_O'Hara" "Patrice_O'Neal" "Barunka_O'Shaughnessy"
[25] "Raven-Symoné" "Charles_\"Chic\"_Sale" "Noël_Wells" "\"Weird_Al\"_Yankovic"
[29] "Cem_Yılmaz"
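To see why marking the decoded string as UTF-8 is enough, it may help to look at the bytes. The sketch below builds the two bytes that %C3%BC stands for by hand (no URL decoding involved) and checks that, once declared as UTF-8, they form the single character ü (code point 252, U+00FC):

```r
# %C3%BC is the percent-encoded form of the two bytes 0xC3 0xBC.
bytes <- as.raw(c(0xC3, 0xBC))

# rawToChar() just glues the bytes into a string.
s <- rawToChar(bytes)

# Declare that these bytes are UTF-8; the bytes themselves do not change.
Encoding(s) <- "UTF-8"

code_point <- utf8ToInt(s)   # Unicode code point of the single character
n_chars <- nchar(s)          # number of characters, not bytes
print(code_point)
print(n_chars)
```

The decoder already produced the right bytes; the only thing missing was telling R how to interpret them.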
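Here is a reusable function:
url_decode_utf = function(x) {
  # decode the percent-escapes; the result holds UTF-8 bytes
  y = urltools::url_decode(x)
  # declare those bytes as UTF-8 so R displays them correctly
  Encoding(y) = "UTF-8"
  y
}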
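For setups where urltools is not installed, the same two steps can be sketched with base R only. This is an assumption-laden variant, not part of the original answer: the name url_decode_utf_base is invented here, and it relies on utils::URLdecode returning the raw decoded bytes, applied per element via vapply:

```r
# Hypothetical base-R variant: decode each element, then declare UTF-8.
url_decode_utf_base <- function(x) {
  y <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  Encoding(y) <- "UTF-8"   # mark the decoded bytes as UTF-8
  y
}

out <- url_decode_utf_base(c("J%C3%BCrgen_Becker", "Chris_D%27Elia"))
print(out)

# Sanity check: every decoded string should be valid UTF-8.
all_valid <- all(validUTF8(out))
print(all_valid)
```

The validUTF8() check is optional, but it is a cheap way to catch input that was not UTF-8 encoded before it was percent-escaped.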
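I don't know the R language, but decoding/encoding functions usually let you pass an encoding, e.g. UrlDecode(Byte[], Encoding). Have you checked the documentation for URLdecode()? – Alesanco
I checked the documentation for ?URLdecode and ?curlUnescape (which does the same thing), and these functions don't seem to take any additional arguments. – swolf
The problem is not 'URLdecode' itself but the default encoding (and locale) in use: you have to find a way to set it to UTF-8 (which I believe is the default on every platform except Windows). –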