2013-07-30
2

This should be a simple one: URL-decoding characters that are encoded by more than one % escape.

Let's assume I have this string in R:

a <- "%C3%B6sterlich"

which stands for:

österlich (German, roughly 'Easter-like')

However, if I call URLdecode(a), I get:

[1] "österlich"

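This makes sense, since %C3 is Ã and %B6 is ¶ when each escape is read as a single-byte (Latin-1) character. But as you can see here: http://www.backbone.se/urlencodingUTF8.htm , %C3%B6 stands for ö in UTF-8 encoding.

Now the question: how can I tell URLdecode() to use the UTF-8 table?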
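To make the two readings concrete, here is a minimal base R sketch (the variable names are only illustrative). It takes the two bytes behind %C3 and %B6 and marks them once as UTF-8 and once as Latin-1:

```r
# The escapes %C3 and %B6 are simply two raw bytes.
bytes <- as.raw(c(0xC3, 0xB6))

# Marked as UTF-8, the two bytes form one character (U+00F6).
as_utf8 <- rawToChar(bytes)
Encoding(as_utf8) <- "UTF-8"
n_utf8 <- nchar(as_utf8, type = "chars")
code_point <- utf8ToInt(as_utf8)

# Marked as Latin-1, each byte is a character of its own.
as_latin1 <- rawToChar(bytes)
Encoding(as_latin1) <- "latin1"
n_latin1 <- nchar(as_latin1, type = "chars")

print(c(utf8 = n_utf8, latin1 = n_latin1))
```

The bytes are identical in both cases; only the declared encoding changes how they are interpreted.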

+0

I don't know the R language, but decode/encode functions usually let you pass an encoding, e.g. UrlDecode(Byte[], Encoding). Have you checked the documentation of URLdecode()? – Alesanco

+0

I checked the documentation with ?URLdecode and ?curlUnescape (which behaves the same way); neither function seems to take any additional arguments. – swolf

+1

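The problem is not `URLdecode` but the default encoding (and locale) in use: you have to find a way to set it to UTF-8 (which I believe is the default on all platforms except Windows). –

Answers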

3

Try this:

> Encoding(a) <- "UTF-8" 

Or use the iconv function:
http://stat.ethz.ch/R-manual/R-devel/library/base/html/iconv.html
http://astrostatistics.psu.edu/datasets/2006tutorial/html/utils/html/iconv.html

Hope it helps ^_^

+0

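Hey Alesanco, thanks for your answer. However, my R behaves strangely (on Windows): `a <- "%C3%B6sterlich"` `Encoding(a)` `[1] "unknown"` `Encoding(a) <- "UTF-8"` `Encoding(a)` `[1] "unknown"` It seems this doesn't change anything at all. Same with `a <- iconv(a, to = "UTF-8")` – swolf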

+0

And iconv(a, from = "ASCII", to = "UTF-8")? – Alesanco

+0

[Here](http://developer.r-project.org/Encodings_and_R.html) is some information, at the end of the page, that may be useful on Windows... Hope it helps! – Alesanco

1

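I finally found a solution to this problem. Here is my use case and what I tried.

These strings were all scraped from Wikipedia with rvest, so there should be nothing wrong with them. All of them contain %, but not all of them cause problems.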

#problem strings 
problem_strs = c("Roscoe_%22Fatty%22_Arbuckle", "Michael_%22Atters%22_Attree", 
    "J%C3%BCrgen_Becker", "Vicco_von_B%C3%BClow", "B%C3%BClent_Ceylan", 
    "Se%C3%A1n_Cullen", "Chris_D%27Elia", "U%C4%9Fur_R%C4%B1fat_Karlova", 
    "Mike_Kr%C3%BCger", "Andr%C3%A9s_L%C3%B3pez_Forero", "Mo%27Nique", 
    "Jos%C3%A9_S%C3%A1nchez_Mota", "Dara_%C3%93_Briain", "Conan_O%27Brien", 
    "Mike_O%27Brien_(actor)", "Carroll_O%27Connor", "Donald_O%27Connor", 
    "Rosie_O%27Donnell", "Michael_O%27Donoghue", "Chris_O%27Dowd", 
    "Ardal_O%27Hanlon", "Catherine_O%27Hara", "Patrice_O%27Neal", 
    "Barunka_O%27Shaughnessy", "Raven-Symon%C3%A9", "Charles_%22Chic%22_Sale", 
    "No%C3%ABl_Wells", "%22Weird_Al%22_Yankovic", "Cem_Y%C4%B1lmaz" 
) 

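First, try the base R solution. It is not vectorized for some reason, so we use purrr:

#utils::URLdecode 
library(magrittr) # provides the %>% pipe used below
problem_strs %>% purrr::map_chr(utils::URLdecode) 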

[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker"   "Vicco_von_Bülow"   
[5] "Bülent_Ceylan"   "Seán_Cullen"    "Chris_D'Elia"    "Uğur_Rıfat_Karlova"  
[9] "Mike_Krüger"    "Andrés_López_Forero"  "Mo'Nique"     "José_Sánchez_Mota"  
[13] "Dara_Ã“_Briain"   "Conan_O'Brien"    "Mike_O'Brien_(actor)"  "Carroll_O'Connor"   
[17] "Donald_O'Connor"   "Rosie_O'Donnell"   "Michael_O'Donoghue"  "Chris_O'Dowd"    
[21] "Ardal_O'Hanlon"   "Catherine_O'Hara"   "Patrice_O'Neal"   "Barunka_O'Shaughnessy"  
[25] "Raven-Symoné"    "Charles_\"Chic\"_Sale"  "Noël_Wells"    "\"Weird_Al\"_Yankovic"  
[29] "Cem_Yılmaz" 
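As an aside, the element-wise application does not strictly need purrr. A minimal sketch in base R, assuming only that `utils::URLdecode` handles one string at a time (the sample vector here is a shortened stand-in for `problem_strs`):

```r
# Shortened stand-in for problem_strs; these entries only use single % escapes.
sample_strs <- c("Chris_D%27Elia", "Roscoe_%22Fatty%22_Arbuckle")

# vapply calls URLdecode once per element and checks that each result
# is a single string; USE.NAMES = FALSE drops the input strings as names.
decoded <- vapply(sample_strs, utils::URLdecode, character(1), USE.NAMES = FALSE)
print(decoded)
```

This only replaces the looping; it shows the same garbled output as above for the multi-byte cases.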

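If we compare these with the inputs, we can see the pattern: the strings with two % escapes in a row (one multi-byte UTF-8 character) cause problems. Next, try urltools: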

#urltools::url_decode 
urltools::url_decode(problem_strs) 

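Same results as before. So I read all the questions related to URL decoding in R and found these suggested solutions.

What is the encoding? Try setting it to UTF-8: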

> Encoding(problem_strs) 
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[25] "unknown" "unknown" "unknown" "unknown" "unknown" 
> #try to set 
> Encoding(problem_strs) = "UTF-8" 
> Encoding(problem_strs) 
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[25] "unknown" "unknown" "unknown" "unknown" "unknown" 
> Encoding(problem_strs) = "utf8" 
> Encoding(problem_strs) 
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[25] "unknown" "unknown" "unknown" "unknown" "unknown" 
> urltools::url_decode(problem_strs) 

Same output as before.

Someone suggested another way to check and set it:

> problem_strs = iconv(problem_strs, from = "ASCII", to = "UTF-8") 
> Encoding(problem_strs) 
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[13] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" 
[25] "unknown" "unknown" "unknown" "unknown" "unknown" 
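A likely reason both attempts above change nothing: the percent-encoded strings are pure ASCII, and R never marks ASCII-only strings with a declared encoding. A small sketch to check this on a single string (variable names are illustrative):

```r
x <- "J%C3%BCrgen_Becker"   # still percent-encoded, so ASCII only

# Setting the encoding mark is silently ignored for ASCII-only strings.
marked <- x
Encoding(marked) <- "UTF-8"
mark_after <- Encoding(marked)

# Converting ASCII to UTF-8 leaves the bytes untouched as well.
converted <- iconv(x, from = "ASCII", to = "UTF-8")
same_bytes <- identical(charToRaw(converted), charToRaw(x))

print(mark_after)
print(same_bytes)
```

So the encoding mark only becomes meaningful once the string actually contains non-ASCII bytes, that is, after decoding.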

And I found another package on the list:

> #Ruchardet to detect? 
> Ruchardet::detectEncoding(problem_strs) 
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" 

#Is it simpler than we thought? 
urltools::url_decode(problem_strs) %>% urltools::url_decode() 

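Same output.

So I searched for a specific pattern that causes the problem, such as %C3%BC. It turns out there is a half-supplied answer here for PHP:

First you need to urldecode it, which will give you the UTF-8 encoded representation of ü, so you should be all good.

OK, let's try that in R: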

#url decode, then set utf 
halfway = urltools::url_decode(problem_strs) 
Encoding(halfway) = "UTF-8" 
halfway 
[1] "Roscoe_\"Fatty\"_Arbuckle" "Michael_\"Atters\"_Attree" "Jürgen_Becker"    "Vicco_von_Bülow"   
[5] "Bülent_Ceylan"    "Seán_Cullen"    "Chris_D'Elia"    "Uğur_Rıfat_Karlova"  
[9] "Mike_Krüger"    "Andrés_López_Forero"  "Mo'Nique"     "José_Sánchez_Mota"   
[13] "Dara_Ó_Briain"    "Conan_O'Brien"    "Mike_O'Brien_(actor)"  "Carroll_O'Connor"   
[17] "Donald_O'Connor"   "Rosie_O'Donnell"   "Michael_O'Donoghue"  "Chris_O'Dowd"    
[21] "Ardal_O'Hanlon"   "Catherine_O'Hara"   "Patrice_O'Neal"   "Barunka_O'Shaughnessy"  
[25] "Raven-Symoné"    "Charles_\"Chic\"_Sale"  "Noël_Wells"    "\"Weird_Al\"_Yankovic"  
[29] "Cem_Yılmaz"    

Here is a reusable function:

url_decode_utf = function(x) { 
    y = urltools::url_decode(x) 
    Encoding(y) = "UTF-8" 
    y 
}
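
If urltools is not available, the same idea can be sketched with base R only. The helper name `url_decode_utf8_base` is made up for this example, and it assumes that `utils::URLdecode` returns the decoded bytes without an encoding mark:

```r
# Hypothetical helper (the name is an assumption, not part of base R or urltools).
url_decode_utf8_base <- function(x) {
  # URLdecode works on one string at a time, so apply it element-wise.
  decoded <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # Declare the decoded bytes to be UTF-8; ASCII-only elements stay unmarked.
  Encoding(decoded) <- "UTF-8"
  decoded
}

res <- url_decode_utf8_base(c("J%C3%BCrgen_Becker", "Chris_D%27Elia"))
print(res)
```

As with the urltools version, this only gives sensible results when the escaped bytes really are UTF-8.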