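UTF-8 encoding problem in R

I am trying to parse Senate statements from the Mexican Senate, but I am having trouble with the UTF-8 encoding of the web pages.

This page comes through cleanly: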

library(rvest) 
Senate<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/19675-version-estenografica-de-la-reunion-ordinaria-de-las-comisiones-unidas-de-puntos-constitucionales-de-anticorrupcion-y-participacion-ciudadana-y-de-estudios-legislativos-segunda.html")

Here is a sample of the page text:

"CONTINÚA EL SENADOR CORRAL JURADO: Nosotros decimos. Entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora. Una decisión de pre dictamen a lo mejor lo único que va a hacer es complicar más las cosas."

As you can see, both the accents and the "ñ" come through fine.

The problem occurs with some other HTML pages (on the same domain!). For example:

Senate2<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html")

I get:

"-EL C. DIPUTADO ADAME ALEMÃƒÂN: En consecuencia estÃƒÂ¡ a discusiÃƒÂ³n la propuesta. Y para hablar sobre este asunto, se le concede el uso de la palabra a la senadoraÃ¢Â€Â¦Ã¢Â€Â¦.."

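For this second page I have tried iconv() and forcing the encoding parameter of html() to encoding = "UTF-8", but I keep getting the same result.

I also checked the page encoding with the W3 Validator; it appears to be UTF-8, with no issues.

gsub() does not look like a workable fix, because different characters are downloaded with the same garbled "code":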

í - ÃƒÂ 
á - ÃƒÂ 
ó - ÃƒÂ
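
The garbling shown above matches what you get when UTF-8 bytes are misread as Windows-1252 and re-encoded, twice over. The sketch below illustrates that assumption; it is not a confirmed diagnosis of the Senate page. `misread_once` and `undo_once` are hypothetical helper names, and "CP1252" is assumed to be an encoding name the local iconv accepts. Under this assumption the final byte of each sequence can map to an invisible or undefined Windows-1252 character (for í it is 0xAD, a soft hyphen), which may be why different letters appear to share one code.

```r
# Hypothetical helper: treat the UTF-8 bytes of `s` as CP1252 and re-encode to UTF-8.
misread_once <- function(s) iconv(enc2utf8(s), from = "CP1252", to = "UTF-8")

# Hypothetical helper: reverse one such step by recovering the original bytes
# and declaring them to be UTF-8 (which is what they really were).
undo_once <- function(s) {
  bytes <- iconv(s, from = "UTF-8", to = "CP1252")
  Encoding(bytes) <- "UTF-8"
  bytes
}

original <- "\u00e1"                              # "á"
garbled  <- misread_once(misread_once(original))  # double mis-decoding
fixed    <- undo_once(undo_once(garbled))         # reverse both steps
```

The reversal only works for characters whose intermediate bytes are defined in Windows-1252; for the others iconv() may return NA, so treat this as a diagnostic aid rather than a general repair.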

I am pretty much fresh out of ideas.

> sessionInfo() 
R version 3.1.2 (2014-10-31) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 

locale: 
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 
[4] LC_NUMERIC=C       LC_TIME=English_United States.1252  

attached base packages: 
[1] grDevices utils  datasets graphics stats  grid  methods base  

other attached packages: 
[1] stringi_0.4-1 magrittr_1.5  selectr_0.2-3 rvest_0.2.0  ggplot2_1.0.0 geosphere_1.3-11 fields_7.1  
[8] maps_2.3-9  spam_1.0-1  sp_1.0-17  SOAR_0.99-11  data.table_1.9.4 reshape2_1.4.1 xlsx_0.5.7  
[15] xlsxjars_0.6.1 rJava_0.9-6  

loaded via a namespace (and not attached): 
[1] bitops_1.0-6  chron_2.3-45  colorspace_1.2-4 digest_0.6.8  evaluate_0.5.5 formatR_1.0  gtable_0.1.2  
[8] httr_0.6.1  knitr_1.8  lattice_0.20-29 MASS_7.3-35  munsell_0.4.2 plotly_0.5.17 plyr_1.8.1  
[15] proto_0.3-10  Rcpp_0.11.3  RCurl_1.95-4.5 RJSONIO_1.3-0 scales_0.2.4  stringr_0.6.2 tools_3.1.2  
[22] XML_3.98-1.1

UPDATE: This seems to be the issue:

stri_enc_mark(Senate2) 
[1] "ASCII" "latin1" "latin1" "ASCII" "ASCII" "latin1" "ASCII" "ASCII" "latin1"

...and so on. Evidently the problem lies in the strings marked latin1:

stri_enc_isutf8(texto2) 
    [1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE

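How can I force the latin1 strings into correct UTF-8? When stringi "translates" them it seems to get it wrong, giving me the problems described above.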


2015-03-31 eflores89

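@Pascal: "I have tried iconv() and forcing the encoding parameter of html() to encoding = "UTF-8", but I keep getting the same result." – eflores89 2015-04-01 06:46:16

Encoding is one of the 21st century's bigger headaches. But here is a solution for you: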

# Set-up remote reading connection, specifying UTF-8 as encoding. 
addr <- "http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html" 
read.html.con <- file(description = addr, encoding = "UTF-8", open = "rt") 

# Read in cycles of 1000 characters 
html.text <- c() 
i = 0 
while(length(html.text) == i) { 
    html.text <- append(html.text, readChar(con = read.html.con,nchars = 1000)) 
    cat(i <- i + 1) 
} 

# close reading connection 
close(read.html.con) 

# Paste everything back together & at the same time, convert from UTF-8 
# to... UTF-8 with iconv(). I know. It's crazy. Encodings are secretly 
# meant to drive us insane. 
content <- paste0(iconv(html.text, from="UTF-8", to = "UTF-8"), collapse="") 

# Set-up local writing 
outpath <- "~/htmlfile.html" 

# Create file connection specifying "UTF-8" as encoding, once more 
# (Although this one makes sense) 
write.html.con <- file(description = outpath, open = "w", encoding = "UTF-8") 

# Use capture.output to dump everything back into the html file 
# Using cat inside it will prevent having [1]'s, quotes and such parasites 
capture.output(cat(content), file = write.html.con) 

# Close the output connection 
close(write.html.con)

You can then open the newly created file in your favorite browser. You should see it intact, ready to be reopened with the tool of your choice!
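
For comparison, here is a shorter sketch of the same read-then-rewrite idea. It is a different route from the code above: rather than re-encoding through a connection, it reads the bytes untouched and only declares them as UTF-8. A temporary file stands in for the downloaded page; the sample text and the variable names `src`, `page`, and `out` are assumptions made for illustration.

```r
# Stand-in for the downloaded page: a temporary file containing UTF-8 bytes.
src <- tempfile(fileext = ".html")
writeBin(charToRaw(enc2utf8("<p>est\u00e1 a discusi\u00f3n</p>\n")), src)

# Read the lines without conversion and mark non-ASCII strings as UTF-8.
page <- readLines(src, encoding = "UTF-8", warn = FALSE)

# Write the lines back byte-for-byte, so nothing is re-encoded on output.
out <- tempfile(fileext = ".html")
con <- file(out, open = "wb")
writeLines(page, con, useBytes = TRUE)
close(con)
```

Whether this is enough for a given page depends on the bytes the server actually sends; if they are already garbled at the source, declaring them UTF-8 will not repair them.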


2015-04-01 11:54:44

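I really don't know why, but it works like a charm! Thank you! – eflores89 2015-04-01 15:20:05

I wish I could tell you that I do, but honestly, I have no idea what that iconv twist does. It works, and from there on I'm happy. :D – 2015-04-01 15:22:56

I think I have an idea about Dominic's twist. See a related topic answered by Hadley.

Your problem is almost certainly a UTF-8 file with a BOM (byte order mark). BOM handling was introduced in R 3.0.0, and many packages do not deal with it. The usual workaround has been to save the content to a text file, open it in a program that handles the BOM (such as Windows Notepad or OpenOffice Calc), save it again, and then reopen it in R. It is a dirty trick, but it can now be reproduced in R itself, because the base R read.table/read.csv family can handle the BOM explicitly: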

read.csv(..., fileEncoding = "UTF-8-BOM")

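I think Dominic's trick is related to this. Some say UTF-8-BOM is a legacy issue that will go away, but I don't think so, so it would be very good to have a more explicit way of dealing with it.

You can always check whether the UTF-8 text is garbled by opening the file in OpenOffice Calc or in Notepad on Windows, or by reading the data back in after a write.csv/read.csv round trip (or any other text write/read functions).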
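
The BOM check can also be done from R without opening another program. The sketch below assumes a small, made-up CSV written to a temporary file; the names `bom`, `csv_path`, and `has_bom` are illustrative only. It looks for the three-byte UTF-8 BOM (EF BB BF) at the start of the file and picks the fileEncoding accordingly.

```r
# The three bytes that make up a UTF-8 byte order mark.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))

# Made-up example file: a BOM followed by a tiny CSV.
csv_path <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("id,name\n1,a\n")), csv_path)

# Inspect the first three bytes to see whether the file starts with a BOM.
has_bom <- identical(readBin(csv_path, what = "raw", n = 3L), bom)

# Let read.csv strip the BOM when one is present.
dat <- read.csv(csv_path, fileEncoding = if (has_bom) "UTF-8-BOM" else "UTF-8")
names(dat)
```

Without the "UTF-8-BOM" setting, the BOM tends to end up glued to the first column name, which is an easy symptom to look for.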


2016-10-08 08:27:48
