Searching for Unicode values in a string

I am trying to identify the unique Unicode values in a data frame made up of strings. I have tried using the grep function, but I get the following error:

Error: '\U' used without hex digits in character string starting ""\U"

A sample data frame:

                     time sender message
1     2012-12-04 13:40:00      1 Hello handsome!
2     2012-12-04 13:40:08      1 \U0001f618
3     2012-12-04 14:39:24      1 \U0001f603
4     2012-12-04 16:04:25      2 <image omitted>
73    2012-12-05 06:02:17      1 Haha not white and blue... White with blue eyes \U0001f61c
40619 2015-05-08 10:00:58      1 \U0001f631\U0001f637

grep("\U", dat$messages)

Data

dat <- 
structure(list(time = c("2012-12-04 13:40:00", "2012-12-04 13:40:08", 
"2012-12-04 14:39:24", "2012-12-04 16:04:25", "2012-12-05 06:02:17", 
"2015-05-08 10:00:58"), sender = c(1L, 1L, 1L, 2L, 1L, 1L), message = c("Hello handsome!", 
"\U0001f618", "\U0001f603", "<image omitted>", "Haha not white and blue... White with blue eyes \U0001f61c", 
"\U0001f631\U0001f637")), .Names = c("time", "sender", "message" 
), class = "data.frame", row.names = c("1", "2", "3", "4", "73", 
"40619"))

Source

2015-06-12 Andrews

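I'm assuming that by "Unicode characters" you just mean non-ASCII characters. Character codes can mean different things depending on the encoding. R represents values outside the current encoding with a special \U sequence. Note that neither the backslash nor the letter "U" actually appears in the real data. That is just how the characters are escaped when printed to the screen if an appropriate glyph is not available.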

For example, even though the last message looks long, it is really only two characters long:

dat$message[6] 
# [1] "\U0001f631\U0001f637" 
nchar(dat$message[6]) 
# [1] 2
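
To make the point about escapes concrete, here is a minimal sketch that inspects the same two-character string in three ways. The variable name `msg` is only illustrative, and the byte count assumes the string is stored as UTF-8, which is how R marks string literals written with \U escapes.

```r
# `msg` is an illustrative name; the value is the last message from the question.
msg <- "\U0001f631\U0001f637"

nchar(msg)                        # characters: 2
nchar(msg, type = "bytes")        # bytes when stored as UTF-8: 8 (4 per character)
utf8ToInt(msg)                    # code points as integers: 128561 128567
sprintf("U+%X", utf8ToInt(msg))   # the same code points in hex: "U+1F631" "U+1F637"
```

The hex digits printed in the escape are the code point itself, so `\U0001f631` and `U+1F631` name the same character.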

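You can find non-ASCII characters fairly easily with a regular expression. ASCII characters all have codes 0-127 (000 to 177 in octal). You can find values outside that range with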

grep("[^\001-\177]", dat$message) 
# [1] 2 3 5 6
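
If a logical mask is more convenient than row indices, the same pattern can be passed to `grepl`. This is a small sketch on a made-up vector `msgs` (not the full `dat` from the question). The range starts at \001 rather than \000 because R strings cannot contain an embedded nul.

```r
# `msgs` is a hypothetical stand-in for dat$message.
msgs <- c("Hello handsome!", "\U0001f618", "<image omitted>", "blue eyes \U0001f61c")

# TRUE where a message contains at least one character outside \001-\177
has_non_ascii <- grepl("[^\001-\177]", msgs)
has_non_ascii
# [1] FALSE  TRUE FALSE  TRUE

# Keep only the messages that contain non-ASCII characters
msgs[has_non_ascii]
```

With the data frame from the question, the same mask could be used as `dat[grepl("[^\001-\177]", dat$message), ]`.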

Source

2015-06-12 02:38:01 MrFlick

Thanks, this works. How would I then use it to extract the individual non-ASCII characters in each row? – Andrews

To extract them you would want to use 'gregexpr' instead of 'grep'. For example: 'm <- gregexpr("[^\001-\177]", dat$message); regmatches(dat$message, m)' – MrFlick
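
The extraction described in the comment above can be carried one step further to list the unique non-ASCII characters, which is what the question asks for. This is only a sketch: `msgs`, `per_row`, `uniq`, and `codes` are illustrative names, and `msgs` is a small made-up vector rather than the real data.

```r
# `msgs` is a hypothetical stand-in for dat$message (note the repeated emoji).
msgs <- c("Hello handsome!", "\U0001f618", "eyes \U0001f61c",
          "\U0001f631\U0001f637", "\U0001f618")

# Locate every non-ASCII character, then pull the matches out per message
m <- gregexpr("[^\001-\177]", msgs)
per_row <- regmatches(msgs, m)   # list with one character vector per message

# Flatten and de-duplicate to get the unique characters across all messages
uniq <- unique(unlist(per_row))

# Label each unique character with its code point in hex
codes <- sprintf("U+%X", vapply(uniq, utf8ToInt, integer(1), USE.NAMES = FALSE))
codes
# [1] "U+1F618" "U+1F61C" "U+1F631" "U+1F637"
```

Messages with no match contribute an empty vector to `per_row`, so they drop out when the list is flattened with `unlist`.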

Try:

library(stringi) 
stri_enc_isascii(dat$message)

which gives:

# [1] TRUE FALSE FALSE TRUE FALSE FALSE
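
Since `stri_enc_isascii` returns TRUE for pure-ASCII strings, negating it flags the rows of interest. The sketch below also pulls out the characters themselves with `stri_extract_all_regex`. It assumes the stringi package is installed, and `msgs` and `chars` are illustrative names on a made-up vector.

```r
library(stringi)

# `msgs` is a hypothetical stand-in for dat$message.
msgs <- c("Hello handsome!", "\U0001f631\U0001f637")

# TRUE where a message contains at least one non-ASCII character
!stri_enc_isascii(msgs)
# [1] FALSE  TRUE

# Extract each character outside the ASCII range (0x00-0x7F);
# omit_no_match = TRUE returns an empty vector instead of NA when nothing matches
chars <- stri_extract_all_regex(msgs, "[^\\x00-\\x7F]", omit_no_match = TRUE)
lengths(chars)
# [1] 0 2
```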

Source

2015-06-12 02:44:03
