2015-05-21

R programming strsplit(): undesired result

I want to split text as in Example 1 below.

Example 1:

> x <- "Split the words in a sentence." 
> strsplit(x, " ") 

[[1]] 
[1] "Split"  "the"  "words"  "in"  
[5] "a"   "sentence." 

So I tried to split NewString:

> NewString 
[1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 " 
> strsplit(NewString,' ') 
[[1]] 
[1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 " 

The text is not split. The strange part is that if I copy the printed output of NewString and paste it into strsplit(), it works:

>strsplit("s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 ",' ') 
[[1]] 
[1] "s14" "v13" "s13" "s13" "v12" "s12" "v11" "s11" "v10" "s10" "s10" "v09" "s09" 
[14] "v08" "s08" "v07" "s07" "v06" "s06" "v05" "s05" "v04" "s04" "v03" "s03" "v02" 
[27] "s02" "s01" "v00" 

What could be the problem?

(NewString was produced using the rvest package.)

Edit: charToRaw gives the following output:

> charToRaw(lol) 
[1] 73 31 34 c2 a0 76 31 33 c2 a0 73 31 33 c2 a0 73 31 33 c2 a0 76 31 32 c2 a0 
[26] 73 31 32 c2 a0 76 31 31 c2 a0 73 31 31 c2 a0 76 31 30 c2 a0 73 31 30 c2 a0 
[51] 73 31 30 c2 a0 76 30 39 c2 a0 73 30 39 c2 a0 76 30 38 c2 a0 73 30 38 c2 a0 
[76] 76 30 37 c2 a0 73 30 37 c2 a0 76 30 36 c2 a0 73 30 36 c2 a0 76 30 35 c2 a0 
[101] 73 30 35 c2 a0 76 30 34 c2 a0 73 30 34 c2 a0 76 30 33 c2 a0 73 30 33 c2 a0 
[126] 76 30 32 c2 a0 73 30 32 c2 a0 73 30 31 c2 a0 76 30 30 c2 a0 

What do 'str(NewString)' and 'dput(NewString)' say? – lukeA

'> str(NewString) chr "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00" > dput(NewString) "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00"' –

Then 'strsplit(NewString, " ")' produces your last output. – lukeA

Answers

This can be done with the stringi package and a regex split.

First, let's build a string separated by the same character (194, 160 is hex C2 A0):

s=rawToChar(as.raw(c(65,66,48,194, 160,65,67,49,194,160,65,68,50))) 

> s 
[1] "AB0 AC1 AD2" 
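
To see what those two bytes actually are, the string can be decoded to code points. This is a minimal sketch, assuming the bytes are UTF-8 (which the repeated c2 a0 pairs in the charToRaw output suggest). Under that assumption the separator is the single code point 160, U+00A0 (no-break space), not an ordinary space (code point 32).

```r
# Rebuild the sample string from raw bytes; 194 160 (hex c2 a0) is the separator.
s <- rawToChar(as.raw(c(65, 66, 48, 194, 160, 65, 67, 49, 194, 160, 65, 68, 50)))
Encoding(s) <- "UTF-8"  # assumption: the bytes are UTF-8

# Decode to Unicode code points; the separator shows up as 160 (U+00A0).
cp <- utf8ToInt(s)
print(cp)

# An ordinary space is code point 32, so splitting on " " has nothing to match.
sep_is_nbsp <- cp[4] == 160L
print(sep_is_nbsp)
```

This would explain why pasting the printed output works: the console text that gets copied apparently contains ordinary spaces, while the object itself still holds no-break spaces.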

Plain str_split doesn't work:

> str_split(s,"\\s+") 
[[1]] 
[1] "AB0 AC1 AD2" 

But install stringi and then:

> stri_split(s,regex="\\s+") 
[[1]] 
[1] "AB0" "AC1" "AD2" 

I suspect stringi has a broader concept of what counts as whitespace (\s).
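
If installing stringi is not an option, base R can probably do the same job once the separator is named explicitly. The following is only a sketch, assuming the string is UTF-8 and the separator is always U+00A0 (the c2 a0 bytes from the charToRaw output); the variable names are made up for illustration.

```r
# Same sample string as above, marked as UTF-8.
s <- rawToChar(as.raw(c(65, 66, 48, 194, 160, 65, 67, 49, 194, 160, 65, 68, 50)))
Encoding(s) <- "UTF-8"

# Option 1: split directly on the no-break space.
parts <- strsplit(s, "\u00a0", fixed = TRUE)[[1]]
print(parts)

# Option 2: replace no-break spaces with ordinary spaces, then split as usual.
normalised <- gsub("\u00a0", " ", s, fixed = TRUE)
parts2 <- strsplit(normalised, " ", fixed = TRUE)[[1]]
print(parts2)
```

Option 2 may be the more convenient one when the text comes from scraped HTML, since the cleaned string can then be reused with any function that expects ordinary spaces.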
