2012-03-31 63 views
3

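Extract elements between a character and a space

I'm having trouble extracting the elements between a / and a blank space. I can do it when there are two delimiting characters, such as < and >, but the space throws me off. I'd like the most efficient way to do this in base R, since it will be applied to thousands of vectors.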

I'd like to turn this:

x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG" 

into this:

[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG" 

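EDIT:

Thanks, everyone, for the answers. If we're going for speed, Andres's code wins. DWin's code wins on brevity. Dirk's was the second fastest. The stringr solution was the slowest (as I expected) and isn't in base R, but it is easy to understand (which I think is really the intent of the stringr package, as this seems to be Hadley's philosophy with most things).

I appreciate your help. Thanks again.

I thought I'd include the benchmarks, since this will be lapplied over several thousand vectors: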

    test replications elapsed relative user.self sys.self
1 ANDRES        10000    1.06 1.000000      1.05        0
3   DIRK        10000    1.29 1.216981      1.20        0
2   DWIN        10000    1.56 1.471698      1.43        0
4 FLODEL        10000    8.46 7.981132      7.70        0
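The call that produced the table is not shown. As a rough sketch only, a comparable timing of the three base R answers could be set up with `system.time`. The wrapper names `ANDRES`, `DIRK` and `DWIN` simply mirror the table labels, the replication count `n` is an arbitrary smaller value, and the stringr-based entry is left out so the sketch stays in base R. Absolute timings will differ by machine.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# Wrappers around the three base R answers (names are assumptions
# chosen to match the labels in the benchmark table).
ANDRES <- function() sub("^.*/([^ ]+).*$", "\\1", unlist(strsplit(x, " ")))
DIRK <- function() {
  matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
         ncol = 2, byrow = TRUE)[, 2]
}
DWIN <- function() strsplit(x, "/|\\s")[[1]][seq(2, 24, by = 2)]

n <- 1000  # assumed; the table above used 10000 replications

# Elapsed seconds for n calls of each function.
timings <- sapply(
  list(ANDRES = ANDRES, DIRK = DIRK, DWIN = DWIN),
  function(f) system.time(for (i in seq_len(n)) f())[["elapsed"]]
)
timings
```

Dividing `timings` by `min(timings)` would give something analogous to the `relative` column.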

Answers

5

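Similar, but a bit more concise:

    # 1. Split the string into elements on the blank space
    y <- unlist(strsplit(x, " "))

    # 2. Extract just the part you want from each element
    sub("^.*/([^ ]+).*$", "\\1", y)

Here the anchors ^ and $ match the start and end of the text respectively, .* matches any run of characters, [^ ]+ captures the non-blank characters after the last slash, and \\1 refers to the first captured group.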
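Since the question mentions applying this over thousands of vectors, the two steps can be wrapped in a small function and passed to `lapply`. This is only a sketch: the name `extract_tags` and the `sentences` list are made up for illustration.

```r
# Hypothetical helper combining the split and the sub() step.
extract_tags <- function(s) {
  tokens <- unlist(strsplit(s, " ", fixed = TRUE))
  sub("^.*/([^ ]+).*$", "\\1", tokens)
}

# Made-up input: a list of tagged sentences.
sentences <- list(
  "This/DT is/VBZ a/DT short/JJ sentence/NN",
  "and/CC adjectives./VBG"
)

# One character vector of tags per sentence.
tags <- lapply(sentences, extract_tags)
```

Note that a token containing no slash does not match the pattern, so `sub` returns it unchanged.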

+0

I like that one. Compact, and no need to look at the length of the intermediate result. – 2012-03-31 20:40:29

+0

Regular expressions make me need a beer. – 2012-03-31 20:55:11

2

Here is a one-liner:

R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+             "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")), 
+   ncol=2, byrow=TRUE)[,2] 
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG" 
R> 

The key is getting rid of the "text before the slash":

R> gsub("[a-zA-Z.,]*/", " ", x) 
[1] " DT  VBZ  DT  JJ  NN  VBG  IN  DT  JJ  NNS  CC  VBG"
R> 

After that, it is just a matter of splitting the string

R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ") 
[[1]] 
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN" 
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS" 
[21] "" "CC" "" "VBG" 

and filtering out the "" entries. There may be a more compact way to do that last bit.
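One possibly more compact way to drop the empty strings, offered as a sketch rather than as the answerer's own method, is to index with `nzchar` instead of reshaping into a matrix. The variable names `parts` and `tags` are illustrative.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# Same gsub + strsplit as above, keeping the single list element.
parts <- strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")[[1]]

# nzchar() is TRUE for non-empty strings, so this drops the "" entries.
tags <- parts[nzchar(parts)]
```

This avoids relying on the empty strings falling at every odd position.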

3

Use a regex that matches either a forward slash or a whitespace character:

strsplit(x, "/|\\s") 
[[1]] 
 [1] "This"        "DT"          "is"          "VBZ"         "a"           "DT"          "short"
 [8] "JJ"          "sentence"    "NN"          "consisting"  "VBG"         "of"          "IN"
[15] "some"        "DT"          "nouns,"      "JJ"          "verbs,"      "NNS"         "and"
[22] "CC"          "adjectives." "VBG"

I didn't read the question closely enough. One can use that result to extract the even-numbered elements:

strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)] 
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG" 
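The index `seq(2, 24, by=2)` hard-codes the length of this particular sentence. When the same call is applied to many strings of different lengths, a logical index that R recycles can pick the even positions without knowing the length. A minimal sketch, with illustrative variable names:

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

parts <- strsplit(x, "/|\\s")[[1]]

# c(FALSE, TRUE) is recycled along the vector: drop odd, keep even positions.
tags <- parts[c(FALSE, TRUE)]
```

This assumes every token has exactly one slash, so that words and tags strictly alternate.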
1

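The stringr package has nice functions for working with strings, and their names are very intuitive. Here you can use str_extract_all to get all the matches (including the slash), then str_sub to remove the slash:

library(stringr)

str_extract_all(x, "/\\w*") 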
# [[1]] 
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS" 
# [11] "/CC" "/VBG" 

str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2) 
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG" 
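Because the question asked for base R, the same extract-then-trim idea can be sketched without stringr, swapping in `gregexpr` with `regmatches` for `str_extract_all` and `substring` for `str_sub`. This is an alternative sketch, not part of the answer above; `m` and `tags` are illustrative names.

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# All matches of a slash followed by word characters, e.g. "/DT".
m <- regmatches(x, gregexpr("/\\w*", x))[[1]]

# Drop the leading slash by keeping everything from the 2nd character on.
tags <- substring(m, 2)
```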