2016-06-12

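Extracting numbers from strings in a data frame

I am hoping someone can show me a way to extract data from a character vector.

The data frame is as follows: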

structure(list(Sensitivity = structure(c(1L, 5L, 4L, 4L, 4L, 
4L, 3L, 5L, 2L), .Label = c(" 1.01 [ 0.21, 2.91]", " 89.60 [ 85.56, 92.82]", 
" 92.95 [ 89.43, 95.59]", " 99.66 [ 98.14, 99.99]", " 100.00 [ 98.77, 100.00]" 
), class = "factor"), Specificity = structure(c(8L, 1L, 3L, 4L, 
2L, 5L, 6L, 1L, 7L), .Label = c(" 27.17 [ 25.15, 29.26]", " 44.96 [ 42.67, 47.26]", 
" 53.31 [ 51.00, 55.61]", " 69.90 [ 67.75, 71.99]", " 70.23 [ 68.08, 72.31]", 
" 90.18 [ 88.73, 91.50]", " 91.70 [ 90.35, 92.92]", " 100.00 [ 99.80, 100.00]" 
), class = "factor")), .Names = c("Sensitivity", "Specificity" 
), class = "data.frame", row.names = c(NA, -9L)) 

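As an example, for the first element of the first column I would ideally get 1.01, 0.21 and 2.91 as three columns of data.

The first and second values are separated by "[", and the second and third by ",". I am not good with grep; I have tried using it but went wrong somewhere!

Answers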

1

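Here is a regular-expression solution. You can use str_extract_all from the stringr package. The pattern \\d+\\.\\d+ matches a decimal number: one or more digits, followed by a literal . and then one or more further digits.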

library(stringr) 
lapply(df, function(col) do.call(rbind, str_extract_all(col, "\\d+\\.\\d+"))) 

$Sensitivity 
     [,1]  [,2] [,3]  
[1,] "1.01" "0.21" "2.91" 
[2,] "100.00" "98.77" "100.00" 
[3,] "99.66" "98.14" "99.99" 
[4,] "99.66" "98.14" "99.99" 
[5,] "99.66" "98.14" "99.99" 
[6,] "99.66" "98.14" "99.99" 
[7,] "92.95" "89.43" "95.59" 
[8,] "100.00" "98.77" "100.00" 
[9,] "89.60" "85.56" "92.82" 

$Specificity 
     [,1]  [,2] [,3]  
[1,] "100.00" "99.80" "100.00" 
[2,] "27.17" "25.15" "29.26" 
[3,] "53.31" "51.00" "55.61" 
[4,] "69.90" "67.75" "71.99" 
[5,] "44.96" "42.67" "47.26" 
[6,] "70.23" "68.08" "72.31" 
[7,] "90.18" "88.73" "91.50" 
[8,] "27.17" "25.15" "29.26" 
[9,] "91.70" "90.35" "92.92" 
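Note that the matrices above hold character strings, not numbers. A minimal sketch of one way to convert them, using base R's regmatches() and gregexpr() as a stand-in for str_extract_all() so that no package is needed; the vector x is a two-row illustration, not the full data from the question:

```r
# Two-row stand-in for one column of the question's data (illustrative only)
x <- c("  1.01 [  0.21,   2.91]", "100.00 [ 98.77, 100.00]")

# Base R counterpart of str_extract_all(): one character vector per string
parts <- regmatches(x, gregexpr("\\d+\\.\\d+", x))

# Stack the pieces into a matrix, then change its storage from character to double
m <- do.call(rbind, parts)
storage.mode(m) <- "double"
m
```

This assumes every string yields exactly three matches; rows with a different number of matches would not bind cleanly.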
+0

This is really neat. May I ask what "\\d+\\.\\d+" means? I am guessing "\\." is the decimal point, but what does \\d+ do? – user3919790

+0

'\\d+' matches one or more digits. '\\d' stands for a digit, i.e. [0-9], and '+' means one or more occurrences. – Psidom
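To make the pieces of the pattern concrete, here is a small sketch in base R (the sample strings s and s2 are illustrative). It also shows a limitation worth knowing: a value written without a decimal part would not be matched.

```r
s <- " 89.60 [ 85.56, 92.82]"

# "\\d+" alone matches every run of digits, so each number splits at the dot
digits_only <- regmatches(s, gregexpr("\\d+", s))[[1]]
digits_only
# "89" "60" "85" "56" "92" "82"

# "\\d+\\.\\d+" requires digits, a literal dot, then digits again
decimals <- regmatches(s, gregexpr("\\d+\\.\\d+", s))[[1]]
decimals
# "89.60" "85.56" "92.82"

# A value with no decimal part ("100" here) is skipped by this pattern
s2 <- "100 [ 98.77, 100.00]"
partial <- regmatches(s2, gregexpr("\\d+\\.\\d+", s2))[[1]]
partial
# "98.77" "100.00"
```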

1

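Try this:

# For each column: drop "]", split on ",", then split on "[",
# and fill a 3-column numeric matrix row by row.
cbind(
  matrix(as.numeric(unlist(strsplit(unlist(strsplit(gsub("]", "",
      dat$Sensitivity, fixed = TRUE), ",")), "\\["))), ncol = 3, byrow = TRUE),
  matrix(as.numeric(unlist(strsplit(unlist(strsplit(gsub("]", "",
      dat$Specificity, fixed = TRUE), ",")), "\\["))), ncol = 3, byrow = TRUE)
)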

     [,1] [,2] [,3] [,4] [,5] [,6] 
[1,] 1.01 0.21 2.91 100.00 99.80 100.00 
[2,] 100.00 98.77 100.00 27.17 25.15 29.26 
[3,] 99.66 98.14 99.99 53.31 51.00 55.61 
[4,] 99.66 98.14 99.99 69.90 67.75 71.99 
[5,] 99.66 98.14 99.99 44.96 42.67 47.26 
[6,] 99.66 98.14 99.99 70.23 68.08 72.31 
[7,] 92.95 89.43 95.59 90.18 88.73 91.50 
[8,] 100.00 98.77 100.00 27.17 25.15 29.26 
[9,] 89.60 85.56 92.82 91.70 90.35 92.92 
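The nested calls can be hard to read. As a sketch only, the same idea can be wrapped in a small helper that splits on either delimiter in one step; split_ci is a hypothetical name, not part of the answer, and the input x is a two-row illustration:

```r
# Hypothetical helper: turn strings like " 1.01 [ 0.21, 2.91]" into a 3-column matrix
split_ci <- function(x) {
  # Remove the closing "]" and coerce factors to character
  cleaned <- sub("]", "", as.character(x), fixed = TRUE)
  # Split each string on "[" or ","
  pieces <- strsplit(cleaned, "\\[|,")
  # as.numeric() ignores the surrounding spaces
  matrix(as.numeric(unlist(pieces)), ncol = 3, byrow = TRUE)
}

x <- c(" 1.01 [ 0.21, 2.91]", " 100.00 [ 98.77, 100.00]")
out <- split_ci(x)
out
```

With the question's data, cbind(split_ci(dat$Sensitivity), split_ci(dat$Specificity)) should give the same six-column matrix as above.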
0

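Here is an option using base R that extracts the numeric parts as type numeric. Both "[" and "]" are replaced with ", " so that each string can be read as a CSV line; the empty fourth column created by the trailing delimiter is then dropped.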

lst <- lapply(d1, function(x) read.csv(text=gsub("[][]", ", ", x), header=FALSE)[-4]) 
lst 
#$Sensitivity 
#  V1 V2  V3 
#1 1.01 0.21 2.91 
#2 100.00 98.77 100.00 
#3 99.66 98.14 99.99 
#4 99.66 98.14 99.99 
#5 99.66 98.14 99.99 
#6 99.66 98.14 99.99 
#7 92.95 89.43 95.59 
#8 100.00 98.77 100.00 
#9 89.60 85.56 92.82 

#$Specificity 
#  V1 V2  V3 
#1 100.00 99.80 100.00 
#2 27.17 25.15 29.26 
#3 53.31 51.00 55.61 
#4 69.90 67.75 71.99 
#5 44.96 42.67 47.26 
#6 70.23 68.08 72.31 
#7 90.18 88.73 91.50 
#8 27.17 25.15 29.26 
#9 91.70 90.35 92.92 

If needed, the list of data.frames can be converted to a single data.frame by cbinding:

do.call(cbind, lst)
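The default column names V1, V2 and V3 say little about what the values are. A minimal sketch of renaming the columns before combining, on a two-row stand-in for d1; the names est, lower and upper are assumptions about the meaning of the three values, not something stated in the question:

```r
# Two-row stand-in for the question's data (illustrative only)
d1 <- data.frame(
  Sensitivity = c(" 1.01 [ 0.21, 2.91]", " 100.00 [ 98.77, 100.00]"),
  Specificity = c(" 100.00 [ 99.80, 100.00]", " 27.17 [ 25.15, 29.26]")
)

# Same extraction as in the answer: read each string as a CSV line
lst <- lapply(d1, function(x)
  read.csv(text = gsub("[][]", ", ", x), header = FALSE)[-4])

# Assumed labels for the three values in each string
lst <- lapply(lst, setNames, c("est", "lower", "upper"))

# Combine; the list names are used as prefixes for the column names
res <- do.call(cbind, lst)
names(res)
```

The combined names should then come out in the form Sensitivity.est, Sensitivity.lower and so on.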