2017-08-05
0

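Data preparation to show the numbers in transit

Ultimately I want to create a nice plot with circlize, but to get there I need to show the number of people who go from A to B, B to C, B to A and so on.

My dataset: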

#Generate some sample data: 

proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE) 
#Sample dates 
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100) 
#Generate 20 hospital numbers in no particular order: 
Id<-sample(c("P43","P63","K52","G24","S55","D07","U87","P22","Y76","I92","P22","P02","U22415","U23","S14","O34","T62","J32","F63","T43"), 100, replace = TRUE) 
df<-data.frame(proc,dat,Id) 

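If I were preparing the data for a Sankey plot, I would do:

library(data.table) 
Sankey<-dcast(setDT(df)[order(dat)][, if(any(proc=="EMR"|proc=="RFA")) .SD, Id], Id~rowid(Id), value.var ="proc") 

This gives me a nice table showing what happens to each patient at each time point, in order.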

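But I want to go a step further and find the number of patients moving between each pair of proc types (i.e. "EMR", "RFA" and "Biopsies"), so that I can get them into the format circlize wants, i.e. (the frequencies here are made up):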

origin destination frequency 
EMR  RFA   14 
EMR  Biopsies  4 
EMR  EMR   10 
RFA  RFA   24 
RFA  Biopsies  42 
RFA  EMR   1 
Biopsies RFA   3 
Biopsies Biopsies  6 
Biopsies EMR   16 

Or, another way to present the same thing would be:

                 destination
origin       EMR   RFA   Biopsies
EMR           10    14          4
RFA            1    24         42
Biopsies      16     3          6
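
The two layouts above carry the same information, so one can be derived from the other. A minimal sketch in base R, using the made-up frequencies from the question (the object names `flows` and `flow_matrix` are illustrative, not part of the original post):

```r
# Long format: one row per origin/destination pair
# (made-up frequencies copied from the question)
procs <- c("EMR", "RFA", "Biopsies")
flows <- data.frame(
  origin      = factor(rep(procs, each = 3), levels = procs),
  destination = factor(rep(c("RFA", "Biopsies", "EMR"), times = 3), levels = procs),
  frequency   = c(14, 4, 10, 24, 42, 1, 3, 6, 16)
)

# Wide format: origins as rows, destinations as columns
flow_matrix <- xtabs(frequency ~ origin + destination, data = flows)
print(flow_matrix)
```

As far as I know, circlize::chordDiagram() accepts either a three-column data frame like `flows` or an adjacency matrix like `flow_matrix`, but check the package documentation before relying on that.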

Answers

1

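I would use dplyr for this task. The core of the analysis is dplyr's lag() function, which retrieves each patient's previous procedure, plus n() to count the cases in each origin-destination pair.

The whole analysis would look like this: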


# for reproducibility 
set.seed(20170805) 

# your data 
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE) 
#Sample dates 
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100) 
#Generate 20 hospital numbers in no particular order: 
Id<-sample(c("P43","P63","K52","G24","S55","D07","U87","P22","Y76","I92","P22","P02","U22415","U23","S14","O34","T62","J32","F63","T43"), 100, replace = TRUE) 

# my approach using dplyr 
library(dplyr) 
#> 
#> Attaching package: 'dplyr' 
#> The following objects are masked from 'package:stats': 
#> 
#>  filter, lag 
#> The following objects are masked from 'package:base': 
#> 
#>  intersect, setdiff, setequal, union 
df <- tibble(proc, dat, Id) 

df %>% 
  # make sure we progress in chronological order 
  arrange(dat) %>% 
  # for each patient: 
  group_by(Id) %>% 
  # find the previous procedure (origin) and the current one (destination) 
  mutate(origin = lag(proc, 1), destination = proc) %>% 
  # for each origin-destination pair... 
  group_by(origin, destination) %>% 
  # ...count the number of occurrences 
  summarise(n = n()) %>% 
  # not strictly necessary, but gives slightly nicer output here 
  ungroup() 
#> # A tibble: 12 x 3 
#>  origin destination  n 
#>  <chr>  <chr> <int> 
#> 1 Biopsies Biopsies  5 
#> 2 Biopsies   EMR  8 
#> 3 Biopsies   RFA 11 
#> 4  EMR Biopsies 11 
#> 5  EMR   EMR 11 
#> 6  EMR   RFA 10 
#> 7  RFA Biopsies  6 
#> 8  RFA   EMR 12 
#> 9  RFA   RFA  8 
#> 10  <NA> Biopsies  8 
#> 11  <NA>   EMR  4 
#> 12  <NA>   RFA  6 
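
In the output above, the rows with an `<NA>` origin are each patient's first recorded procedure, which has no predecessor. If those should not count as transitions, they can be filtered out before counting. A minimal sketch on hypothetical toy data (the `visits` table and the `frequency` column name are assumptions for illustration, not the asker's data):

```r
library(dplyr)

# Hypothetical toy data: two patients, rows deliberately not in date order
visits <- tibble(
  Id   = c("P1", "P1", "P1", "P2", "P2"),
  dat  = as.Date(c("2015-03-01", "2014-01-01", "2016-06-01",
                   "2014-05-01", "2015-05-01")),
  proc = c("RFA", "EMR", "Biopsies", "EMR", "RFA")
)

transitions <- visits %>%
  arrange(dat) %>%                      # chronological order
  group_by(Id) %>%                      # lag within each patient only
  mutate(origin = lag(proc), destination = proc) %>%
  ungroup() %>%
  filter(!is.na(origin)) %>%            # drop first visits (no origin)
  count(origin, destination, name = "frequency")

print(transitions)
```

Here P1 goes EMR, RFA, Biopsies and P2 goes EMR, RFA, so the EMR to RFA transition is counted twice and RFA to Biopsies once.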
+0

OK @David, the points are yours. I also came up with another answer –

+0

Of course, you could also use data.table or anything else :) – David
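
For reference, the same lag-and-count idea can be sketched in data.table, where shift() plays the role of lag(). This is an illustrative sketch on hypothetical toy data (the `visits` table is an assumption), not code from either poster:

```r
library(data.table)

# Hypothetical toy data: two patients, rows deliberately not in date order
visits <- data.table(
  Id   = c("P1", "P1", "P1", "P2", "P2"),
  dat  = as.Date(c("2015-03-01", "2014-01-01", "2016-06-01",
                   "2014-05-01", "2015-05-01")),
  proc = c("RFA", "EMR", "Biopsies", "EMR", "RFA")
)

# Sort by patient and date, then take the previous procedure within each patient
setorder(visits, Id, dat)
visits[, origin := shift(proc), by = Id]

# Count origin/destination pairs, skipping first visits (origin is NA)
transitions <- visits[!is.na(origin),
                      .(frequency = .N),
                      by = .(origin, destination = proc)]
print(transitions)
```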

0

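I managed to do this in a cunning way, basically by pasting all the columns together, then using the stringr package to split them apart into pairs, and then tabulating.

library(data.table) 
library(stringr) 
Sankey<-dcast(setDT(df)[, if(any(proc=="EMR"|proc=="RFA")) .SD, Id], Id~rowid(Id), value.var ="proc") 

# paste each patient's procedures into one string, e.g. "EMR-RFA-Biopsies" 
Sankey$x <- apply(Sankey[ , 2:ncol(Sankey)] , 1 , paste , collapse = "-") 
# extract "origin-destination" pairs and tabulate them 
myList<-unlist(str_extract_all(Sankey$x,"[A-Za-z]+-[A-Za-z]+")) 

table(myList) 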
+0

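Although now that I think about it, it's not so cunning after all, because it gives the wrong result, hehe! Full marks for creative problem solving, though. –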
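
One likely reason for the wrong counts: regex extraction returns non-overlapping matches, so in a path such as "EMR-RFA-Biopsies" the RFA is consumed by the first match and the RFA to Biopsies step is never counted. The NA padding produced by dcast is also pasted in as the literal text "NA", which the letter pattern then matches. A minimal base R sketch that pairs every step with the next one instead (the example paths are hypothetical):

```r
# Hypothetical example path for one patient, in date order
path <- c("EMR", "RFA", "Biopsies", "RFA")

# Pair every procedure with the one that follows it (overlapping pairs)
pairs <- paste(head(path, -1), tail(path, -1), sep = "-")
print(pairs)

# Applied to several pasted paths: split each string, then pair adjacent steps
paths <- c("EMR-RFA-Biopsies", "RFA-EMR", "Biopsies")
all_pairs <- unlist(lapply(strsplit(paths, "-", fixed = TRUE), function(p) {
  if (length(p) < 2) return(character(0))   # a single visit has no transition
  paste(head(p, -1), tail(p, -1), sep = "-")
}))
pair_counts <- table(all_pairs)
print(pair_counts)
```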