2011-09-01

Using the reshape() function in R: from wide to long

I would like to rearrange data in R from something like:

Patient ID,Episode Number,Admission Date (A),Admission Date (H),Admission Time (A),Admission Time (H) 
1,5,20/08/2011,21/08/2011,1200,1300 
2,6,21/08/2011,22/08/2011,1300,1400 
3,7,22/08/2011,23/08/2011,1400,1500 
4,8,23/08/2011,24/08/2011,1500,1600 

to something like:

Record Type,Patient ID,Episode Number,Admission Date,Admission Time 
H,1,5,20/08/2011,1200 
A,1,5,21/08/2011,1300 
H,2,6,21/08/2011,1300 
A,2,6,22/08/2011,1400 
H,3,7,22/08/2011,1400 
A,3,7,23/08/2011,1500 
H,4,8,23/08/2011,1500 
A,4,8,24/08/2011,1600 

(I am using CSV format here because it makes the samples easier to use as test data.)

I tried the reshape() function and it kind of works:

> reshape(foo, direction = "long", idvar = 1, varying = 3:dim(foo)[2],
+         sep = "..", timevar = "dataset")
    Patient.ID Episode.Number dataset Admission.Date Admission.Time 
1.A.   1    5  A.  20/08/2011   1200 
2.A.   2    6  A.  21/08/2011   1300 
3.A.   3    7  A.  22/08/2011   1400 
4.A.   4    8  A.  23/08/2011   1500 
1.H.   1    5  H.  21/08/2011   1300 
2.H.   2    6  H.  22/08/2011   1400 
3.H.   3    7  H.  23/08/2011   1500 
4.H.   4    8  H.  24/08/2011   1600 

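But that is not quite the format I want (for each "Patient ID", I want the first row to be "H" and the second row to be "A").

In addition, when I extend this to the real data (which has 250+ columns), it fails: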

> reshape(realdata, direction = "long", idvar = 1,
+         varying = 6:dim(foo)[2], sep = "..", timevar = "dataset")
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : 
    'varying' arguments must be the same length 

I think part of the reason is that the colnames look like this:

> colnames(foo) 
    [1] "Unique.Key"          
    [2] "Campus.Code"         
    [3] "UR"            
    [4] "Terminal.digit"         
    [5] "Admission.date..A."      
    [6] "Admission.date..H."      
    [7] "Admission.time..A."      
    [8] "Admission.time..H."  
    . 
    . 
    . 
[31] "Medicare.Number"        
[32] "Payor"           
[33] "Doctor.specialty"        
[34] "Clinic"  
    . 
    . 
    . 
[202] "Admission.Source..A."      
[203] "Admission.Source..H." 

That is, there are "common columns" (with no suffix) in between the columns that have suffixes (I hope that makes sense).

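Comment: I don't see what is wrong with the reshape() command. It looks like it gives you the data in the format you want, although with the rows in a different order. (You can easily change the row order with the order() function.)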

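Answers

You can probably get what you want using melt and cast, or reshape, but you are after something fairly specific, so it may be simpler to do the reshaping directly. You can split the original data into two separate data frames (one for A, one for H) and then glue them back together.

The code below works on your sample data, but I have also tried to write it flexibly enough that it should work on your larger dataset, as long as the columns are named consistently with the "..A." and "..H." suffixes.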

#grab the common columns and the "A" columns 
#(by using grepl to find any column that doesn't end in ".H.") 
foo.a <- foo[,!grepl(x=colnames(foo),pattern = "\\.H\\.$")] 

#strip the "..A." from the end of the "..A." column names
colnames(foo.a) <- sub(x=colnames(foo.a),
        pattern="(.*)\\.\\.A\\.$",
        replacement = "\\1")
foo.a$Record.Type <- "A" 

#grab the common columns and the "H" columns 
#(by using grepl to find any column that doesn't end in ".A.") 
foo.h <- foo[,!grepl(x=colnames(foo),pattern = "\\.A\\.$")] 

#strip the "..H." from the end of the "..H." column names
colnames(foo.h) <- sub(x=colnames(foo.h),
        pattern="(.*)\\.\\.H\\.$",
        replacement = "\\1")
foo.h$Record.Type <- "H" 

#stick them back together 
new.foo <- rbind(foo.a,foo.h) 

#order by Patient.ID 
new.foo <- new.foo[with(new.foo,order(Patient.ID)),] 

#re-order the columns as you like
#(these positions fit the 5-column sample; adjust them for the real data)
new.foo <- new.foo[,c(1,2,5,3,4)]

This gives me:

> new.foo 
    Patient.ID Episode.Number Record.Type Admission.Date Admission.Time 
1   1    5   A  20/08/2011   1200 
5   1    5   H  21/08/2011   1300 
2   2    6   A  21/08/2011   1300 
6   2    6   H  22/08/2011   1400 
3   3    7   A  22/08/2011   1400 
7   3    7   H  23/08/2011   1500 
4   4    8   A  23/08/2011   1500 
8   4    8   H  24/08/2011   1600 
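
If the H row has to come before the A row for each patient, as the question asks, the same split-and-stack idea can be wrapped in a small function. This is only a sketch: take_type() is a hypothetical helper name, the sample data frame is rebuilt by hand with the column names read.csv() would produce, and the factor levels encode the assumed H-then-A order.

```r
# Sample data from the question, using the names read.csv() would produce.
foo <- data.frame(
  Patient.ID = 1:4,
  Episode.Number = 5:8,
  Admission.Date..A. = c("20/08/2011", "21/08/2011", "22/08/2011", "23/08/2011"),
  Admission.Date..H. = c("21/08/2011", "22/08/2011", "23/08/2011", "24/08/2011"),
  Admission.Time..A. = c(1200, 1300, 1400, 1500),
  Admission.Time..H. = c(1300, 1400, 1500, 1600)
)

# Hypothetical helper: keep the common columns plus the columns ending in
# "..<type>.", strip that suffix, and tag every row with the record type.
take_type <- function(df, type, all_types = c("A", "H")) {
  other <- setdiff(all_types, type)
  other_pattern <- paste0("\\.\\.(", paste(other, collapse = "|"), ")\\.$")
  part <- df[, !grepl(other_pattern, names(df)), drop = FALSE]
  names(part) <- sub(paste0("\\.\\.", type, "\\.$"), "", names(part))
  part$Record.Type <- type
  part
}

long <- rbind(take_type(foo, "H"), take_type(foo, "A"))

# Sort by patient, then H before A via an explicit factor level order.
long <- long[order(long$Patient.ID,
                   factor(long$Record.Type, levels = c("H", "A"))), ]

# Move Record.Type to the front by name rather than by position.
long <- long[, c("Record.Type", setdiff(names(long), "Record.Type"))]
rownames(long) <- NULL
long
```

Selecting columns by name avoids the hard-coded positions, which matters once there are 250+ columns.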
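
The suggestions to use melt and cast (now dcast and family) from the "reshape" (now "reshape2") package will not, on their own, get your data into the form you are looking for. In particular, if your end goal is the "semi-long" format you describe, you'll need to do some additional processing.

You raise two problems in your question:

The first is the ordering of the results. As @RichieCotton points out in his comment and @mac in his answer, a call to order() is enough to solve that problem.

The second is the error: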

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, : 
    'varying' arguments must be the same length 

This is because, as you guessed, your varying = 6:dim(foo)[2] selection includes columns that do not vary.
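
To see which columns in a positional range are the culprits, a quick check along these lines may help. It is a sketch only: cols is a shortened, hand-typed subset of the column names printed in the question, and the 5:length(cols) range is an illustrative stand-in for the real one.

```r
# Shortened subset of the column names shown in the question (assumption:
# the real data follows the same "..A." / "..H." suffix convention).
cols <- c("Unique.Key", "Campus.Code", "UR", "Terminal.digit",
          "Admission.date..A.", "Admission.date..H.",
          "Medicare.Number", "Payor",
          "Admission.Source..A.", "Admission.Source..H.")

# A positional catch-all range, like the one used in the question.
candidate <- cols[5:length(cols)]

# Columns in that range that carry no A/H suffix, i.e. do not vary.
has_suffix <- grepl("\\.\\.[AH]\\.$", candidate)
non_varying <- candidate[!has_suffix]
non_varying

# Each remaining stem should appear once per record type (A and H).
stems <- sub("\\.\\.[AH]\\.$", "", candidate[has_suffix])
table(stems)
```

Any name that shows up in non_varying has to be kept out of the varying argument.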

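A simple way to fix this is to use grep to work out which columns vary, and to use that to specify your columns instead of the (incorrect) catch-all range you used. Here is a worked example: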

set.seed(1) 
foo <- data.frame(Unique.Key = 1:4, Campus.Code = LETTERS[1:4], 
        Admission.Date..A = 11:14, Admission.Date..H = 21:24, 
        Medicare.Number = letters[1:4], Payor = letters[1:4], 
        Admission.Source..A = rnorm(4), 
        Admission.Source..H = rnorm(4)) 
foo 
# Unique.Key Campus.Code Admission.Date..A Admission.Date..H Medicare.Number 
# 1   1   A    11    21    a 
# 2   2   B    12    22    b 
# 3   3   C    13    23    c 
# 4   4   D    14    24    d 
# Payor Admission.Source..A Admission.Source..H 
# 1  a   -0.6264538   0.3295078 
# 2  b   0.1836433   -0.8204684 
# 3  c   -0.8356286   0.4874291 
# 4  d   1.5952808   0.7383247 

Find out which columns vary, and use that as your varying argument:

varyingCols <- grep("\\.\\.A$|\\.\\.H$", names(foo)) 

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
       varying = varyingCols, sep = "..") 
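# Sort by key, then H before A; the factor levels make that order explicit
out[order(out$Unique.Key, factor(out$time, levels = c("H", "A"))), ]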
#  Unique.Key Campus.Code Medicare.Number Payor time Admission.Date Admission.Source 
# 1.H   1   A    a  a H    21  0.3295078 
# 1.A   1   A    a  a A    11  -0.6264538 
# 2.H   2   B    b  b H    22  -0.8204684 
# 2.A   2   B    b  b A    12  0.1836433 
# 3.H   3   C    c  c H    23  0.4874291 
# 3.A   3   C    c  c A    13  -0.8356286 
# 4.H   4   D    d  d H    24  0.7383247 
# 4.A   4   D    d  d A    14  1.5952808 

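If your data is small (not many columns), you can count the positions of the varying columns by hand and pass them as a vector. As you have already noticed, any column not named in idvar or varying is recycled appropriately.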

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
       varying = c(3, 4, 7, 8), sep = "..")
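
Note that the real column names shown in the question end in a trailing dot ("..A." rather than "..A"), so the grep pattern used in the example would not match them as written. One way to adapt it, as a minimal sketch: drop the trailing dot first, then reshape. The data frame below is made up for illustration, and naming the time variable Record.Type is an assumption based on the desired output in the question.

```r
# Made-up data whose names follow the question's "..A." / "..H." convention.
foo <- data.frame(
  Unique.Key = 1:4,
  Campus.Code = LETTERS[1:4],
  Admission.date..A. = 11:14,
  Admission.date..H. = 21:24,
  Medicare.Number = letters[1:4],
  Admission.Source..A. = 31:34,
  Admission.Source..H. = 41:44
)

# Drop the trailing dot so each suffix is exactly "..A" or "..H".
names(foo) <- sub("\\.\\.([AH])\\.$", "..\\1", names(foo))

# Only the suffixed columns vary; everything else is carried along.
varyingCols <- grep("\\.\\.[AH]$", names(foo))

out <- reshape(foo, direction = "long", idvar = "Unique.Key",
               varying = varyingCols, sep = "..", timevar = "Record.Type")

# Sort by key, then H before A within each key.
out <- out[order(out$Unique.Key,
                 factor(out$Record.Type, levels = c("H", "A"))), ]
out
```

Renaming first keeps the suffix out of the time column, which otherwise would come out as "A." and "H." as in the question's first attempt.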