2017-10-16

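Combining rows in R

I am working with a long-format data frame in R and have run into a problem. My data frame is actually made up of two smaller data frames. I rescaled the timing from months to years so that both share a common timeline.

The problem I now face is that I sometimes have two rows with the same time value (one row per questionnaire), whereas I want only one row per time value. (I have attached a picture of the problem, which may be more insightful than my explanation.) Note that at this point I still want the data frame in long format; I only want to get rid of the "extra rows".

Can anyone tell me how to do this?

Attached is the dput output of the head of the data, where nomem_encr = ID, timeline.compressed = time, sel01-sel04 = part of the first questionnaire, and close_num and gener_sat = part of the second questionnaire.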

structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L,
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28,
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA,
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA,
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L,
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed",
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA,
6L))

https://i.stack.imgur.com/3p038.png

Could you also provide sample data? Use 'head' to create a subset and 'dput' to show it to us so we can reproduce the problem. – Olivia

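In reply to your first comment: I am afraid I do not fully understand it. I suppose that for each row, either the X variables or the Y variables were answered. Sometimes, however, two rows have the same time value, i.e. the X and Y variables were answered at the same time. What I want is to combine those rows into a single row in which both the X and Y variables are answered. – Elisabeth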

How do we know which rows you need to trim? – jaySf

Answers

0

Using the reshape2 and dplyr packages

Load the libraries and the data:

library(reshape2) 
library(dplyr) 

x <- structure(
    list(
        nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
        timeline.compressed = c(79, 79, 95, 79, 28, 28),
        sel01 = c(NA, 6L, NA, NA, NA, 7L),
        sel02 = c(NA, 6L, NA, NA, NA, 7L),
        sel03 = c(NA, 3L, NA, NA, NA, 5L),
        sel04 = c(NA, 6L, NA, NA, NA, 6L),
        close_num = c(1, NA, 0.2, 1, 0.8, NA),
        gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
    ),
    .Names = c(
        "nomem_encr", "timeline.compressed",
        "sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"
    ),
    class = "data.frame",
    row.names = c(NA, 6L)
)
x

This is what your data looks like:

  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79    NA    NA    NA    NA       1.0         7
2     800009                  79     6     6     3     6        NA        NA
3     800009                  95    NA    NA    NA    NA       0.2         7
4     800012                  79    NA    NA    NA    NA       1.0         8
5     800015                  28    NA    NA    NA    NA       0.8         7
6     800015                  28     7     7     5     6        NA        NA

Now we melt the data into long form, with one row per ID, time, and variable:

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>% 
head(15) 

Output:

   nomem_encr timeline.compressed variable value
1      800009                  79    sel01    NA
2      800009                  79    sel01     6
3      800009                  95    sel01    NA
4      800012                  79    sel01    NA
5      800015                  28    sel01    NA
6      800015                  28    sel01     7
7      800009                  79    sel02    NA
8      800009                  79    sel02     6
9      800009                  95    sel02    NA
10     800012                  79    sel02    NA
11     800015                  28    sel02    NA
12     800015                  28    sel02     7
13     800009                  79    sel03    NA
14     800009                  79    sel03     3
15     800009                  95    sel03    NA

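If we cast the melted data frame without specifying an aggregation function, the default behavior is to count how many entries we have for each item: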

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>% 
    dcast(
    formula = nomem_encr + timeline.compressed ~ variable 
) 

Output:

Aggregation function missing: defaulting to length
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79     2     2     2     2         2         2
2     800009                  95     1     1     1     1         1         1
3     800012                  79     1     1     1     1         1         1
4     800015                  28     2     2     2     2         2         2

We have 2 entries for the item identified by 800009 79 (using nomem_encr and timeline.compressed as the identifying variables).
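
To see which ID/time combinations are affected before reshaping anything, the rows per combination can also be counted directly. The following is only a sketch in base R; it uses a reduced copy of the sample data, and the names `key_counts`, `n_rows` and `dup_keys` are made up for illustration.

```r
# Reduced copy of the sample data: the two key columns plus one column
# from each questionnaire.
x <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA)
)

# Count the rows for each ID/time combination by summing a column of ones.
key_counts <- aggregate(
  list(n_rows = rep(1L, nrow(x))),
  by = x[c("nomem_encr", "timeline.compressed")],
  FUN = sum
)

# Keep only the combinations that occur more than once.
dup_keys <- key_counts[key_counts$n_rows > 1, ]
print(dup_keys)
```

For the sample data this should list the two combinations that dcast reports with a count of 2.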

We can change the default behavior to something else, such as sum, by passing fun.aggregate:

melt(data = x, id.vars = c("nomem_encr", "timeline.compressed")) %>% 
    dcast(
    formula = nomem_encr + timeline.compressed ~ variable, 
    fun.aggregate = function(xs) sum(xs, na.rm = TRUE) 
) 

Output:

  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95     0     0     0     0       0.2         7
3     800012                  79     0     0     0     0       1.0         8
4     800015                  28     7     7     5     6       0.8         7
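
One thing to watch with `sum(xs, na.rm = TRUE)`: a combination in which a questionnaire was never answered comes out as 0 rather than NA, and if two rows both held an answer the values would be added together. If that matters, a possible alternative is an aggregation function that keeps the first non-missing value. The sketch below is an assumption on my part, not part of the answer above: `first_non_na` is a made-up helper, the collapse is shown with base R's `aggregate`, and the `dcast` call only runs if reshape2 is installed.

```r
# Same six rows as in the question.
x <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)
keys <- c("nomem_encr", "timeline.compressed")

# Hypothetical helper (not from any package): first non-missing value, or an
# NA of the same type when every value is missing or the input is empty.
first_non_na <- function(v) v[!is.na(v)][1]

# Base R: collapse every non-key column within each ID/time combination.
collapsed <- aggregate(
  x[setdiff(names(x), keys)],
  by = x[keys],
  FUN = first_non_na
)
# aggregate() does not return the groups sorted by ID first, so sort explicitly.
collapsed <- collapsed[order(collapsed$nomem_encr, collapsed$timeline.compressed), ]
print(collapsed)

# The same helper passed to dcast(); skipped when reshape2 is not installed.
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- reshape2::melt(x, id.vars = keys)
  wide <- reshape2::dcast(
    long,
    nomem_encr + timeline.compressed ~ variable,
    fun.aggregate = first_non_na
  )
  print(wide)
}
```

With this helper the unanswered sel columns for 800009 95 and 800012 79 should stay NA instead of becoming 0.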

This seems to work. Thank you very much! – Elisabeth

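Update: I just noticed that when I use this code on my data, it returns zeros, ones, and the occasional two instead of the actual values. I copy-pasted your syntax and applied it to the whole data set. Any idea what might be going wrong? I also get this message: Aggregation function missing: defaulting to length – Elisabeth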

structure(list(nomem_encr = c(800009L, 800009L, 800012L, 800015L, 800015L, 800015L), timeline.compressed = c(79, 95, 79, 28, 40, 52), sel01 = c(1L, 0L, 0L, 1L, 1L, 0L), sel02 = c(1L, 0L, 0L, 1L, 1L, 0L), sel03 = c(1L, 0L, 0L, 1L, 1L, 0L), close_num = c(1L, 1L, 1L, 1L, 1L, 1L), gener_sat = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("nomem_encr", "timeline.compressed", "sel01", "sel02", "sel03", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA, 6L)) – Elisabeth

0

You can do this with dplyr + tidyr:

library(dplyr) 
library(tidyr) 

df %>% 
    group_by(nomem_encr, timeline.compressed) %>% 
    summarize_all(funs(sort(.)[1])) 

Result:

# A tibble: 4 x 8
# Groups:   nomem_encr [?]
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
       <int>               <dbl> <int> <int> <int> <int>     <dbl>     <int>
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95    NA    NA    NA    NA       0.2         7
3     800012                  79    NA    NA    NA    NA       1.0         8
4     800015                  28     7     7     5     6       0.8         7
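
Note that `funs()` has since been deprecated in dplyr. A rough equivalent with `across()` might look like the sketch below; it assumes dplyr 1.0.0 or later and is not the code that produced the output above. As in the original, `sort(.)[1]` picks the smallest non-missing value in each group (sort drops NAs by default), which is the same as the only non-missing value when at most one of the rows was answered.

```r
library(dplyr)

# Same six rows as in the question.
df <- data.frame(
  nomem_encr = c(800009L, 800009L, 800009L, 800012L, 800015L, 800015L),
  timeline.compressed = c(79, 79, 95, 79, 28, 28),
  sel01 = c(NA, 6L, NA, NA, NA, 7L),
  sel02 = c(NA, 6L, NA, NA, NA, 7L),
  sel03 = c(NA, 3L, NA, NA, NA, 5L),
  sel04 = c(NA, 6L, NA, NA, NA, 6L),
  close_num = c(1, NA, 0.2, 1, 0.8, NA),
  gener_sat = c(7L, NA, 7L, 8L, 7L, NA)
)

# For every non-grouping column, keep the smallest non-missing value per
# ID/time combination; groups with only NAs stay NA.
collapsed <- df %>%
  group_by(nomem_encr, timeline.compressed) %>%
  summarise(across(everything(), ~ sort(.x)[1]), .groups = "drop")

print(collapsed)
```

Here `.groups = "drop"` returns an ungrouped tibble, so later steps are not applied per nomem_encr by accident.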

If you want to replace the NAs with zeros, you can do the following:

df %>% 
    group_by(nomem_encr, timeline.compressed) %>% 
    summarize_all(funs(sort(.)[1])) %>% 
    mutate_all(funs(replace(., is.na(.), 0))) 

Result:

# A tibble: 4 x 8
# Groups:   nomem_encr [3]
  nomem_encr timeline.compressed sel01 sel02 sel03 sel04 close_num gener_sat
       <int>               <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
1     800009                  79     6     6     3     6       1.0         7
2     800009                  95     0     0     0     0       0.2         7
3     800012                  79     0     0     0     0       1.0         8
4     800015                  28     7     7     5     6       0.8         7

Data:

df = structure(list(nomem_encr = c(800009L, 800009L, 800009L, 800012L, 
800015L, 800015L), timeline.compressed = c(79, 79, 95, 79, 28, 
28), sel01 = c(NA, 6L, NA, NA, NA, 7L), sel02 = c(NA, 6L, NA, 
NA, NA, 7L), sel03 = c(NA, 3L, NA, NA, NA, 5L), sel04 = c(NA, 
6L, NA, NA, NA, 6L), close_num = c(1, NA, 0.2, 1, 0.8, NA), gener_sat = c(7L, 
NA, 7L, 8L, 7L, NA)), .Names = c("nomem_encr", "timeline.compressed", 
"sel01", "sel02", "sel03", "sel04", "close_num", "gener_sat"), class = "data.frame", row.names = c(NA, 
6L))