2017-06-05 19 views

Long to wide with automatic dummy creation and multiple value columns

I am sitting in front of a data frame that looks like this:

 country year Indicator   a   b  c 
48996  US 2003  var1  NA  NA  NA 
16953  FR 1988  var2  NA 10664.920  NA 
22973  FR 1943  var3  NA 5774.334  NA 
8760  CN 1995  var4 8804.565  NA 12750.31 
47795  US 2012  var5  NA  NA  NA 
30033  GB 1969  var6  NA 29631.362  NA 
25796  FR 1921  var7  NA 14004.520  NA 
39534  NL 1941  var8  NA  NA  NA 
42255  NZ 1969  var8  NA  NA  NA 
7249  CN 1995  var9 50635.862  NA 75260.56 

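What I want to do is essentially a long-to-wide transformation with Indicator as the key variable. I would normally use spread() from the tidyr package. Unfortunately, spread() does not accept multiple value columns (in this case a, b and c), and it does not quite do what I want to achieve: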

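  1. Turn the Indicator entries into new columns
  2. Keep the country/year combinations as rows
  3. Create a separate row for each value coming from a, b and c
  4. Create a dummy variable for each "old" value column name (i.e. one each for a, b and c)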

So in the end, the observations for China in my example should become

country year var1 [...] var4  [...] var9  dummy.a dummy.b dummy.c 
CN  1995 NA   8804.565   50635.862  1  0  0 
CN  1995 NA   12750.31   75260.56   0  0  1 

Since my original data frame is 58,162 x 119, I would appreciate something that does not involve a lot of manual work :-)

I hope I have made clear what I want to achieve. Thank you for your help!


The data frame shown above can be reproduced with the following code:

structure(list(country = c("US", "FR", "FR", "CN", "US", "GB", 
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L, 
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2", 
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", 
"var11", "var12", "var13", "var14", "var15", "var16", "var17", 
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733, 
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219, 
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA, 
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946 
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L, 
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L, 
7249L), class = "data.frame") 

IMO, this is a pretty bad data format, but you can get it with something like 'library(data.table); melt(setDT(DF, keep.rownames = TRUE), id = c("rn", "country", "year", "Indicator"))[!is.na(value), dcast(.SD, country + year + variable ~ Indicator)][, dcast(.SD, ... ~ variable, value.var = "variable", fun = length)]' – Frank
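
For readability, the one-liner in the comment above can be unrolled into separate steps. This is only a sketch on a four-row toy data frame whose values are copied from the question. The names DF, long, wide and res are illustrative, it assumes the data.table package is installed, and it has not been tried on the full 58,162-row data.

```r
library(data.table)

# Toy subset of the question's data (illustrative, not the full data set)
DF <- data.frame(
  country   = c("CN", "CN", "FR", "US"),
  year      = c(1995L, 1995L, 1988L, 2003L),
  Indicator = c("var4", "var9", "var2", "var1"),
  a = c(8804.565, 50635.862, NA, NA),
  b = c(NA, NA, 10664.920, NA),
  c = c(12750.31, 75260.56, NA, NA)
)

# Step 1: stack a, b, c into long form (columns `variable` and `value`)
# and drop the empty cells
long <- melt(as.data.table(DF),
             id.vars = c("country", "year", "Indicator"),
             measure.vars = c("a", "b", "c"))
long <- long[!is.na(value)]

# Step 2: one column per Indicator, one row per country/year/source column
wide <- dcast(long, country + year + variable ~ Indicator, value.var = "value")

# Step 3: turn the source column into 0/1 dummies by counting occurrences
res <- dcast(wide, ... ~ variable, value.var = "variable", fun.aggregate = length)
print(res)
```

The dummy columns come out named a, b and c here; they can be renamed afterwards if the dummy.a style from the question is preferred.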

I think your expected output is not correct given the input. For example, Var4 for '1983' should be 8804.565 and 12750.306 – akrun

The data set you provided with 'dput' differs from your example. For example, is row 4 from 1983 or 1995? – www

Answers

2

Here is my solution:

library(tidyr) 
library(dplyr)  # provides filter() and mutate() used below
mydf <- structure(list(country = c("US", "FR", "FR", "CN", "US", "GB", 
    "FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L, 
    2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2", 
    "var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", 
    "var11", "var12", "var13", "var14", "var15", "var16", "var17", 
    "var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733, 
    NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219, 
    5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA, 
    NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946 
    )), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L, 
    16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L, 
    7249L), class = "data.frame") 

mydf %>% 
    gather(key = newIndicator, value = values, a, b, c) %>% 
    filter(!is.na(values)) %>% 
    spread(key = Indicator, value = values) %>% 
    mutate(indicatorValues = 1) %>% 
    spread(newIndicator, indicatorValues, fill = 0) 

Output

# country year  var2  var3  var4  var6  var7  var9 a b c 
# 1  CN 1995  NA  NA 8804.565  NA  NA 50635.86 1 0 0 
# 2  CN 1995  NA  NA 12750.306  NA  NA 75260.56 0 0 1 
# 3  FR 1921  NA  NA  NA  NA 14004.52  NA 0 1 0 
# 4  FR 1943  NA 5774.334  NA  NA  NA  NA 0 1 0 
# 5  FR 1988 10664.92  NA  NA  NA  NA  NA 0 1 0 
# 6  GB 1969  NA  NA  NA 29631.36  NA  NA 0 1 0 
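
As of tidyr 1.0.0, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same idea could be sketched with the newer functions roughly as follows. This is an untested-at-scale sketch on a toy subset of the question's data; mydf_small, source, flag and result are illustrative names that do not appear in the original answer.

```r
library(dplyr)
library(tidyr)

# Toy subset of the question's data (illustrative)
mydf_small <- data.frame(
  country   = c("CN", "CN", "FR", "US"),
  year      = c(1995L, 1995L, 1988L, 2003L),
  Indicator = c("var4", "var9", "var2", "var1"),
  a = c(8804.565, 50635.862, NA, NA),
  b = c(NA, NA, 10664.920, NA),
  c = c(12750.31, 75260.56, NA, NA)
)

result <- mydf_small %>%
  # stack a, b, c; remember which column each value came from
  pivot_longer(c(a, b, c), names_to = "source", values_to = "value",
               values_drop_na = TRUE) %>%
  # one column per Indicator, one row per country/year/source
  pivot_wider(names_from = Indicator, values_from = value) %>%
  # spread the source column into 0/1 dummy columns
  mutate(flag = 1) %>%
  pivot_wider(names_from = source, values_from = flag,
              values_fill = list(flag = 0), names_prefix = "dummy.")

print(result)
```

A dummy column is only created for a source column that has at least one non-missing value, so a column of a, b, c that is entirely NA would not get a dummy in this sketch.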
0

dt is your original data; dt2 is the final output.

library(dplyr)  # filter(), mutate(), rename()
library(tidyr)  # gather(), spread()

dt2 <- dt %>% 
    gather(Parameter, Value, a:c) %>% 
    spread(Indicator, Value) %>% 
    mutate(Data = ifelse(rowSums(is.na(.[, paste0("var", 1:9)])) != 9, 1, 0)) %>% 
    filter(Data != 0) %>% 
    spread(Parameter, Data, fill = 0) %>% 
    rename(dummy.a = a, dummy.b = b, dummy.c = c)
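
The code above hard-codes the nine indicator columns (var1 to var9). For a larger data set, one possible variation is to derive the column names from the data instead. The sketch below assumes that every value of Indicator becomes a column after spread(); ind_cols and the toy dt are illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Toy subset of the question's data (illustrative)
dt <- data.frame(
  country   = c("CN", "CN", "FR", "US"),
  year      = c(1995L, 1995L, 1988L, 2003L),
  Indicator = c("var4", "var9", "var2", "var1"),
  a = c(8804.565, 50635.862, NA, NA),
  b = c(NA, NA, 10664.920, NA),
  c = c(12750.31, 75260.56, NA, NA),
  stringsAsFactors = FALSE
)

# Indicator values that will become columns after spread()
ind_cols <- as.character(unique(dt$Indicator))

dt2 <- dt %>%
  gather(Parameter, Value, a:c) %>%
  spread(Indicator, Value) %>%
  # keep only rows with at least one non-missing indicator value
  filter(rowSums(!is.na(.[, ind_cols, drop = FALSE])) > 0) %>%
  mutate(Data = 1) %>%
  spread(Parameter, Data, fill = 0) %>%
  rename(dummy.a = a, dummy.b = b, dummy.c = c)

print(dt2)
```

The rename() step still assumes that each of a, b and c has at least one non-missing value; otherwise the corresponding column would not exist after the second spread().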