2015-06-02 43 views

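Why does this GROUP BY and NA combination produce a character type? Why does y end up as class character? It seems this should not happen from a sqldf SUM, given that the column is numeric before aggregation.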

library(sqldf) 

# three very similar data.frame objects 
x <- structure(list(size = c(1L, 2L), diff = c(1, NA)) , .Names = c("gb","diff"), row.names = 1:2, class = "data.frame") 
y <- structure(list(size = c(1L, 1L, 2L, 2L), diff = c(NA, NA, 1, NA)) , .Names = c("gb","diff"), row.names = 1:4, class = "data.frame") 
z <- structure(list(size = c(2L, 2L, 1L, 1L), diff = c(NA, NA, 1, NA)) , .Names = c("gb","diff"), row.names = 1:4, class = "data.frame") 


# when summed in sqldf: numeric, character, numeric 
sapply(sqldf("select sum(diff) from x"),class) 
sapply(sqldf("select sum(diff) , gb from y group by gb"),class)[1] 
sapply(sqldf("select sum(diff) , gb from z group by gb"),class)[1] 



# this despite both being numeric originally 
class(x$diff) 
class(y$diff) 

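If I remove the 'sum()' and leave just 'diff', the results are as expected (both numeric). I do not know SQL, and so not its sum either. In R, sum is a generic function for the sum of the elements. Maybe R commands inside the 'sqldf' call need to be wrapped in a special function? – SabDeM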


I think this applies: sqldf FAQ 14: https://github.com/ggrothendieck/sqldf#14-how-does-one-read-files-where-numeric-nas-are-represented-as-missing-empty-fields


@BondedDust Thanks! That might be part of the problem. I filed a slightly clarified issue, but I should probably just write a workaround :) https://github.com/ggrothendieck/sqldf/issues/2

Answers

3

Exclude the NAs (i.e., NULLs):

out1 <- sqldf("SELECT SUM(diff) AS diff_sum 
       FROM x 
       WHERE diff IS NOT NULL") 

out2 <- sqldf("SELECT SUM(diff) AS diff_sum, gb 
       FROM y 
       WHERE diff IS NOT NULL 
       GROUP BY gb") 

str(out1) 
# 'data.frame': 1 obs. of 1 variable: 
# $ diff_sum: num 1 
str(out2) 
# 'data.frame': 1 obs. of 2 variables: 
# $ diff_sum: num 1 
# $ gb  : int 2 
1

This is the correct way to avoid this.

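@G.Grothendieck:

sqldf has a heuristic that sets the class of any output column to that of an input column with the same name, so this will work around the problem: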

str(y) 
## 'data.frame': 4 obs. of 2 variables: 
## $ gb : int 1 1 2 2 
## $ diff: num NA NA 1 NA 

out1 <- sqldf("select sum(diff) diff, gb from y group by gb") 
str(out1) 
## 'data.frame': 2 obs. of 2 variables: 
## $ diff: num NA 1 
## $ gb : int 1 2 

out2 <- sqldf("select sum(diff) diff, gb from y group by gb ORDER BY gb desc") 
str(out2) 
## 'data.frame': 2 obs. of 2 variables: 
## $ diff: num 1 NA 
## $ gb : int 2 1 
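
If the summed column is given a new name such as diff_sum, it no longer matches any input column, so the name-matching heuristic described above presumably has nothing to match against. The following is a minimal sketch of one possible workaround, assuming it is acceptable to fix the type on the R side after the query has run. The alias diff_sum and the rebuilt data frame y are only illustrative, and as.numeric() is plain base R, not a sqldf feature.

```r
library(sqldf)

# Same data as y in the question, rebuilt here so the example is self-contained
y <- data.frame(gb = c(1L, 1L, 2L, 2L), diff = c(NA, NA, 1, NA))

# The alias diff_sum matches no column of y, so its class is left to sqldf's
# default type detection (assumption: this may yield character, as in the question)
out <- sqldf("select sum(diff) as diff_sum, gb from y group by gb")

# Coerce explicitly; this leaves the column unchanged if it is already numeric
out$diff_sum <- as.numeric(out$diff_sum)

str(out)
```

Unlike the WHERE diff IS NOT NULL approach, this keeps the all-NA group in the result, with NA as its sum.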

@G.Grothendieck Does this mean we cannot rename the output variable, e.g. to 'diff_sum', and still get it back as numeric? – zx8754
