2015-01-10

How do I group by and build pivot tables with Julia DataFrames?

Let's say I have a DataFrame:

using DataFrames 

df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Class = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
               Address = ["12 Silver", "10 Fak", "12 Silver", "1 North", "10 Fak", "2 Fake", "1 Red", "1 Dog", "2 Fake", "1 White"],
               Score = ["4", "5", "3", "2", "1", "5", "4", "3", "2", "1"])

and I want to do the following:

1) Build a pivot table of Location by Class, which should output

Class  H L M 
Location   
DC  0 0 1 
NY  2 1 0 
SF  1 2 0 
TX  1 2 0 

2) Group by "Location" and count the number of records in each group, which should output

Pop 
DC 1 
NY 3 
SF 3 
TX 3 

Answers


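You can use unstack to get most of the way there (DataFrames don't have indexes, so Class has to remain a column rather than becoming an index as it would in pandas). This seems to be DataFrames.jl's answer to pivot_table: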

julia> unstack(df, :Location, :Class, :Score) 
WARNING: Duplicate entries in unstack. 
4x4 DataFrames.DataFrame 
| Row | Class | H | L | M | 
|-----|-------|-----|-----|-----| 
| 1 | "DC" | NA | NA | "1" | 
| 2 | "NY" | "3" | "2" | NA | 
| 3 | "SF" | "5" | "1" | NA | 
| 4 | "TX" | "4" | "2" | NA | 

I'm not sure how you would do a fillna here (unstack doesn't have that option)...
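For readers on a newer release, here is a minimal sketch of the same pivot with the gaps filled in. It assumes DataFrames.jl 1.x, where unstack accepts a fill keyword and groupby plus combine replaces by; it is not what the transcript above was run on. Only the Location and Class columns are reproduced, and the column name n is chosen purely for illustration.

```julia
using DataFrames

# Same Location/Class data as in the question (Address and Score omitted).
df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Class = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"])

# Count the rows for each (Location, Class) pair; :n is an illustrative name.
counts = combine(groupby(df, [:Location, :Class]), nrow => :n)

# Spread Class into columns; pairs that never occur become 0 instead of missing.
wide = unstack(counts, :Location, :Class, :n; fill = 0)

# Sort by Location so the rows come out in the order the question asks for.
sort!(wide, :Location)
```

Counting first and unstacking second also sidesteps the duplicate-entries warning, because each (Location, Class) pair appears only once in counts.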

For the group by, you can use by with the nrow (number of rows) method:

julia> by(df, :Location, nrow) 
4x2 DataFrames.DataFrame 
| Row | Location | x1 | 
|-----|----------|----| 
| 1 | "DC"  | 1 | 
| 2 | "NY"  | 3 | 
| 3 | "SF"  | 3 | 
| 4 | "TX"  | 3 | 
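The by function was removed in DataFrames.jl 1.0. A rough equivalent for newer versions, assuming DataFrames.jl 1.x, is groupby followed by combine. The output column name Pop is taken from the question; the explicit sort is added here only to make the row order predictable.

```julia
using DataFrames

# Only the grouping column is needed to count records per location.
df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"])

# nrow => :Pop counts the rows in each group and names the resulting column Pop.
pop = combine(groupby(df, :Location), nrow => :Pop)

# Sort by Location to match the order shown in the question.
sort!(pop, :Location)
```

Naming the column through the pair nrow => :Pop avoids the auto-generated x1 header seen in the older output.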

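(1) Here is my attempt at creating a pivot table. I use by() to group by one column, then count the frequency of each factor of the second column inside the function.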

# Create pivot table from DataFrame. 
# - df : DataFrame object 
# - column1 : Column symbol used for row labels. 
# - column2 : Column symbol used for column labels. 
function pivot_table(df, column1, column2) 
    # For given DataArray and factor list, create single row DataFrame: 
    # ---------------------------------------- 
    # | factor1 | factor2 | ... 
    # ---------------------------------------- 
    # |freq of factor1|freq of factor1| ... 
    # ---------------------------------------- 
    function frequency(data, factors) 
        # Convert factors to symbols.
        factor_symbols::Vector{Symbol} = map(factor -> symbol(factor), factors)

        # Convert frequency to fit the DataFrame constructor parameter type.
        frequencies::Vector{Any} = map(frequency -> [frequency], map(factor -> sum(data .== factor), factors))

        DataFrame(frequencies, factor_symbols)
    end 

    factors = sort(unique(df[column2])) 
    by(df, column1, x -> frequency(x[column2], factors)) 
end 

Example:

julia> pivot_table(df, :Location, :Class) 
4x4 DataFrames.DataFrame 
| Row | Location | H | L | M | 
|-----|----------|---|---|---| 
| 1 | "DC"  | 0 | 0 | 1 | 
| 2 | "NY"  | 2 | 1 | 0 | 
| 3 | "SF"  | 1 | 2 | 0 | 
| 4 | "TX"  | 1 | 2 | 0 | 

(2) You can use by together with nrow.

julia> by(df, :Location, nrow) 
4x2 DataFrames.DataFrame 
| Row | Location | x1 | 
|-----|----------|----| 
| 1 | "DC"  | 1 | 
| 2 | "NY"  | 3 | 
| 3 | "SF"  | 3 | 
| 4 | "TX"  | 3 | 

For part 2 of your question, you can use an anonymous function that returns a DataFrame in order to name the new column, for example count:

julia> by(df, :Location, d -> DataFrame(count=nrow(d))) 
4x2 DataFrames.DataFrame 
| Row | Location | count | 
|-----|----------|-------| 
| 1 | "DC"  | 1  | 
| 2 | "NY"  | 3  | 
| 3 | "SF"  | 3  | 
| 4 | "TX"  | 3  | 

The FreqTables.jl package solves this as follows:

julia> using FreqTables
julia> show(freqtable(df, :Location, :Class))

4×3 Named Array{Int64,2} 
Location ╲ Class │ H L M 
─────────────────┼──────── 
DC    │ 0 0 1 
NY    │ 2 1 0 
SF    │ 1 2 0 
TX    │ 1 2 0 

Using the pivot(df, rowFields, colField, valuesField; <keyword arguments>) function developed in this SO question, you can do:

julia> df =DataFrame(Location = [ "NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"], 
         Class = ["H","L","H","L","L","H", "H","L","L","M"], 
         Address = ["12 Silver","10 Fak","12 Silver","1 North","10 Fak","2 Fake", "1 Red","1 Dog","2 Fake","1 White"], 
         Score = ["4","5","3","2","1","5","4","3","2","1"]) 

First question:

julia> df_piv = pivot(df,[:Location],:Class,:Score,ops=length) 
julia> [df_piv[isna(df_piv[i]), i] = 0 for i in names(df_piv)] # remove NA values across whole df 
julia> df_piv 
4×4 DataFrames.DataFrame 
│ Row │ Location │ H │ L │ M │ 
├─────┼──────────┼───┼───┼───┤ 
│ 1 │ "DC"  │ 0 │ 0 │ 1 │ 
│ 2 │ "NY"  │ 2 │ 1 │ 0 │ 
│ 3 │ "SF"  │ 1 │ 2 │ 0 │ 
│ 4 │ "TX"  │ 1 │ 2 │ 0 │ 

Second question:

julia> df[:pop]="Pop" # add a dummy column with constant values 
julia> pivot(df,[:Location],:pop,:Score,ops=length) 
4×2 DataFrames.DataFrame 
│ Row │ Location │ Pop │ 
├─────┼──────────┼─────┤ 
│ 1 │ "DC"  │ 1 │ 
│ 2 │ "NY"  │ 3 │ 
│ 3 │ "SF"  │ 3 │ 
│ 4 │ "TX"  │ 3 │