How to convert a mixed-type matrix to a DataFrame in Julia so that column types are recognized

A nice feature of DataFrames is that it can store columns of different types and can "automatically recognize" them, for example:

using DataFrames, DataStructures 

df1 = wsv""" 
parName region forType    value 
vol  AL  broadL_highF  3.3055628012 
vol  AL  con_highF   2.1360975151 
vol  AQ  broadL_highF  5.81984502 
vol  AQ  con_highF   8.1462998309 
""" 
typeof(df1[:parName]) 
DataArrays.DataArray{String,1} 
typeof(df1[:value]) 
DataArrays.DataArray{Float64,1}

However, when I try to reach the same result starting from a matrix (imported from a spreadsheet), I "lose" that automatic conversion:

dataMatrix = [ 
    "parName" "region" "forType"  "value"; 
    "vol"  "AL"  "broadL_highF" 3.3055628012; 
    "vol"  "AL"  "con_highF"  2.1360975151; 
    "vol"  "AQ"  "broadL_highF" 5.81984502; 
    "vol"  "AQ"  "con_highF"  8.1462998309; 
] 
h = [Symbol(c) for c in dataMatrix[1,:]] 
vals = dataMatrix[2:end, :] 
df2 = convert(DataFrame,OrderedDict(zip(h,[vals[:,i] for i in 1:size(vals,2)]))) 

typeof(df2[:parName]) 
DataArrays.DataArray{Any,1} 
typeof(df2[:value]) 
DataArrays.DataArray{Any,1}

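There are several questions on SO about converting a matrix to a DataFrame (e.g. DataFrame from Array with Header, Convert Julia array to dataframe), but none of the answers deals with converting a mixed-type matrix.

How can I create a DataFrame from a matrix while automatically recognizing the column types?

I compared three solutions: (1) convert to a df (using the dictionary or the matrix constructor; the first is faster) and then apply try-catch for the type conversion (my original answer); (2) convert to a string and then use DataFrames.inlinetable (Dan Getz's answer); (3) check the type of each element and its consistency within the column (Alexander Morley's answer).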

The results are:

# timed on the second call, after compilation; further calls give similar results
@time toDf1(m) # 0.000946 seconds (336 allocations: 19.811 KiB) 
@time toDf2(m) # 0.000194 seconds (306 allocations: 17.406 KiB) 
@time toDf3(m) # 0.001820 seconds (445 allocations: 35.297 KiB)

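So, crazy as it is, the most efficient solution seems to be to "pour out the water" and reduce the problem to one that has already been solved ;-)

Thanks for all the answers.

Source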

2017-09-29 Antonello

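Why don't you just save the spreadsheet as a CSV file and load it with CSV.read()? That should take care of it.

@MichaelK.Borregaard, because I have a model that loads all its settings and data from multiple worksheets, and I want to avoid exporting them all to CSV every time I make a change. – Antonello

Another approach is to reuse a working solution, i.e. convert the matrix into a string in a form that DataFrames can consume. In code, this is: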

using DataFrames 

dataMatrix = [ 
    "parName" "region" "forType"  "value"; 
    "vol"  "AL"  "broadL_highF" 3.3055628012; 
    "vol"  "AL"  "con_highF"  2.1360975151; 
    "vol"  "AQ"  "broadL_highF" 5.81984502; 
    "vol"  "AQ"  "con_highF"  8.1462998309; 
] 

s = join(
    [join([dataMatrix[i,j] for j in indices(dataMatrix, 2)] 
    , '\t') for i in indices(dataMatrix, 1)], '\n') 

df = DataFrames.inlinetable(s; separator='\t', header=true)

The resulting df has its column types guessed by DataFrames.

Unrelated, but this answer reminds me of the "how a mathematician boils water" joke.

Source

2017-09-29 15:11:38

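I was trying to use `writecsv` to write to an `IOBuffer` and then `readtable`, but it kept defeating me. This is reminiscent of that, but I think cleaner.

@MichaelK.Borregaard Writing to an IOBuffer was also the first version of this solution (but getting rid of the trailing tab annoyed me, so I edited it into this version).

Ah, unrelated, but how did you get `readtable` to accept `IOBuffer` input? I can do `a = IOBuffer(); writecsv(a, dataMatrix); readcsv(take!(a), DataFrame)`, but I couldn't get it to work with `CSV.read` or `readtable`.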


Although I haven't found a complete solution, a partial one is to try converting each column after the fact:

""" 
    convertDf!(df) 

Try to convert each column of the converted df from Any to Int64, Float64 or String (in that order).
""" 
function convertDf!(df) 
    for c in names(df) 
     try 
      df[c] = convert(DataArrays.DataArray{Int64,1},df[c]) 
     catch 
      try 
       df[c] = convert(DataArrays.DataArray{Float64,1},df[c]) 
      catch 
       try 
        df[c] = convert(DataArrays.DataArray{String,1},df[c]) 
       catch 
       end 
      end 
     end 
    end 
end

Although it is certainly incomplete, it is enough for my needs.
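The try-in-order idea can also be tried on plain vectors, without any DataFrames or DataArrays dependency. The following is only a minimal sketch under that assumption: `narrow_by_trial` is a hypothetical helper name, not something from the thread or from a library, and it relies only on Base's `convert`.

```julia
# Hypothetical helper (not from the thread): try Int64, then Float64, then
# String, and return the column unchanged if none of the conversions works.
function narrow_by_trial(col::AbstractVector)
    for T in (Int64, Float64, String)
        try
            return convert(Vector{T}, col)
        catch
            # conversion failed; move on to the next candidate type
        end
    end
    return col
end

names_col  = Any["vol", "vol"]
values_col = Any[3.3055628012, 2.1360975151]
counts_col = Any[1, 2, 3]

narrow_by_trial(names_col)   # Vector{String}
narrow_by_trial(values_col)  # Vector{Float64}
narrow_by_trial(counts_col)  # Vector{Int64}
```

One caveat of trying Int64 first: a column of whole-valued floats such as `Any[1.0, 2.0]` converts successfully to Int64, which may or may not be what is wanted.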

Source

2017-09-29 12:16:17 Antonello

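Sorry, but this is horrible :joy:

@MichaelK.Borregaard OK.. if you have a better solution... ;-) – Antonello

@MichaelK.Borregaard ..also, if you tell me what is horrible, or why, I will learn ;-) – Antonello

While I think there may be a better way to go about the whole thing, this should do what you want.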

df = DataFrame() 
for (ind,s) in enumerate(Symbol.(dataMatrix[1,:])) # convert first row to symbols and iterate through them. 
    # check all types the same else assign to Any 
    T = typeof(dataMatrix[2,ind]) 
    T = all(typeof.(dataMatrix[2:end,ind]).==T) ? T : Any 
    # convert to type of second element then add to data frame 
    df[s] = T.(dataMatrix[2:end,ind]) 
end
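The same check (all elements of a column share one type, otherwise fall back to Any) can be sketched on plain vectors. This is an illustration rather than the answer's exact code: `narrow_by_check` is a hypothetical name, it uses the `Vector{T}(col)` constructor instead of broadcasting `T.(...)`, and the small two-column matrix is a cut-down version of the one in the question.

```julia
# Hypothetical helper: keep the common concrete type if every element
# shares it, otherwise fall back to a Vector{Any}.
function narrow_by_check(col::AbstractVector)
    isempty(col) && return Vector{Any}(col)
    T = typeof(first(col))
    return all(x -> typeof(x) === T, col) ? Vector{T}(col) : Vector{Any}(col)
end

dataMatrix = ["parName" "value"; "vol" 3.3055628012; "vol" 2.1360975151]

# First row holds the column names; the remaining rows hold the data.
header  = Symbol.(dataMatrix[1, :])
columns = [narrow_by_check(dataMatrix[2:end, j]) for j in axes(dataMatrix, 2)]

eltype.(columns)  # String for the first column, Float64 for the second
```

Note that this check is strict: a column mixing Int64 and Float64 values stays as Any instead of being promoted to Float64.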

Source

2017-09-29 14:37:15

mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))

This seems to work, and is faster than @dan-getz's answer (at least on this data matrix) :)
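The reason the one-liner recovers column types is, as far as I can tell, the splat inside the array literal: `[col...]` rebuilds the vector from its elements, so Julia picks the element type from the values and not from the `Any` container. A small Base-only sketch of that behaviour (the variable names are just for illustration):

```julia
strings   = Any["vol", "vol"]
floats    = Any[3.3055628012, 2.1360975151]
mixed     = Any[1, 2.5]
unrelated = Any["vol", 1]

eltype([strings...])    # String
eltype([floats...])     # Float64
eltype([mixed...])      # Float64 (the Int is promoted: [1.0, 2.5])
eltype([unrelated...])  # Any (no common type to promote to)

# Broadcasting identity is another way to narrow a homogeneous column.
eltype(identity.(floats))  # Float64
```

Unlike a strict same-type check, the splat promotes mixed numeric columns. Splatting a very long column passes every element as a separate argument, so for large data it may be worth measuring against `identity.(col)`.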

using DataFrames, BenchmarkTools 

dataMatrix = [ 
    "parName" "region" "forType"  "value"; 
    "vol"  "AL"  "broadL_highF" 3.3055628012; 
    "vol"  "AL"  "con_highF"  2.1360975151; 
    "vol"  "AQ"  "broadL_highF" 5.81984502; 
    "vol"  "AQ"  "con_highF"  8.1462998309; 
] 

mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:])) 

function mat2dfDan(mat) 
    # build the tab-separated text from the argument, not the global dataMatrix 
    s = join([join([mat[i,j] for j in indices(mat, 2)], '\t') 
       for i in indices(mat, 1)],'\n') 

    DataFrames.inlinetable(s; separator='\t', header=true) 
end

Faster:

julia> @benchmark mat2df(dataMatrix) 

BenchmarkTools.Trial: 
    memory estimate: 5.05 KiB 
    allocs estimate: 75 
    -------------- 
    minimum time:  18.601 μs (0.00% GC) 
    median time:  21.318 μs (0.00% GC) 
    mean time:  31.773 μs (2.50% GC) 
    maximum time:  4.287 ms (95.32% GC) 
    -------------- 
    samples:   10000 
    evals/sample:  1 

julia> @benchmark mat2dfDan(dataMatrix) 

BenchmarkTools.Trial: 
    memory estimate: 17.55 KiB 
    allocs estimate: 318 
    -------------- 
    minimum time:  69.183 μs (0.00% GC) 
    median time:  81.326 μs (0.00% GC) 
    mean time:  90.284 μs (2.97% GC) 
    maximum time:  5.565 ms (93.72% GC) 
    -------------- 
    samples:   10000 
    evals/sample:  1

Source

2017-10-05 00:37:44 JobJob
