Pandas DataFrame calculation

I have a fairly complex DataFrame that looks like this:

df = pd.DataFrame({'0': {('Total Number of End Points', '0.01um', '0hr'): 12, 
    ('Total Number of End Points', '0.1um', '0hr'): 8, 
    ('Total Number of End Points', 'Control', '0hr'): 4, 
    ('Total Number of End Points', '0.01um', '24hr'): 18, 
    ('Total Number of End Points', '0.1um', '24hr'): 12, 
    ('Total Number of End Points', 'Control', '24hr'): 6, 
    ('Total Vessel Length', '0.01um', '0hr'): 12, 
    ('Total Vessel Length', '0.1um', '0hr'): 8, 
    ('Total Vessel Length', 'Control', '0hr'): 4, 
    ('Total Vessel Length', '0.01um', '24hr'): 18, 
    ('Total Vessel Length', '0.1um', '24hr'): 12, 
    ('Total Vessel Length', 'Control', '24hr'): 6}, 
    '1': {('Total Number of End Points', '0.01um', '0hr'): 12, 
    ('Total Number of End Points', '0.1um', '0hr'): 8, 
    ('Total Number of End Points', 'Control', '0hr'): 4, 
    ('Total Number of End Points', '0.01um', '24hr'): 18, 
    ('Total Number of End Points', '0.1um', '24hr'): 12, 
    ('Total Number of End Points', 'Control', '24hr'): 6, 
    ('Total Vessel Length', '0.01um', '0hr'): 12, 
    ('Total Vessel Length', '0.1um', '0hr'): 8, 
    ('Total Vessel Length', 'Control', '0hr'): 4, 
    ('Total Vessel Length', '0.01um', '24hr'): 18, 
    ('Total Vessel Length', '0.1um', '24hr'): 12, 
    ('Total Vessel Length', 'Control', '24hr'): 6}, 
    '2': {('Total Number of End Points', '0.01um', '0hr'): 12, 
    ('Total Number of End Points', '0.1um', '0hr'): 8, 
    ('Total Number of End Points', 'Control', '0hr'): 4, 
    ('Total Number of End Points', '0.01um', '24hr'): 18, 
    ('Total Number of End Points', '0.1um', '24hr'): 12, 
    ('Total Number of End Points', 'Control', '24hr'): 6, 
    ('Total Vessel Length', '0.01um', '0hr'): 12, 
    ('Total Vessel Length', '0.1um', '0hr'): 8, 
    ('Total Vessel Length', 'Control', '0hr'): 4, 
    ('Total Vessel Length', '0.01um', '24hr'): 18, 
    ('Total Vessel Length', '0.1um', '24hr'): 12, 
    ('Total Vessel Length', 'Control', '24hr'): 6}}) 

print(df) 
                                          0   1   2
Total Number of End Points 0.01um  0hr   12  12  12
                                   24hr  18  18  18
                           0.1um   0hr    8   8   8
                                   24hr  12  12  12
                           Control 0hr    4   4   4
                                   24hr   6   6   6
Total Vessel Length        0.01um  0hr   12  12  12
                                   24hr  18  18  18
                           0.1um   0hr    8   8   8
                                   24hr  12  12  12
                           Control 0hr    4   4   4
                                   24hr   6   6   6

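I'm trying to divide each value by the mean, taken across the columns, of the corresponding Control row. I tried the following, but it didn't work.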

df2 = df.divide(df.xs('Control', level=1).mean(axis=1), axis='index')
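One way to see why this attempt runs into trouble, sketched on a frame rebuilt with `pd.MultiIndex.from_product` (the construction and the name `control_mean` are illustrative assumptions, not part of the original code): the Control mean is indexed by only the two remaining levels, while `df` has three, so the rows do not line up one-to-one.

```python
import pandas as pd

# Rebuild the example frame; the layout mirrors the data in the question.
index = pd.MultiIndex.from_product(
    [
        ["Total Number of End Points", "Total Vessel Length"],
        ["0.01um", "0.1um", "Control"],
        ["0hr", "24hr"],
    ]
)
values = [12, 18, 8, 12, 4, 6] * 2
df = pd.DataFrame({col: values for col in ["0", "1", "2"]}, index=index)

# xs drops the selected level, so only levels 0 and 2 remain in the index.
control_mean = df.xs("Control", level=1).mean(axis=1)

print(control_mean.index.nlevels, len(control_mean))  # 2 levels, 4 rows
print(df.index.nlevels, len(df))                      # 3 levels, 12 rows
```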

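I'm very new to Python and pandas, so I tend to think about this problem in MS Excel terms.

If this were in Excel, the formula for A1 ('Total Number of End Points', '0.01um', '0hr', 0) would be: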

=A1/AVERAGE($A$5:$C$5)

B1 ('Total Number of End Points', '0.01um', '0hr', 1) would be:

=B1/AVERAGE($A$5:$C$5)

and A2 ('Total Number of End Points', '0.01um', '24hr', 0) would be:

=A2/AVERAGE($A$6:$C$6)

The desired result for this example would be:

                                          0   1   2
Total Number of End Points 0.01um  0hr    3   3   3
                                   24hr   3   3   3
                           0.1um   0hr    2   2   2
                                   24hr   2   2   2
                           Control 0hr    1   1   1
                                   24hr   1   1   1
Total Vessel Length        0.01um  0hr    3   3   3
                                   24hr   3   3   3
                           0.1um   0hr    2   2   2
                                   24hr   2   2   2
                           Control 0hr    1   1   1
                                   24hr   1   1   1

Note: the real data has many more index entries and columns.

Source

2015-04-17 agf1997

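Can you provide an example of the desired output? – Andrew

When I put your data into a DataFrame, it differs from what you show in print(df). The df = ... statement and the print(df) output are two different DataFrames, so your print(df) has nothing to do with the code above. Your input columns are ['a', 'b'], but your printed columns are [0, 1, 2]. Could you make everything consistent? Thanks. –

@MarkGraph Oops.. you're right.. I'll fix it. – agf1997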

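The problem here is that pandas is organized to make calculations down columns easy, while this problem needs the mean of one row applied to other rows, which is a less natural fit for pandas.

However, you can easily swap rows and columns with the transpose, .T, and then the data may be easier to handle. In fact, the control mean becomes a one-liner.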

>>> df.T[(u'Total Vessel Length', u'Control', u'0hr')].mean() 
4.0

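This 4.0 comes from the two values of 4 in the original data (the output below is from the earlier version of the question, whose columns were a and b):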

>>> df.T[(u'Total Vessel Length', u'Control', u'0hr')] 
a 4 
b 4

At this point, it looks like a for loop will take care of the problem.

Untested:

for primary in (u'Total Vessel Length', u'Total Number of End Points'):
    for um in (u'0.01um', u'0.1um'):
        for hours in (u'0hr', u'24hr'):
            df.T[(primary, um, hours)] = df.T[(primary, um, hours)] / df.T[(primary, u'Control', hours)].mean()

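Note that this does not divide the Control rows themselves, but it would be easy to add 'Control' to the um loop.

UPDATE: This does not work; somehow it does not modify the DataFrame. Right now, I don't know why.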
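A likely explanation, offered as a sketch rather than a confirmed diagnosis: each `df.T` builds a new transposed DataFrame, so the assignment in the loop lands on a temporary object that is discarded, and `df` is never touched. Writing through `.loc` into a frame that is kept avoids this. The frame below is rebuilt with `pd.MultiIndex.from_product`, and the names `result` and `control_mean` are illustrative, not from the original post.

```python
import pandas as pd

# Rebuild the example frame; the layout mirrors the data in the question.
index = pd.MultiIndex.from_product(
    [
        ["Total Number of End Points", "Total Vessel Length"],
        ["0.01um", "0.1um", "Control"],
        ["0hr", "24hr"],
    ]
)
values = [12, 18, 8, 12, 4, 6] * 2
df = pd.DataFrame({col: values for col in ["0", "1", "2"]}, index=index)

# Write into a float copy so the original values stay available for the means.
result = df.astype(float)
for primary in index.levels[0]:
    for hours in index.levels[2]:
        # Mean of the matching Control row, taken across the columns.
        control_mean = df.loc[(primary, "Control", hours), :].mean()
        for um in index.levels[1]:
            result.loc[(primary, um, hours), :] = (
                df.loc[(primary, um, hours), :] / control_mean
            )

print(result)
```

Unlike the loop above, this version also covers the Control rows, which therefore become 1.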

You can, however, construct a new DataFrame by passing a dict comprehension to pd.DataFrame.

This seems to work...

import pandas as pd 

df = pd.DataFrame({'0': {('Total Number of End Points', '0.01um', '0hr'): 12, 
    ('Total Number of End Points', '0.1um', '0hr'): 8, 
    ('Total Number of End Points', 'Control', '0hr'): 4, 
    ('Total Number of End Points', '0.01um', '24hr'): 18, 
    ('Total Number of End Points', '0.1um', '24hr'): 12, 
    ('Total Number of End Points', 'Control', '24hr'): 6, 
    ('Total Vessel Length', '0.01um', '0hr'): 12, 
    ('Total Vessel Length', '0.1um', '0hr'): 8, 
    ('Total Vessel Length', 'Control', '0hr'): 4, 
    ('Total Vessel Length', '0.01um', '24hr'): 18, 
    ('Total Vessel Length', '0.1um', '24hr'): 12, 
    ('Total Vessel Length', 'Control', '24hr'): 6}, 
    '1': {('Total Number of End Points', '0.01um', '0hr'): 12, 
    ('Total Number of End Points', '0.1um', '0hr'): 8, 
    ('Total Number of End Points', 'Control', '0hr'): 4, 
    ('Total Number of End Points', '0.01um', '24hr'): 18, 
    ('Total Number of End Points', '0.1um', '24hr'): 12, 
    ('Total Number of End Points', 'Control', '24hr'): 6, 
    ('Total Vessel Length', '0.01um', '0hr'): 12, 
    ('Total Vessel Length', '0.1um', '0hr'): 8, 
    ('Total Vessel Length', 'Control', '0hr'): 4, 
    ('Total Vessel Length', '0.01um', '24hr'): 18, 
    ('Total Vessel Length', '0.1um', '24hr'): 12, 
    ('Total Vessel Length', 'Control', '24hr'): 6}, 
    '2': {('Total Number of End Points', '0.01um', '0hr'): 12, 
    ('Total Number of End Points', '0.1um', '0hr'): 8, 
    ('Total Number of End Points', 'Control', '0hr'): 4, 
    ('Total Number of End Points', '0.01um', '24hr'): 18, 
    ('Total Number of End Points', '0.1um', '24hr'): 12, 
    ('Total Number of End Points', 'Control', '24hr'): 6, 
    ('Total Vessel Length', '0.01um', '0hr'): 12, 
    ('Total Vessel Length', '0.1um', '0hr'): 8, 
    ('Total Vessel Length', 'Control', '0hr'): 4, 
    ('Total Vessel Length', '0.01um', '24hr'): 18, 
    ('Total Vessel Length', '0.1um', '24hr'): 12, 
    ('Total Vessel Length', 'Control', '24hr'): 6}}) 

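print(df)

# Build a new frame: one column per non-Control (field, type, time) row,
# each divided by the mean of the matching Control row.
df2 = pd.DataFrame({
    (primary, um, hours): df.T[(primary, um, hours)] / df.T[(primary, u'Control', hours)].mean()
    for primary in (u'Total Vessel Length', u'Total Number of End Points')
    for um in (u'0.01um', u'0.1um')
    for hours in (u'0hr', u'24hr')
})

print(df2.T)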

Output

~/SO$ python ./r.py
                                              0   1   2
(Total Number of End Points, 0.01um, 0hr)    12  12  12
(Total Number of End Points, 0.01um, 24hr)   18  18  18
(Total Number of End Points, 0.1um, 0hr)      8   8   8
(Total Number of End Points, 0.1um, 24hr)    12  12  12
(Total Number of End Points, Control, 0hr)    4   4   4
(Total Number of End Points, Control, 24hr)   6   6   6
(Total Vessel Length, 0.01um, 0hr)           12  12  12
(Total Vessel Length, 0.01um, 24hr)          18  18  18
(Total Vessel Length, 0.1um, 0hr)             8   8   8
(Total Vessel Length, 0.1um, 24hr)           12  12  12
(Total Vessel Length, Control, 0hr)           4   4   4
(Total Vessel Length, Control, 24hr)          6   6   6

[12 rows x 3 columns]
                                              0   1   2
(Total Number of End Points, 0.01um, 0hr)     3   3   3
(Total Number of End Points, 0.01um, 24hr)    3   3   3
(Total Number of End Points, 0.1um, 0hr)      2   2   2
(Total Number of End Points, 0.1um, 24hr)     2   2   2
(Total Vessel Length, 0.01um, 0hr)            3   3   3
(Total Vessel Length, 0.01um, 24hr)           3   3   3
(Total Vessel Length, 0.1um, 0hr)             2   2   2
(Total Vessel Length, 0.1um, 24hr)            2   2   2

[8 rows x 3 columns]

Source

2015-04-18 00:49:33 Paul

I'm getting the same result out as I put in. Is an 'inplace=True' needed somewhere? – agf1997

Same here. It seems familiar. I'll look around. – Paul

Maybe related. Still looking. http://stackoverflow.com/questions/17995328/changing-values-in-pandas-dataframe-doenst-work – Paul

It helps to have the Control values in their own column. You can do that using unstack:

df.index.names = ['field', 'type', 'time'] 
df2 = df.unstack(['type']).swaplevel(0, 1, axis=1) 

# type                             0.01um  0.1um  Control  0.01um  0.1um  Control  \
#                                       0      0        0       1      1        1
# field                      time
# Total Number of End Points 0hr       12      8        4      12      8        4
#                            24hr      18     12        6      18     12        6
# Total Vessel Length        0hr       12      8        4      12      8        4
#                            24hr      18     12        6      18     12        6

# type                             0.01um  0.1um  Control
#                                       2      2        2
# field                      time
# Total Number of End Points 0hr       12      8        4
#                            24hr      18     12        6
# Total Vessel Length        0hr       12      8        4
#                            24hr      18     12        6

Now find the mean of each Control row:

ave = df2['Control'].mean(axis=1) 
# field                      time
# Total Number of End Points 0hr     4
#                            24hr    6
# Total Vessel Length        0hr     4
#                            24hr    6
# dtype: float64

As you expected, you can use df2.divide to compute the desired result. Be sure to use axis=0 to tell pandas to match values (between df2 and ave) by the row index.

result = df2.divide(ave, axis=0) 
# type                             0.01um  0.1um  Control  0.01um  0.1um  Control  \
#                                       0      0        0       1      1        1
# field                      time
# Total Number of End Points 0hr        3      2        1       3      2        1
#                            24hr       3      2        1       3      2        1
# Total Vessel Length        0hr        3      2        1       3      2        1
#                            24hr       3      2        1       3      2        1

# type                             0.01um  0.1um  Control
#                                       2      2        2
# field                      time
# Total Number of End Points 0hr        3      2        1
#                            24hr       3      2        1
# Total Vessel Length        0hr        3      2        1
#                            24hr       3      2        1

That basically has the values you are after. But if you want to rearrange the DataFrame so it looks exactly like the one you posted, then:

result = result.stack(['type']) 
result = result.reorder_levels(['field','type','time'], axis=0) 
result = result.reindex(df.index)

yields

                                          0   1   2
field                      type    time
Total Number of End Points 0.01um  0hr    3   3   3
                                   24hr   3   3   3
                           0.1um   0hr    2   2   2
                                   24hr   2   2   2
                           Control 0hr    1   1   1
                                   24hr   1   1   1
Total Vessel Length        0.01um  0hr    3   3   3
                                   24hr   3   3   3
                           0.1um   0hr    2   2   2
                                   24hr   2   2   2
                           Control 0hr    1   1   1
                                   24hr   1   1   1

Putting it all together:

df.index.names = ['field', 'type', 'time'] 
df2 = df.unstack(['type']).swaplevel(0, 1, axis=1) 
ave = df2['Control'].mean(axis=1) 
result = df2.divide(ave, axis=0) 
result = result.stack(['type']) 
result = result.reorder_levels(['field','type','time'], axis=0) 
result = result.reindex(df.index)
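For comparison, here is a minimal sketch of an alternative that skips the reshaping. It is not the method used above, only a variation under the same assumptions about the data: take the Control mean per (field, time), then look it up for every row of df and divide row-wise. The frame is rebuilt with `pd.MultiIndex.from_product`, and the names `control_mean`, `keys` and `denominator` are illustrative.

```python
import pandas as pd

# Rebuild the example frame with named index levels.
index = pd.MultiIndex.from_product(
    [
        ["Total Number of End Points", "Total Vessel Length"],
        ["0.01um", "0.1um", "Control"],
        ["0hr", "24hr"],
    ],
    names=["field", "type", "time"],
)
values = [12, 18, 8, 12, 4, 6] * 2
df = pd.DataFrame({col: values for col in ["0", "1", "2"]}, index=index)

# Mean of each Control row across the columns, indexed by (field, time).
control_mean = df.xs("Control", level="type").mean(axis=1)

# For every row of df, pick the Control mean that shares its (field, time).
keys = df.index.droplevel("type")
denominator = control_mean.reindex(keys).to_numpy()

# Divide row-wise; the array has one entry per row of df, in df's row order.
result = df.div(denominator, axis=0)
print(result)
```

Because the denominator is built in the row order of df, the result keeps the original index and needs no stack, reorder_levels or reindex afterwards.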

Source

2015-04-18 01:27:12 unutbu

Interesting. I hadn't noticed that the index could be made of tuples and that it has all these associated methods. – Paul
