Custom sorting of a pandas DataFrame

I have a (very large) table stored as a pandas.DataFrame. It contains word counts from texts; the index is the word list:

          one.txt  third.txt  two.txt
a               1          1        0
i               0          0        1
is              1          1        1
no              0          0        1
not             0          1        0
really          1          0        0
sentence        1          1        1
short           2          0        0
think           0          0        1

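I want to sort the word list by the frequency of each word across all texts. I can easily create a Series containing the summed frequency of each word (with the words as index). But how can I sort the DataFrame by that Series?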

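A simple approach would be to add the Series to the DataFrame as a column, sort by it, and then delete it. For performance reasons, I would like to avoid that.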
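For reference, the add-a-column approach described above might look like the following minimal sketch. The helper column name `_total` is a hypothetical placeholder, and the small frame only stands in for the real, much larger table.

```python
import pandas as pd

# Small stand-in for the (much larger) word-count table from the question.
df = pd.DataFrame(
    {
        "one.txt": [1, 0, 1, 0, 0, 1, 1, 2, 0],
        "third.txt": [1, 0, 1, 0, 1, 0, 1, 0, 0],
        "two.txt": [0, 1, 1, 1, 0, 0, 1, 0, 1],
    },
    index=["a", "i", "is", "no", "not", "really", "sentence", "short", "think"],
)

# "_total" is a hypothetical helper column name, not something from the post.
with_total = df.assign(_total=df.sum(axis=1))

# Sort by the helper column (most frequent words first), then drop it again.
sorted_df = with_total.sort_values("_total", ascending=False).drop(columns="_total")

print(sorted_df)
```

Each step here returns a new DataFrame, which is presumably the overhead the question wants to avoid on a very large table.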

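Two other approaches are described here, but one of them duplicates the DataFrame, which is a problem because of its size, and the other creates a new index, whereas I still need the word information later on.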

Source

2013-10-05 fotis j

You can compute the frequencies and use the sort_values method to get the desired index order. Then use df.loc[order.index] to reorder the original DataFrame:

order = df.sum(axis=1).sort_values()
result = df.loc[order.index]

For example,

import pandas as pd

df = pd.DataFrame({
    'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
    'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
    'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]},
    index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

# Total count per word across all texts, most frequent first.
order = df.sum(axis=1).sort_values(ascending=False)
# Reorder the rows of df to follow the sorted index.
print(df.loc[order.index])

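which yields the following (words with equal totals may appear in a different order, depending on the pandas version and sort algorithm):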

          one.txt  third.txt  two.txt
sentence        1          1        1
is              1          1        1
short           2          0        0
a               1          1        0
think           0          0        1
really          1          0        0
not             0          1        0
no              0          0        1
i               0          0        1
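If a reproducible order among words with equal totals matters, a positional variant is one option. The sketch below is an alternative technique, not part of the original answer: it applies NumPy's stable argsort to the negated totals and then selects rows with DataFrame.iloc, so tied words keep their original index order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one.txt": [1, 0, 1, 0, 0, 1, 1, 2, 0],
        "third.txt": [1, 0, 1, 0, 1, 0, 1, 0, 0],
        "two.txt": [0, 1, 1, 1, 0, 0, 1, 0, 1],
    },
    index=["a", "i", "is", "no", "not", "really", "sentence", "short", "think"],
)

# Total count per word as a plain NumPy array.
totals = df.sum(axis=1).to_numpy()

# Negating the totals turns an ascending sort into a descending one;
# kind="stable" keeps words with equal totals in their original order.
positions = np.argsort(-totals, kind="stable")

# iloc selects rows by position, so no label lookup on the index is involved.
result = df.iloc[positions]
print(result)
```

With the sample data this puts "is" and "sentence" first (total 3), then "a" and "short" (total 2), followed by the remaining words in their original order.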

Source

2013-10-05 10:58:55 unutbu

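This solution does not work with the current version of pandas (0.16.2). I tested it with the same data on an earlier version, so I gather that some recent change in pandas broke it. It raises a KeyError. –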

@fotisj: Thanks for the warning. I have revised the answer for pandas 0.16.2. – unutbu
