Good binning function for highly skewed numeric variables in a pandas DataFrame


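Can you suggest a good function for binning highly skewed data into at most a desired number of bins? For example, I want to split every numeric variable in a DataFrame into 10 bins, but some variables are highly skewed, such as a discrete variable with only 5 possible values; that variable should be split into 5 bins instead. I tried the cut function in pandas, but the results were not promising. Can you help me find a good function for this?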

Source

2015-11-06 Sandeep

If a particular column can only take certain values, you can use the Series' unique() method to find those values, for example:

import pandas as pd
import matplotlib.pyplot as plt  # the pyplot submodule must be imported explicitly

data_series = pd.Series([0, 1, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0])
# sort the unique values: histogram bin edges must be increasing
unique_vals = sorted(data_series.unique())
if len(unique_vals) > 0.95 * len(data_series):
    # almost all values are unique - plot a normal histogram
    plt.hist(data_series)
else:
    # many non-unique values - put each discrete value in its own bin
    # bins specifies the edges of the bins - need an extra edge to create a bin for the maximal value
    bins = unique_vals + [max(unique_vals) + 1]
    counts, edges, patches = plt.hist(data_series, bins=bins)

This produces some odd-looking histograms if your discrete values are very unevenly spaced.
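For the binning itself (as opposed to plotting), the same unique-value check can be turned into a small helper. The following is only a sketch: `bin_column` and `max_bins` are hypothetical names, not from the question or from pandas, and falling back to `pd.qcut` with `duplicates="drop"` is one possible choice for the skewed continuous case, not the only one.

```python
import pandas as pd

def bin_column(series, max_bins=10):
    """Hypothetical helper: integer bin codes using at most max_bins bins."""
    values = sorted(series.dropna().unique())
    if len(values) <= max_bins:
        # Few distinct values: each value gets its own bin, numbered in sorted order.
        return series.map({v: i for i, v in enumerate(values)})
    # Many distinct values: quantile bins. duplicates="drop" merges repeated
    # quantile edges, so heavily skewed data can end up with fewer bins.
    return pd.qcut(series, q=max_bins, labels=False, duplicates="drop")

# Example data (made up for illustration).
df = pd.DataFrame({
    "discrete": [0, 1, 2, 3, 4] * 22,            # only 5 possible values
    "skewed": [0] * 90 + list(range(1, 21)),     # mostly zeros, long tail
})
binned = df.select_dtypes(include="number").apply(bin_column)
```

Here the discrete column keeps its 5 values as 5 bins, while the skewed column collapses to however many distinct quantile edges survive, which is never more than `max_bins`.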

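A more natural way to plot the discrete case is probably a bar chart, for which you can use value_counts (though you may need to adjust the bar width depending on how close together your discrete values are):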

import matplotlib.pyplot as plt
plt.bar(data_series.value_counts().index, data_series.value_counts())
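One way to handle the bar-width adjustment mentioned above is to derive the width from the smallest gap between neighbouring distinct values. This is a sketch under assumptions: the sample data, the 0.8 scaling factor, and the use of the non-interactive Agg backend are all illustrative choices.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up data with unevenly spaced discrete values.
uneven = pd.Series([0, 1, 1, 1.5, 1.5, 10, 10, 10])
counts = uneven.value_counts().sort_index()

# Smallest gap between neighbouring distinct values; fall back to 1.0
# when there is only one distinct value.
gaps = np.diff(counts.index.to_numpy())
width = 0.8 * gaps.min() if len(gaps) else 1.0  # 0.8 leaves a little space between bars

bars = plt.bar(counts.index, counts.values, width=width)
```

With a width tied to the smallest gap, bars for close values such as 1 and 1.5 do not overlap, at the cost of looking thin next to isolated values such as 10.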

Source

2015-11-07 14:22:13 danielstn
