2015-05-08 31 views
8

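Pandas cut with infinite upper/lower bounds

The pandas cut() documentation states: "Out of bounds values will be NA in the resulting Categorical object." This causes difficulty when the upper bound is not necessarily known or important. For example: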

cut (weight, bins=[10,50,100,200]) 

will produce the bins:

[(10, 50] < (50, 100] < (100, 200]] 

So cut (250, bins=[10,50,100,200]) produces NaN, as does cut (5, bins=[10,50,100,200]). What I would like is to generate something like > 200 for the first example and < 10 for the second.
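To make the behaviour concrete, here is a minimal sketch. The `weights` Series is made-up sample data, not taken from my real dataset, and the values are wrapped in a Series because `pd.cut` expects array-like input.

```python
import pandas as pd

# Hypothetical sample data: 5 is below the lowest edge, 250 is above the highest
weights = pd.Series([5, 30, 75, 150, 250])

binned = pd.cut(weights, bins=[10, 50, 100, 200])

# The first and last values fall outside (10, 200], so they come back as NaN
print(binned)
print(binned.isna().tolist())
```

Only the three middle values land in a bin; the two out-of-range values are silently turned into missing data.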

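I realize I could do cut (weight, bins=[float("-inf"),10,50,100,200,float("inf")]) or the equivalent, but the report style I am following does not allow things like (200, inf]. I also realize I can specify custom labels via the labels parameter of cut(), but that means remembering to adjust them every time I adjust bins, which could be often.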
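A rough sketch of both workarounds follows, using made-up sample data. The names `edges` and `custom_labels` are just illustrative. Note that with the default right=True the open-ended first bin includes 10 itself, so I label it "<= 10" here.

```python
import pandas as pd

# Hypothetical sample data
weights = pd.Series([5, 30, 75, 150, 250])
edges = [float("-inf"), 10, 50, 100, 200, float("inf")]

# Workaround 1: infinite edges, default interval-style labels
default_binned = pd.cut(weights, bins=edges)

# Workaround 2: hand-written labels, one per bin (len(edges) - 1 of them).
# These have to be edited by hand whenever edges changes.
custom_labels = ["<= 10", "(10, 50]", "(50, 100]", "(100, 200]", "> 200"]
custom_binned = pd.cut(weights, bins=edges, labels=custom_labels)

print(custom_binned.tolist())
```

The second call gives the labels I want, but the label list is tied to the bin list, which is exactly the maintenance burden I would like to avoid.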

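Have I exhausted all the possibilities, or is there something in cut() or elsewhere in pandas that would help me do this? I am thinking about writing a wrapper function for cut() that automatically generates labels in the desired format from the bins, but I wanted to check here first.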

+1

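Are you asking how to set the bin boundaries, or how to label the bin as "200+"? You could set the upper boundary to 'the_data.max() + 1', but I think that if you need a particular format, you will have to set the labels manually. – BrenBarn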
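A minimal sketch of that suggestion, assuming made-up sample data in a Series called `weights` and a hand-written "200+" label:

```python
import pandas as pd

# Hypothetical sample data; the largest value exceeds the last fixed edge (200)
weights = pd.Series([15, 30, 75, 150, 250])

# Use the data's own maximum (plus one) as the final edge.
# This only works when the maximum is above 200, otherwise the
# edges are no longer increasing and pd.cut raises an error.
edges = [10, 50, 100, 200, weights.max() + 1]
labels = ["(10, 50]", "(50, 100]", "(100, 200]", "200+"]

binned = pd.cut(weights, bins=edges, labels=labels)
print(binned.tolist())
```

This covers the upper end only; values at or below the lowest edge would still come back as NaN.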

+0

Yes, I am starting to think that is the only way. –

Answers

4

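After waiting a few days there is still no answer. I think this may be because there really is no way around this other than writing a cut() wrapper function. I am posting my version here and marking the question as answered. If a new answer comes along, I will change that.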

import pandas as pd

def my_cut(x, bins, lower_infinite=True, upper_infinite=True, **kwargs):
    r"""Wrapper around pandas cut() to create infinite lower/upper bounds with proper labeling.

    Takes all the same arguments as pandas cut(), plus two more.

    Args:
        lower_infinite (bool, optional): set whether the lower bound is infinite.
            Default is True. If True, and your first bin element is something like 20,
            the first bin label will be '<= 20' (depending on other cut() parameters).
        upper_infinite (bool, optional): set whether the upper bound is infinite.
            Default is True. If True, and your last bin element is something like 20,
            the last bin label will be '> 20' (depending on other cut() parameters).
        **kwargs: any standard pandas cut() keyword parameters

    Returns:
        out: same as pandas cut() return value
        bins: same as pandas cut() return value (only when retbins=True)
    """

    # Quick passthru if no infinite bounds
    if not lower_infinite and not upper_infinite:
        return pd.cut(x, bins, **kwargs)

    # Setup
    num_labels = len(bins) - 1
    include_lowest = kwargs.get("include_lowest", False)
    right = kwargs.get("right", True)

    # Prepend/append infinities where indicated.
    # list(bins) copies the edges and also accepts tuples and arrays.
    bins_final = list(bins)
    if upper_infinite:
        bins_final.append(float("inf"))
        num_labels += 1
    if lower_infinite:
        bins_final.insert(0, float("-inf"))
        num_labels += 1

    # Decide all boundary symbols based on traditional cut() parameters.
    # With right=True the open-ended first bin is (-inf, b], so it contains b.
    symbol_lower = "<=" if right else "<"
    left_bracket = "(" if right else "["
    right_bracket = "]" if right else ")"
    symbol_upper = ">" if right else ">="

    # Inner function reused in multiple clauses for labeling
    def make_label(i, lb=left_bracket, rb=right_bracket):
        return "{0}{1}, {2}{3}".format(lb, bins_final[i], bins_final[i + 1], rb)

    # Create custom labels
    labels = []
    for i in range(num_labels):
        if i == 0:
            if lower_infinite:
                new_label = "{0} {1}".format(symbol_lower, bins_final[i + 1])
            elif include_lowest:
                new_label = make_label(i, lb="[")
            else:
                new_label = make_label(i)
        elif upper_infinite and i == (num_labels - 1):
            new_label = "{0} {1}".format(symbol_upper, bins_final[i])
        else:
            new_label = make_label(i)

        labels.append(new_label)

    # Pass thru to pandas cut()
    return pd.cut(x, bins_final, labels=labels, **kwargs)
+1

Great! Have you proposed your code for the next pandas version? – Manuel

+0

Wow, I never thought of doing that. I think I will try. Thanks! –

6

You can use float("inf") as the upper bound and -float("inf") as the lower bound in the bins list. That removes the NaN values.
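A short sketch of this, with made-up sample values (np.inf is the same value as float("inf")):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one value below, one inside, one above the fixed edges
weights = pd.Series([5, 30, 250])

binned = pd.cut(weights, bins=[-np.inf, 10, 50, 100, 200, np.inf])

# Every value now falls into some bin, so nothing is NaN
print(binned.isna().any())
print(binned.cat.codes.tolist())
```

The category labels are still interval-style, with the outer bins shown as running to -inf and inf, so custom labels are needed if the report format forbids that.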