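scikit-learn: imputing a feature's mean within groups defined by another feature's nominal values

I want to impute the mean of a feature, but compute that mean only from examples that share the same category/nominal value in another column. Is this possible with scikit-learn's Imputer class? Doing it that way would make it easier to add to a pipeline.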

For example:

Using the Kaggle Titanic dataset: source

How would I go about imputing the mean fare per pclass? The reasoning is that ticket costs differ widely between passengers in different classes.
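Outside of a pipeline, one way to get this behaviour is a group-wise fill in pandas. The sketch below assumes the data sits in a DataFrame with `pclass` and `fare` columns; the small frame is made up for illustration and is not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "fare": [80.0, np.nan, 100.0, 8.0, 10.0, np.nan],
})

# Mean fare of each row's own pclass (missing fares are skipped by the mean)
class_mean = df.groupby("pclass")["fare"].transform("mean")

# Fill only the missing fares, each with the mean of its own class
df["fare_imputed"] = df["fare"].fillna(class_mean)
print(df)
```

Because `transform` returns a Series aligned with the original index, the row order is left unchanged.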

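Update: after discussing this with a few people, the phrase I should have used is "impute the mean within class".

I have read Vivek's comment below. When I get time to work on this, I will build a generic pipeline function :) I have a good idea of how to do it and will post it once it is finished.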

You can split the data by 'pclass', impute 'fare' for each part, and then stack the parts back together to rebuild the complete data.

Thanks @VivekKumar! I will look into making that part of my pipeline – TheJokersThief

You can look at [this example](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py) for hints on implementing your own class that can be used in a pipeline.
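The split, impute, and stack suggestion from the comments can be sketched with plain NumPy. This is only an illustration under assumptions: the array layout (column 0 is `pclass`, column 1 is `fare`) and the values are hypothetical, and every class is assumed to have at least one observed fare.

```python
import numpy as np

# Hypothetical layout: column 0 = pclass, column 1 = fare
X = np.array([
    [1.0, 80.0],
    [3.0, 8.0],
    [1.0, np.nan],
    [3.0, np.nan],
    [1.0, 100.0],
    [3.0, 10.0],
])

parts = []
for pclass in np.unique(X[:, 0]):
    part = X[X[:, 0] == pclass]        # boolean indexing returns a copy
    missing = np.isnan(part[:, 1])     # NaN must be detected with np.isnan
    part[missing, 1] = part[~missing, 1].mean()  # impute within this class only
    parts.append(part)

# Stack the per-class pieces back into one array
X_imputed = np.vstack(parts)
print(X_imputed)
```

Note that stacking leaves the rows grouped by class rather than in their original order, so any separately stored labels would need to be reordered the same way.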

Answer

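Below is a very simple approach to my problem that only handles means. A more robust implementation would probably build on the Imputer class in scikit-learn, which would let it also impute the mode, median, etc., and would handle sparse/dense matrices better.

It is based on Vivek Kumar's comment on the original question, which suggested splitting the data into stacks and reassembling them.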

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_col_index, class_col_index=None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def fit(self, X, y=None):
        # Remember the labels so transform can group by the dependent variable
        self.y = y
        return self

    def _is_missing(self, column):
        # NaN never compares equal to anything, so it needs np.isnan
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(column)
        return column == self.missing_values

    def transform(self, X):
        if len(X) <= 1:
            return X

        if self.class_col_index is None:
            # If we're using the dependent variable
            if self.y is None or len(self.y) != len(X):
                return X
            labels = np.asarray(self.y)
        else:
            # If we're using nominal values within a binarised feature (i.e. the classes
            # are unique values within a nominal column - e.g. sex)
            labels = X[:, self.class_col_index]

        missing = self._is_missing(X[:, self.replace_col_index])
        stacks = []
        for aclass in np.unique(labels):
            # Boolean indexing returns copies, so X itself is not modified
            with_missing = X[(labels == aclass) & missing]
            without_missing = X[(labels == aclass) & ~missing]

            # Calculate mean from examples without missing values
            mean = np.mean(without_missing[:, self.replace_col_index])

            # Broadcast mean to all missing values
            with_missing[:, self.replace_col_index] = mean

            stacks.append(np.concatenate((with_missing, without_missing)))

        if stacks:
            # Reassemble our stacks of values.
            # Note: rows come back grouped by class, not in their original order.
            X = np.concatenate(stacks)

        return X
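As a possible direction for the more robust version mentioned above, here is a minimal sketch that fits one imputer per class and writes the values back in place, so rows keep their original order. It uses `sklearn.impute.SimpleImputer`, which takes the place of the older `Imputer` class in scikit-learn 0.20 and later. The class name `GroupwiseImputer` and its parameters are hypothetical. The sketch assumes a dense numeric array, a grouping column without missing values, and at least one observed value per group.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer


class GroupwiseImputer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: one SimpleImputer per value of a grouping column."""

    def __init__(self, replace_col_index, class_col_index, strategy="mean"):
        self.replace_col_index = replace_col_index
        self.class_col_index = class_col_index
        self.strategy = strategy

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.imputers_ = {}
        for group in np.unique(X[:, self.class_col_index]):
            rows = X[:, self.class_col_index] == group
            # Keep the column 2-D, as scikit-learn estimators expect
            column = X[rows][:, [self.replace_col_index]]
            self.imputers_[group] = SimpleImputer(strategy=self.strategy).fit(column)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy, so the caller's array is untouched
        for group, imputer in self.imputers_.items():
            rows = X[:, self.class_col_index] == group
            if rows.any():
                column = X[rows][:, [self.replace_col_index]]
                # Assumes the group had observed values at fit time
                X[rows, self.replace_col_index] = imputer.transform(column).ravel()
        return X


# Hypothetical toy data: column 0 = pclass, column 1 = fare
X = np.array([[1, 80], [3, 8], [1, np.nan], [3, np.nan], [1, 100], [3, 10]], dtype=float)
imputer = GroupwiseImputer(replace_col_index=1, class_col_index=0)
X_out = imputer.fit_transform(X)
print(X_out)
```

Because the per-class statistics are learned in `fit` and stored, `transform` can also be applied to new data of a different length, and passing `strategy="median"` or `strategy="most_frequent"` switches the statistic without other changes.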