scikit-learn: imputing a feature's mean within groups defined by the nominal values of another feature

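I want to impute the mean of a feature, but compute that mean only from the examples that share the same category/nominal value in another column. Is this possible with scikit-learn's Imputer class? That would make it easier to add to a pipeline.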

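For example:

Using the Titanic dataset from Kaggle: source

How would I go about imputing the mean fare per pclass? The idea behind this is that ticket prices differ widely between passenger classes.

Update: after discussing this with a few people, the phrase I should have used is "imputing the within-class mean".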

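I've read Vivek's comment below, and when I get the time I'll build a generic pipeline function that does what I want :) I have a good idea of how to do it and will post it once it is finished.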


2017-03-10 TheJokersThief

You can split the data by 'pclass', compute the mean 'fare' for each part, and then stack the parts again to rebuild the full data. – Vivek Kumar

Thanks @VivekKumar! I'll look into making that part of my pipeline – TheJokersThief

You can look at [this example](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py) for hints on implementing your own class that can be used in a pipeline
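As a rough illustration of what such a class could look like, here is a minimal sketch of a custom transformer that learns one mean per group in `fit` and applies it in `transform`. The class name `GroupMeanImputer`, its parameters, and the toy frames are all hypothetical, and it assumes the input is a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class GroupMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fill target_col with the mean of its group_col group."""

    def __init__(self, group_col, target_col):
        self.group_col = group_col
        self.target_col = target_col

    def fit(self, X, y=None):
        # Learn one mean per group from the training data only
        self.group_means_ = X.groupby(self.group_col)[self.target_col].mean()
        return self

    def transform(self, X):
        X = X.copy()
        # Look up each row's group mean; groups unseen during fit map to NaN
        fill = X[self.group_col].map(self.group_means_)
        X[self.target_col] = X[self.target_col].fillna(fill)
        return X


# Made-up data for illustration
train = pd.DataFrame({"pclass": [1, 1, 3, 3], "fare": [80.0, 100.0, 8.0, 10.0]})
test = pd.DataFrame({"pclass": [1, 3], "fare": [np.nan, np.nan]})

pipe = Pipeline([("impute_fare", GroupMeanImputer("pclass", "fare"))])
pipe.fit(train)
result = pipe.transform(test)
print(result)
```

Learning the means in `fit` means that test rows are filled with statistics from the training data, which is usually what you want inside a pipeline.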

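So below is a very simple approach to my problem that only handles means. A more robust implementation would probably build on the Imputer class in scikit-learn, which would mean it could also do mode, median, etc., and would be better at handling sparse/dense matrices.

This is based on Vivek Kumar's comment on the original question, which suggested splitting the data into stacks and reassembling them.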

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WithinClassMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, replace_col_index, class_col_index=None, missing_values=np.nan):
        self.missing_values = missing_values
        self.replace_col_index = replace_col_index
        self.y = None
        self.class_col_index = class_col_index

    def fit(self, X, y=None):
        self.y = y
        return self

    def _missing_mask(self, column):
        # NaN never compares equal to itself, so it needs np.isnan instead of ==
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(column.astype(float))
        return column == self.missing_values

    def transform(self, X):
        X = np.asarray(X)
        if self.class_col_index is None:
            # If we're using the dependent variable as the class labels
            if self.y is None or len(self.y) != len(X):
                return X
            labels = np.asarray(self.y)
        else:
            # If we're using nominal values within a binarised feature (i.e. the
            # classes are unique values within a nominal column - e.g. sex)
            labels = X[:, self.class_col_index]

        missing = self._missing_mask(X[:, self.replace_col_index])
        stacks = []
        for aclass in np.unique(labels):
            in_class = labels == aclass
            # Boolean indexing returns copies, so X itself is left untouched
            with_missing = X[in_class & missing]
            without_missing = X[in_class & ~missing]

            # Skip classes with nothing to fill or nothing to average over
            if len(with_missing) > 0 and len(without_missing) > 0:
                # Calculate mean from examples without missing values
                column = without_missing[:, self.replace_col_index]
                mean = np.mean(column.astype(float))
                # Broadcast mean to all missing values
                with_missing[:, self.replace_col_index] = mean

            stacks.append(np.concatenate((with_missing, without_missing)))

        # Reassemble our stacks of values. Note: rows come back grouped by
        # class, so the original row order is not preserved.
        return np.concatenate(stacks) if stacks else X
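Two things are worth keeping in mind with this kind of imputer: NaN never compares equal to itself, so a mask built with `== np.nan` selects nothing, and concatenating per-class stacks returns the rows grouped by class rather than in their original order. The sketch below is one possible way around both, using `np.isnan` and assigning through boolean masks. The array values and the names `X_filled`, `class_col`, and `replace_col` are made up for illustration.

```python
import numpy as np

# Columns: [pclass, fare]; the values are made up for illustration
X = np.array([
    [1.0, 80.0],
    [3.0, np.nan],
    [1.0, np.nan],
    [3.0, 8.0],
    [1.0, 100.0],
    [3.0, 10.0],
])
class_col, replace_col = 0, 1

# An equality test against NaN matches nothing, even where values are missing
print((X[:, replace_col] == np.nan).any())  # False

missing = np.isnan(X[:, replace_col])
X_filled = X.copy()
for aclass in np.unique(X[:, class_col]):
    in_class = X[:, class_col] == aclass
    # Mean over the observed values of this class only
    class_mean = X[in_class & ~missing, replace_col].mean()
    # Assign through the mask so every row stays at its original position
    X_filled[in_class & missing, replace_col] = class_mean

print(X_filled)
```

Keeping the row order matters inside a pipeline, because later steps pair each row of X with the corresponding entry of y.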


2017-03-15 23:31:14 TheJokersThief
