2015-12-10 82 views

Python TfidfVectorizer: is conditional re-initialization possible?

I am trying to conditionally re-initialize an object.

Say I have the following initialization:

TfidfVectorizer(sublinear_tf=True , decode_error='ignore', analyzer='word', tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')) 

Now I receive a dictionary from a user who wants to add some parameters:

d = {"stop_words":"english"} 

How can I add the dictionary's parameters to the already-initialized object, so that the final version of the object is equivalent to:

TfidfVectorizer(
          stop_words='english', 
          sublinear_tf=True , 
          decode_error='ignore', 
          analyzer='word', 
          tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')) 

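Can I do

TfidfVectorizer(**d)

and still keep the previously initialized parameters? I want to have some default settings in TfidfVectorizer, and then let the user choose the rest.

Is something like this possible?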

Answers


It appears to be possible with set_params(), judging from this small experiment with set_params() and get_params():

from sklearn.feature_extraction.text import TfidfVectorizer 

t = TfidfVectorizer() 

t.get_params() 
Out[23]: 
{'analyzer': u'word', 
'binary': False, 
'charset': None, 
'charset_error': None, 
'decode_error': u'strict', 
'dtype': numpy.int64, 
'encoding': u'utf-8', 
'input': u'content', 
'lowercase': True, 
'max_df': 1.0, 
'max_features': None, 
'min_df': 1, 
'ngram_range': (1, 1), 
'norm': u'l2', 
'preprocessor': None, 
'smooth_idf': True, 
'stop_words': None, 
'strip_accents': None, 
'sublinear_tf': False, 
'token_pattern': u'(?u)\\b\\w\\w+\\b', 
'tokenizer': None, 
'use_idf': True, 
'vocabulary': None} 

t.set_params(binary=True) 
Out[24]: 
TfidfVectorizer(analyzer=u'word', binary=True, charset=None, 
     charset_error=None, decode_error=u'strict', 
     dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', 
     lowercase=True, max_df=1.0, max_features=None, min_df=1, 
     ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True, 
     stop_words=None, strip_accents=None, sublinear_tf=False, 
     token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, 
     vocabulary=None) 

t.set_params(smooth_idf=False) 
Out[25]: 
TfidfVectorizer(analyzer=u'word', binary=True, charset=None, 
     charset_error=None, decode_error=u'strict', 
     dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', 
     lowercase=True, max_df=1.0, max_features=None, min_df=1, 
     ngram_range=(1, 1), norm=u'l2', preprocessor=None, 
     smooth_idf=False, stop_words=None, strip_accents=None, 
     sublinear_tf=False, token_pattern=u'(?u)\\b\\w\\w+\\b', 
     tokenizer=None, use_idf=True, vocabulary=None) 

d = {"stop_words":"english"} 

t.set_params(**d) 
Out[27]: 
TfidfVectorizer(analyzer=u'word', binary=True, charset=None, 
     charset_error=None, decode_error=u'strict', 
     dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', 
     lowercase=True, max_df=1.0, max_features=None, min_df=1, 
     ngram_range=(1, 1), norm=u'l2', preprocessor=None, 
     smooth_idf=False, stop_words='english', strip_accents=None, 
     sublinear_tf=False, token_pattern=u'(?u)\\b\\w\\w+\\b', 
     tokenizer=None, use_idf=True, vocabulary=None) 
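
Applied to the question, the same idea can be sketched as follows. This is only a minimal sketch: it assumes a reasonably recent scikit-learn, leaves out the NLTK tokenizer from the question so that it runs on its own, and the names vectorizer and user_params are illustrative choices, not anything from the original post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Application defaults, taken from the question (tokenizer omitted here).
vectorizer = TfidfVectorizer(sublinear_tf=True, decode_error="ignore", analyzer="word")

# Hypothetical dictionary supplied by the user.
user_params = {"stop_words": "english"}

# set_params() updates the existing object in place and returns it.
result = vectorizer.set_params(**user_params)

params = vectorizer.get_params()
print(params["stop_words"])    # the user's value
print(params["sublinear_tf"])  # the default set at construction is kept
```

Calling TfidfVectorizer(**d) instead would build a fresh object in which everything outside d falls back to the library defaults, so the earlier settings would be lost.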

In addition, the source shows that .set_params() only iterates over the parameters you give it, so the rest are left untouched:

def set_params(self, **params):
    """Set the parameters of this estimator.
    The method works on simple estimators as well as on nested objects
    (such as pipelines). The former have parameters of the form
    ``<component>__<parameter>`` so that it's possible to update each
    component of a nested object.
    Returns
    -------
    self
    """
    if not params:
        # Simple optimisation to gain speed (inspect is slow)
        return self
    valid_params = self.get_params(deep=True)
    for key, value in six.iteritems(params):
        split = key.split('__', 1)
        if len(split) > 1:
            # nested objects case
            name, sub_name = split
            if name not in valid_params:
                raise ValueError('Invalid parameter %s for estimator %s. '
                                 'Check the list of available parameters '
                                 'with `estimator.get_params().keys()`.' %
                                 (name, self))
            sub_object = valid_params[name]
            sub_object.set_params(**{sub_name: value})
        else:
            # simple objects case
            if key not in valid_params:
                raise ValueError('Invalid parameter %s for estimator %s. '
                                 'Check the list of available parameters '
                                 'with `estimator.get_params().keys()`.' %
                                 (key, self.__class__.__name__))
            setattr(self, key, value)
    return self
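
The two branches in that source can be exercised with a short sketch. It assumes a recent scikit-learn (the exact error message differs between versions, so only the exception type is checked); the key not_a_real_option and the step name "tfidf" are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

vectorizer = TfidfVectorizer(sublinear_tf=True)

# Simple-object case: an unknown key is rejected with ValueError
# rather than being stored silently on the estimator.
try:
    vectorizer.set_params(not_a_real_option=1)  # hypothetical bad key
    rejected = False
except ValueError:
    rejected = True

# Nested case: "<component>__<parameter>" reaches into a pipeline step.
pipe = Pipeline([("tfidf", vectorizer)])  # "tfidf" is an arbitrary step name
pipe.set_params(tfidf__stop_words="english")

print(rejected)
print(pipe.get_params()["tfidf__stop_words"])
```

This matters when the dictionary comes from a user: a misspelled key fails loudly, so it may be worth catching ValueError around the set_params() call and reporting the valid keys from get_params().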