Locking steps in a scikit-learn pipeline (preventing refit)

Is there a convenient mechanism for locking steps in a scikit-learn pipeline so that they are not refit on pipeline.fit()? For example:

import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.pipeline import Pipeline 
from sklearn.datasets import fetch_20newsgroups 

data = fetch_20newsgroups(subset='train') 
firsttwoclasses = data.target<=1 
y = data.target[firsttwoclasses] 
X = np.array(data.data)[firsttwoclasses] 

pipeline = Pipeline([ 
    ("vectorizer", CountVectorizer()), 
    ("estimator", LinearSVC()) 
]) 

# fit initial step on subset of data, perhaps an entirely different subset 
# this particular example would not be very useful in practice 
pipeline.named_steps["vectorizer"].fit(X[:400]) 
X2 = pipeline.named_steps["vectorizer"].transform(X) 

# fit estimator on all data without refitting vectorizer 
pipeline.named_steps["estimator"].fit(X2, y) 
print(len(pipeline.named_steps["vectorizer"].vocabulary_)) 

# fitting entire pipeline refits vectorizer 
# is there a convenient way to lock the vectorizer without doing the above? 
pipeline.fit(X, y) 
print(len(pipeline.named_steps["vectorizer"].vocabulary_))

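The only way I can think of to do this without the intermediate transform is to define a custom estimator class (as seen here) whose fit method does nothing and whose transform method delegates to the pre-fit transformer. Is this the only way?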
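For reference, the custom-class idea described above could look roughly like the sketch below. The name FrozenTransformer and the toy corpus are hypothetical and are not part of scikit-learn or of the original example; the sketch assumes the wrapped transformer has already been fitted.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that delegates transform to an already fitted transformer."""

    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        # Deliberately a no-op, so Pipeline.fit leaves the wrapped step untouched.
        return self

    def transform(self, X):
        return self.fitted_transformer.transform(X)


# Toy corpus (an assumption, standing in for the 20 newsgroups data).
docs = ["red apple", "green apple", "fast car", "slow car"]
labels = [0, 0, 1, 1]

# Fit the vectorizer on a subset only, then freeze it inside the pipeline.
vectorizer = CountVectorizer().fit(docs[:2])
frozen_pipeline = Pipeline([
    ("vectorizer", FrozenTransformer(vectorizer)),
    ("estimator", LinearSVC()),
])
frozen_pipeline.fit(docs, labels)

# The vocabulary still reflects only the first two documents.
vocab_size = len(vectorizer.vocabulary_)
print(vocab_size)
```

One caveat with this sketch: sklearn.base.clone (used by GridSearchCV and cross_val_score) also clones estimator-valued parameters, so the wrapped transformer would lose its fitted state there. It is only suited to direct fit calls.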

Source

2017-02-09 dood

Looking at the code, the Pipeline object does not seem to offer any functionality for this: calling .fit() on the pipeline fits every stage.

The best quick-and-dirty solution I could come up with is to monkey-patch the stage's fitting functions:

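vectorizer = pipeline.named_steps["vectorizer"]
vectorizer.fit(X[:400])
# disable .fit() on the vectorizer step; functions stored on an instance
# are not bound, so the replacements take no self argument
vectorizer.fit = lambda X, y=None, **kwargs: vectorizer
# Pipeline.fit calls fit_transform(X, y), so accept and ignore y
vectorizer.fit_transform = lambda X, y=None, **kwargs: vectorizer.transform(X)

pipeline.fit(X, y)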
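A quick way to check that the patch holds is to compare the vocabulary size before and after pipeline.fit. The sketch below uses a small made-up corpus instead of the 20 newsgroups data, and the names toy_pipeline, vec, size_before and size_after are mine. It also assumes the pipeline is built without the memory argument, since with caching enabled the pipeline clones its transformers and the patched attributes would be lost.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (an assumption, for illustration only).
docs = ["red apple", "green apple", "fast car", "slow car"]
labels = [0, 0, 1, 1]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC()),
])

# Fit the vectorizer on a subset and record its vocabulary size.
vec = toy_pipeline.named_steps["vectorizer"]
vec.fit(docs[:2])
size_before = len(vec.vocabulary_)

# Shadow the class methods with plain functions stored on the instance.
# They are not bound, so they take no self parameter.
vec.fit = lambda X, y=None, **kwargs: vec
vec.fit_transform = lambda X, y=None, **kwargs: vec.transform(X)

# Fitting the whole pipeline now only trains the final estimator.
toy_pipeline.fit(docs, labels)
size_after = len(vec.vocabulary_)
print(size_before, size_after)
```

If the two printed sizes differ, the vectorizer was refit and the patch did not take effect.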

Source

2017-06-22 19:50:24
