
I have never used sklearn's incremental PCA (IncrementalPCA), I am somewhat confused about its parameters, and I can't find a good explanation of them.

I see that there is a batch_size in the constructor, and also that when using the partial_fit method you can pass just a part of your data at a time. So I came up with the following approach:

from sklearn.decomposition import IncrementalPCA

n = df.shape[0]
chunk_size = 100000
iterations = n // chunk_size

ipca = IncrementalPCA(n_components=40, batch_size=1000)

# feed the data chunk by chunk
for i in range(0, iterations):
    ipca.partial_fit(df[i*chunk_size : (i+1)*chunk_size].values)

# the last, possibly smaller, chunk
ipca.partial_fit(df[iterations*chunk_size : n].values)

Now, what I don't understand is the following: when using partial_fit, does batch_size play any role at all, or does it not? And how are the two related? Moreover, if both are taken into account, how should I change their values correctly when I want to increase precision at the cost of a larger memory footprint (or, conversely, reduce memory consumption at the price of lower precision)?

Answers


The docs say:

batch_size : int or None, (default=None)

The number of samples to use for each batch. Only used when calling fit... 

This parameter is not used within partial_fit, where the batch size is controlled by the user.

Bigger batches will increase memory consumption, smaller ones will decrease it. This is also written in the docs:

This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory.
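For example, the np.memmap point can be used like this. A minimal sketch, where the file name 'big_data.dat' and the shape are hypothetical placeholders:

import numpy as np
from sklearn.decomposition import IncrementalPCA

# memory-map the file: rows are read from disk on demand, never all at once
X = np.memmap('big_data.dat', dtype='float64', mode='r', shape=(1000000, 200))

ipca = IncrementalPCA(n_components=40, batch_size=1000)
ipca.fit(X)  # fit() reads X in batch_size-sized slices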

Apart from some checks and parameter heuristics, the whole fit-function looks like this:

for batch in gen_batches(n_samples, self.batch_size_): 
    self.partial_fit(X[batch], check_input=False) 
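In other words, fit with batch_size and a manual partial_fit loop do essentially the same work. A sketch of the two equivalent usages, assuming X is your data array:

from sklearn.decomposition import IncrementalPCA
from sklearn.utils import gen_batches

# Option 1: let fit() chunk the data according to batch_size
ipca = IncrementalPCA(n_components=40, batch_size=1000)
ipca.fit(X)

# Option 2: chunk manually with partial_fit (batch_size is ignored here)
ipca = IncrementalPCA(n_components=40)
for batch in gen_batches(X.shape[0], 1000):
    ipca.partial_fit(X[batch])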

OK, so it basically does the same thing I was doing manually. Thanks for the help. – Marko


Here is some incremental PCA code based on https://github.com/kevinhughes27/pyIPCA, which is an implementation of the CCIPCA method.

import numpy as np
from scipy import linalg as la

class CCIPCA:
    def __init__(self, n_components, n_features, amnesic=2.0, copy=True):
        self.n_components = n_components
        self.n_features = n_features
        self.copy = copy
        self.amnesic = amnesic
        self.iteration = 0
        self.mean_ = np.zeros([self.n_features], float)
        self.components_ = np.ones((self.n_components, self.n_features)) / \
            (self.n_features * self.n_components)

    def partial_fit(self, u):
        """Update the model with a single sample u (1-D array of n_features)."""
        n = float(self.iteration)
        V = self.components_

        # amnesic learning params
        if n <= int(self.amnesic):
            w1 = float(n + 2 - 1) / float(n + 2)
            w2 = float(1) / float(n + 2)
        else:
            w1 = float(n + 2 - self.amnesic) / float(n + 2)
            w2 = float(1 + self.amnesic) / float(n + 2)

        # update mean
        self.mean_ = w1 * self.mean_ + w2 * u

        # mean center u
        u = u - self.mean_

        # update components
        for j in range(self.n_components):
            if j > n:
                pass
            elif j == n:
                # initialize a new component with the current sample
                V[j, :] = u
            else:
                # update the component, then deflate u against it
                V[j, :] = w1 * V[j, :] + w2 * np.dot(u, V[j, :]) * u / la.norm(V[j, :])
                normedV = V[j, :] / la.norm(V[j, :])
                normedV = normedV.reshape((self.n_features, 1))
                u = u - np.dot(np.dot(u, normedV), normedV.T)

        self.iteration += 1
        self.components_ = V / la.norm(V)

        return

    def post_process(self):
        # component norms serve as (unnormalized) variance estimates
        self.explained_variance_ratio_ = np.sqrt(np.sum(self.components_**2, axis=1))
        # sort components by decreasing explained variance
        idx = np.argsort(-self.explained_variance_ratio_)
        self.explained_variance_ratio_ = self.explained_variance_ratio_[idx]
        self.components_ = self.components_[idx, :]
        self.explained_variance_ratio_ = (self.explained_variance_ratio_ /
                                          self.explained_variance_ratio_.sum())
        # renormalize each component to unit length
        for r in range(self.components_.shape[0]):
            d = np.sqrt(np.dot(self.components_[r, :], self.components_[r, :]))
            self.components_[r, :] /= d

You can use it like this (assuming the class above is saved as ccipca.py):

import numpy as np
import pandas as pd
import ccipca

df = pd.read_csv('iris.csv')
df = np.array(df)[:, :4].astype(float)
pca = ccipca.CCIPCA(n_components=2, n_features=4)
print(df[0, :])
# feed the samples one at a time
for i in range(150):
    pca.partial_fit(df[i, :])
pca.post_process()

The resulting eigenvectors/eigenvalues will not be exactly the same as with batch PCA. The results are approximate, but they are useful.
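If you want to see how close the approximation is, one option is to compare the components against scikit-learn's batch PCA on the same data. A quick sketch, continuing from the snippet above (signs of components may be flipped, so compare absolute cosine similarity):

import numpy as np
from sklearn.decomposition import PCA

batch_pca = PCA(n_components=2)
batch_pca.fit(df)

for k in range(2):
    a = pca.components_[k]
    b = batch_pca.components_[k]
    # cosine similarity close to 1 means the directions agree
    cos = abs(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print('component %d cosine similarity: %.3f' % (k, cos))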