2012-01-23

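Multivariate linear regression - Python - implementation problem

I want to implement linear regression with multiple variables (actually just 2). I am using the data from Stanford's ML class. I got it working correctly for the single-variable case. The same code should have worked for multiple variables, but it does not.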

Link to the data:

http://s3.amazonaws.com/mlclass-resources/exercises/mlclass-ex1.zip

Feature normalization:

''' This is for the regression with multiple variables problem . You have to normalize features before doing anything. Lets get started''' 
from __future__ import division 
import os,sys 
from math import * 

def mean(f,col): 
    #This is to find the mean of a feature 
    sigma = 0 
    count = 0 
    data = open(f,'r') 
    for line in data: 
     points = line.split(",") 
     sigma = sigma + float(points[col].strip("\n")) 
     count+=1 
    data.close() 
    return sigma/count 
def size(f): 
    count = 0 
    data = open(f,'r') 

    for line in data: 
     count +=1 
    data.close() 
    return count 
def standard_dev(f,col): 
    #Calculate the standard_dev . Formula : Sqrt (Sigma (x - x') ** (x-x'))/N) 
    data = open(f,'r') 
    sigma = 0 
    mean = 0 
    if(col==0): 
     mean = mean_area 
    else: 
     mean = mean_bedroom 
    for line in data: 
     points = line.split(",") 
     sigma = sigma + (float(points[col].strip("\n")) - mean) ** 2 
    data.close() 
    return sqrt(sigma/SIZE) 

def substitute(f,fnew): 
    ''' Take the old file. 
     1. Subtract the mean values from each feature 
     2. Scale it by dividing with the SD 
    ''' 
    data = open(f,'r') 
    data_new = open(fnew,'w') 
    for line in data: 
     points = line.split(",") 
     new_area = (float(points[0]) - mean_area)/sd_area 
     new_bedroom = (float(points[1].strip("\n")) - mean_bedroom)/sd_bedroom 
     data_new.write("1,"+str(new_area)+ ","+str(new_bedroom)+","+str(points[2].strip("\n"))+"\n") 
    data.close() 
    data_new.close() 
global mean_area 
global mean_bedroom 
mean_bedroom = mean(sys.argv[1],1) 
mean_area = mean(sys.argv[1],0) 
print 'Mean number of bedrooms',mean_bedroom 
print 'Mean area',mean_area 
global SIZE 
SIZE = size(sys.argv[1]) 
global sd_area 
global sd_bedroom 
sd_area = standard_dev(sys.argv[1],0) 
sd_bedroom=standard_dev(sys.argv[1],1) 
substitute(sys.argv[1],sys.argv[2]) 

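I implemented the mean and standard deviation in code myself instead of using NumPy/SciPy. After storing the normalized values in a file, a snapshot of it looks like this: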

X1 X2 X3 COST OF HOUSE

1,0.131415422021,-0.226093367578,399900 
1,-0.509640697591,-0.226093367578,329900 
1,0.507908698618,-0.226093367578,369000 
1,-0.743677058719,-1.5543919021,232000 
1,1.27107074578,1.10220516694,539900 
1,-0.0199450506651,1.10220516694,299900 
1,-0.593588522778,-0.226093367578,314900 
1,-0.729685754521,-0.226093367578,198999 
1,-0.789466781548,-0.226093367578,212000 
1,-0.644465992588,-0.226093367578,242500 

I then run the regression on it to find the parameters. The code for that is below:

''' The plan is to rewrite and this time, calculate cost each time to ensure its reducing. Also make it enough to handle multiple variables ''' 
from __future__ import division 
import os,sys 

def computecost(X,Y,theta): 
    #X is the feature vector, Y is the predicted variable 
    h_theta=calculatehTheta(X,theta) 
    delta = (h_theta - Y) * (h_theta - Y) 
    return (1/194) * delta 



def allCost(f,no_features): 
    theta=[0,0] 
    sigma=0 
    data = open(f,'r') 
    for line in data: 
     X=[] 
     Y=0 
     points=line.split(",") 
     for i in range(no_features): 
      X.append(float(points[i])) 
     Y=float(points[no_features].strip("\n")) 
     sigma=sigma+computecost(X,Y,theta) 
    return sigma 

def calculatehTheta(points,theta): 
    #This takes a file which has (1,feature1,feature2,so ... on) 
    #print 'Points are',points 
    sigma = 0 
    for i in range(len(theta)): 

     sigma = sigma + theta[i] * float(points[i]) 
    return sigma 



def gradient_Descent(f,no_iters,no_features,theta): 
    ''' Calculate (h(x) - y) * xj(i) . And then subtract it from thetaj . Continue for 1500 iterations and you will have your answer''' 


    X=[] 
    Y=0 
    sigma=0 
    alpha=0.01 
    for i in range(no_iters): 
     for j in range(len(theta)): 
      data = open(f,'r') 
      for line in data: 
       points=line.split(",") 
       for i in range(no_features): 
        X.append(float(points[i])) 
       Y=float(points[no_features].strip("\n")) 
       h_theta = calculatehTheta(points,theta) 
       delta = h_theta - Y 
       sigma = sigma + delta * float(points[j]) 
      data.close() 
      theta[j] = theta[j] - (alpha/97) * sigma 

      sigma = 0 
    print theta 

print allCost(sys.argv[1],2) 
print gradient_Descent(sys.argv[1],1500,2,[0,0,0]) 

It prints the following as the parameters:

[-3.8697149722857996e-14,0.02030369056348706,0.979706406501678]

All three are horribly wrong :( The exact same approach works with a single variable.

Thanks!

Answer

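The global variables and the quadruply nested loops worry me. So does reading and writing the data to files multiple times.

Is your data so large that it does not fit in memory?

Why not use the csv module for file handling?

Why not use NumPy for the numerical part?

Don't reinvent the wheel.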
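To illustrate the csv suggestion, here is a minimal sketch that reads the whole file into memory once instead of reopening it for every statistic. The helper name `load_rows` and the three sample lines are assumptions for illustration (the lines follow the area, bedrooms, price layout of the course file), and it is written for Python 3:

```python
import csv
import io

def load_rows(fileobj):
    """Read comma-separated numeric rows once into a list of float lists."""
    return [[float(value) for value in row] for row in csv.reader(fileobj) if row]

# Hypothetical in-memory sample; with a real file you would pass open(path, newline="")
sample = io.StringIO("2104,3,399900\n1600,3,329900\n2400,3,369000\n")
rows = load_rows(sample)
```

Once `rows` is a plain list, the mean, standard deviation, and gradient steps can all work on it without touching the disk again.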

Assuming your data entries are rows, you can normalize your data and do a least-squares fit in two lines:

normData = (data-data.mean(axis = 0))/data.std(axis = 0) 
c = numpy.dot(numpy.linalg.pinv(normData),prices) 
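For a version that can be run as is, here is a rough sketch of the same idea. The five sample rows, the variable names, and the added column of ones are assumptions for illustration; the column of ones is there so the fit includes an intercept term, like the leading 1 in the question's normalized file:

```python
import numpy as np

# Hypothetical sample in the course file's layout: area, bedrooms -> price
data = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 3.0],
                 [1416.0, 2.0], [3000.0, 4.0]])
prices = np.array([399900.0, 329900.0, 369000.0, 232000.0, 539900.0])

# Column-wise normalization: subtract the mean, divide by the standard deviation
norm_data = (data - data.mean(axis=0)) / data.std(axis=0)

# Prepend a column of ones so theta[0] acts as the intercept
X = np.column_stack([np.ones(len(norm_data)), norm_data])

# Least-squares solution via the pseudo-inverse
theta = np.dot(np.linalg.pinv(X), prices)
```

Because the normalized columns have zero mean, `theta[0]` should come out equal to the mean price, which is a quick sanity check on the result.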

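In reply to the original poster's comment:

OK, then the only other advice I can give you is to try to break the code into small pieces, so that it is easier to see what is going on and easier to sanity-check each small part.

This may not be the problem, but you use i as the index for two of the loops in that quadruple loop. Cutting the code into smaller scopes avoids this kind of problem.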
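As a rough illustration of what "small pieces" could look like without NumPy, here is a sketch of batch gradient descent split into short functions. The function names and the row layout (features first, target last) are my assumptions, not the poster's code. Splitting it this way also makes it easy to check which column is used as the target: in the posted call `gradient_Descent(sys.argv[1],1500,2,[0,0,0])`, `no_features` is 2 while each row has three feature columns (the leading 1, area, bedrooms), so `points[no_features]` appears to read the bedroom column rather than the price, which would be consistent with the printed parameters being close to [0, 0, 1]. The sketch also divides by the actual number of rows instead of a hardcoded 97.

```python
def predict(theta, x):
    """Hypothesis h(x) = sum over j of theta[j] * x[j]."""
    return sum(t * xj for t, xj in zip(theta, x))

def cost(rows, theta):
    """Squared-error cost; each row is the features followed by the target."""
    m = len(rows)
    return sum((predict(theta, r[:-1]) - r[-1]) ** 2 for r in rows) / (2.0 * m)

def gradient_step(rows, theta, alpha):
    """One batch update; every theta[j] is computed from the same old theta."""
    m = len(rows)
    errors = [predict(theta, r[:-1]) - r[-1] for r in rows]
    return [
        t - (alpha / m) * sum(e * r[j] for e, r in zip(errors, rows))
        for j, t in enumerate(theta)
    ]

def gradient_descent(rows, alpha=0.01, iterations=1500):
    """Start from all-zero parameters and apply repeated batch updates."""
    theta = [0.0] * (len(rows[0]) - 1)
    for _ in range(iterations):
        theta = gradient_step(rows, theta, alpha)
    return theta

# First rows of the normalized snapshot in the question: 1, area, bedrooms, price
rows = [
    [1.0, 0.131415422021, -0.226093367578, 399900.0],
    [1.0, -0.509640697591, -0.226093367578, 329900.0],
    [1.0, 0.507908698618, -0.226093367578, 369000.0],
    [1.0, -0.743677058719, -1.5543919021, 232000.0],
    [1.0, 1.27107074578, 1.10220516694, 539900.0],
]
theta = gradient_descent(rows)
```

Each function can be tested on its own, for example by checking that `cost(rows, theta)` is lower after the descent than it was with all-zero parameters.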

I think it has been years since I wrote an explicit nested loop or declared a global variable.

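The data is indeed small. I did not want to use NumPy; I wanted to try implementing it from scratch first and then move to NumPy, hence this code. The data entries are in fact rows, as in lines of the file. – crazyaboutliv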

Thanks, I will just learn some NumPy instead of doing it this way :) 2 lines is pretty cool compared to 4 loops :D – crazyaboutliv