2016-04-03

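How do I deal with GPU memory leaks in Torch?

My machine's GPU has 2 GB of memory. The first time I run the code below, I get no errors. The second time I run it, however, I get an out-of-memory error. As a short-term remedy, the only thing I have been able to do is convert the data to float32 with torch.Tensor.float(). The problem still persists, though: the occupied memory is not released after the process finishes, or after the process is killed while it is running. The same thing happens with the machine's RAM. How is one supposed to prevent memory leaks in Torch, or release the memory?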

require 'nn' 
require 'image' 
require 'cunn' 
require 'paths' 



collectgarbage(); collectgarbage() 
if (not paths.filep("cifar10torchsmall.zip")) then 
    os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip') 
    os.execute('unzip cifar10torchsmall.zip') 
end 
trainset = torch.load('cifar10-train.t7') 
testset = torch.load('cifar10-test.t7') 
classes = {'airplane', 'automobile', 'bird', 'cat', 
      'deer', 'dog', 'frog', 'horse', 'ship', 'truck'} 

setmetatable(trainset, 
    {__index = function(t, i) 
        return {t.data[i], t.label[i]} 
       end} 
); 
trainset.data = trainset.data:double() -- convert the data from a ByteTensor to a DoubleTensor. 

function trainset:size() 
    return self.data:size(1) 
end 

mean = {} -- store the mean, to normalize the test set in the future 
stdv = {} -- store the standard-deviation for the future 
for i=1,3 do -- over each image channel 
    mean[i] = trainset.data[{ {}, {i}, {}, {} }]:mean() -- mean estimation 
    print('Channel ' .. i .. ', Mean: ' .. mean[i]) 
    trainset.data[{ {}, {i}, {}, {} }]:add(-mean[i]) -- mean subtraction 

    stdv[i] = trainset.data[{ {}, {i}, {}, {} }]:std() -- std estimation 
    print('Channel ' .. i .. ', Standard Deviation: ' .. stdv[i]) 
    trainset.data[{ {}, {i}, {}, {} }]:div(stdv[i]) -- std scaling 
end 


testset.data = testset.data:double() -- convert from Byte tensor to Double tensor 
for i=1,3 do -- over each image channel 
    testset.data[{ {}, {i}, {}, {} }]:add(-mean[i]) -- mean subtraction  
    testset.data[{ {}, {i}, {}, {} }]:div(stdv[i]) -- std scaling 
end 

trainset.data = trainset.data:cuda() 
testset.data = testset.data:cuda() 

net = nn.Sequential() 
net:add(nn.SpatialConvolution(3, 6, 5, 5)) -- 3 input image channels, 6 output channels, 5x5 convolution kernel 
net:add(nn.ReLU())      -- non-linearity 
net:add(nn.SpatialMaxPooling(2,2,2,2))  -- A max-pooling operation that looks at 2x2 windows and finds the max. 
net:add(nn.SpatialConvolution(6, 16, 5, 5)) 
net:add(nn.ReLU())      -- non-linearity 
net:add(nn.SpatialMaxPooling(2,2,2,2)) 
net:add(nn.View(16*5*5))     -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5 
net:add(nn.Linear(16*5*5, 120))    -- fully connected layer (matrix multiplication between input and weights) 
net:add(nn.ReLU())      -- non-linearity 
net:add(nn.Linear(120, 84)) 
net:add(nn.ReLU())      -- non-linearity 
net:add(nn.Linear(84, 10))     -- 10 is the number of outputs of the network (in this case, the 10 CIFAR-10 classes)
net:add(nn.LogSoftMax()) 
net = net:cuda() 

criterion = nn.ClassNLLCriterion() 
criterion = criterion:cuda() 



pred = net:forward(trainset.data) 
outputEr = criterion:forward(pred, trainset.label:cuda()) 
net:zeroGradParameters() 
outputGrad = criterion:backward(pred, trainset.label:cuda()) 
collectgarbage() 
inputGrad = net:backward(trainset.data, outputGrad) 
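The last lines above push the whole training set through the network in one call, so every intermediate activation is held on the GPU at the same time. Whether that is what exhausts the 2 GB here is not established; the following is only a sketch of how the sample range could be split into mini-batches. The helper `batchRanges`, the batch size and the `labels` variable in the comments are hypothetical and not part of the original code.

```lua
-- Hypothetical helper: split n samples into consecutive (first, count)
-- ranges of at most batchSize samples each.
local function batchRanges(n, batchSize)
    local ranges = {}
    for first = 1, n, batchSize do
        local count = math.min(batchSize, n - first + 1)
        ranges[#ranges + 1] = {first = first, count = count}
    end
    return ranges
end

-- With Torch loaded, each range could select a slice without copying, e.g.
--   local x = trainset.data:narrow(1, r.first, r.count)
--   local y = labels:narrow(1, r.first, r.count)  -- labels: assumed CUDA copy of trainset.label
-- followed by the same forward/backward calls as above on x and y.
for _, r in ipairs(batchRanges(10000, 4096)) do
    print(r.first, r.count)
end
```

Because `narrow` returns a view on the existing storage, slicing this way should not duplicate the data tensor; only the per-batch activations would be allocated.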

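Side question: why does Torch initialize network parameters as double, even though GPUs are quite slow at double-precision arithmetic and almost no neural network application actually needs 64-bit parameter values? How can I initialize the model with float (32-bit) parameter vectors?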

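I found the answer to the side question. You can easily make float Torch's default data type by putting the following at the beginning of your code: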

torch.setdefaulttensortype('torch.FloatTensor') 
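A rough calculation shows what the element type means for memory. The sketch below is plain Lua and does not need Torch; `tensorBytes` is a hypothetical helper, and the 10000 x 3 x 32 x 32 shape is an assumption about the training tensor rather than something stated above.

```lua
-- Hypothetical helper: bytes needed for a dense tensor with the given
-- dimensions and element size (8 for double, 4 for float).
local function tensorBytes(dims, bytesPerElement)
    local count = 1
    for _, d in ipairs(dims) do
        count = count * d
    end
    return count * bytesPerElement
end

-- Assumed shape of the training images: N x channels x height x width.
local shape = {10000, 3, 32, 32}

local asDouble = tensorBytes(shape, 8)
local asFloat  = tensorBytes(shape, 4)

print(string.format('double: %.1f MB', asDouble / 2^20))
print(string.format('float:  %.1f MB', asFloat / 2^20))
```

These figures concern tensors held in host RAM. As far as I know, `:cuda()` produces a single-precision `torch.CudaTensor` regardless of the source type, so the saving from float would mainly show up on the CPU side.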

Answer

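I was able to (almost) solve the problem by upgrading from CUDA 6.5 to CUDA 7.5 on the machine where I ran the experiments above. Now, most of the time, GPU memory is released when the program crashes while running. Occasionally, though, it is still not released, and I have to restart the machine.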

In addition, I do the following to make sure the program clears GPU memory when it finishes successfully:

net = nil 
trainset = nil 
testset = nil 
pred = nil 
inputGrad = nil 
criterion = nil 

collectgarbage()
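To check whether this cleanup actually returns memory, the free GPU memory can be printed before and after it. The sketch below assumes `cutorch` is installed and uses its `getMemoryUsage` call; the helpers `gpuMemoryMB` and `report` are hypothetical names. It falls back to a message when `cutorch` cannot be loaded.

```lua
-- Returns free and total memory (in MB) of the current GPU, or nil when
-- cutorch cannot be loaded (for example on a machine without CUDA).
local function gpuMemoryMB()
    local ok, cutorch = pcall(require, 'cutorch')
    if not ok then
        return nil
    end
    local freeBytes, totalBytes = cutorch.getMemoryUsage(cutorch.getDevice())
    return freeBytes / 2^20, totalBytes / 2^20
end

-- Prints one line describing the current GPU memory state.
local function report(label)
    local free, total = gpuMemoryMB()
    if free then
        print(string.format('%s: %.0f MB free of %.0f MB', label, free, total))
    else
        print(label .. ': cutorch not available')
    end
end

report('before cleanup')
-- ... set net, trainset, testset, pred, inputGrad and criterion to nil here ...
collectgarbage()
collectgarbage()
report('after cleanup')
```

Any other global that still references a CUDA tensor (for instance `outputGrad` in the question's code) would keep that tensor alive, so it may be worth clearing those as well before collecting.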