Octave is very slow when building a word matrix

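I have a vocabulary (a vector of strings) and a file full of sentences. I want to construct a matrix that shows how often each sentence contains each word. My current implementation is very slow, and I believe it can be made much faster. A single sentence of about ten words takes several minutes.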

Can you explain why this is so slow and how to speed it up?

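Note: I use a sparse matrix because a full matrix would not fit in memory. The vocabulary contains about 10,000 words. Running the program does not exhaust my working memory, so that cannot be the problem.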

Here is the relevant code. Variables not mentioned before, such as totalLineCount, vocab and vocabCount, are initialized elsewhere.

% initiate sentence structure 
wordSentenceMatrix = sparse(vocabCount, totalLineCount); 
% fill the sentence structure 
fid = fopen(fileLocation, 'r'); 
lineCount = 0; 
while ~feof(fid), 
    line = fgetl(fid); 
    lineCount = lineCount + 1; 
    line = strsplit(line, " "); 
    % go through each word and increase the corresponding value in the matrix 
    for j=1:size(line,2), 
     for k=1:vocabCount, 
      w1 = line(j); 
      w2 = vocab(k); 
      if strcmp(w1, w2), 
       wordSentenceMatrix(k, lineCount) = wordSentenceMatrix(k, lineCount) + 1; 
      end; 
     end; 
    end; 
end;

Source

2013-06-26 Florian Dietz

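A sparse matrix is actually stored in memory as three arrays. In simplified terms, you can think of its storage as an array of row indices, an array of column indices, and an array of the nonzero entry values. (The slightly more complicated full story is called compressed sparse column.)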
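The triplet view described above can be made visible with a small sketch. The matrix size and values here are made up purely for illustration; `sparse` and `find` are the standard built-ins.

```octave
% Build a small 3x4 sparse matrix from explicit (row, column, value) triplets.
rows = [1; 3; 2];
cols = [1; 1; 4];
vals = [5; 7; 9];
S = sparse(rows, cols, vals, 3, 4);

% find returns the stored entries as triplets again, ordered column by column.
[r, c, v] = find(S);
disp([r, c, v]);
```

Only the three stored entries come back, which is why changing which entries exist (rather than just their values) forces these arrays to be rebuilt.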

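By growing the sparse matrix element by element, your code repeatedly changes the structure (the sparsity pattern) of that matrix. This is not recommended, because every such change involves a lot of memory copying.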
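As a minimal sketch of the alternative: collect the index pairs first and call `sparse` once. When the same (row, column) pair appears several times, `sparse` sums the values, so a list of word occurrences turns directly into counts. The indices and the 4-by-2 size below are illustrative assumptions.

```octave
% Word 2 occurs twice in sentence 1, word 3 occurs once in sentence 2.
rowIdx = [2; 2; 3];
colIdx = [1; 1; 2];

% One call builds the whole matrix; duplicate pairs are accumulated.
M = sparse(rowIdx, colIdx, 1, 4, 2);   % 4 vocabulary words, 2 sentences
disp(full(M));
```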

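The way you look up a word's index in the vocabulary is also slow, because for every word in a sentence you scan the entire vocabulary. A better approach is to use a Java HashMap in Matlab.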
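If Java is not available, the same idea can be sketched with `containers.Map`, which exists in Matlab and in recent Octave versions. This is a swapped-in alternative to the HashMap, not the approach used in the code of this answer, and the three-word vocabulary is a made-up example.

```octave
vocab = {'cat', 'dog', 'fish'};   % hypothetical vocabulary

% Map each word to its row index.
vocabMap = containers.Map(vocab, num2cell(1:numel(vocab)));

% A lookup no longer loops over the whole vocabulary.
word = 'dog';
if isKey(vocabMap, word)
    k = vocabMap(word);
else
    k = 0;   % marker for "not in the vocabulary"
end
```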

I modified your code as follows:

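% Collect (row, column) index pairs first and build the sparse matrix once at the end.
rowIdx = [];
colIdx = [];
% Map each vocabulary word to its row index so a lookup does not scan the vocabulary.
vocabHashMap = java.util.HashMap;
for k = 1 : vocabCount
    vocabHashMap.put(vocab{k}, k);
end

fid = fopen(fileLocation, 'r');
lineCount = 0;
while ~feof(fid)
    line = fgetl(fid);
    lineCount = lineCount + 1;
    line = strsplit(line, ' ');
    % look up each word and record one (word, sentence) pair per occurrence
    for j = 1 : length(line)
        idx = vocabHashMap.get(line{j});
        if ~isempty(idx)   % skip words that are not in the vocabulary
            rowIdx = [rowIdx; idx];
            colIdx = [colIdx; lineCount];
        end
    end
end
fclose(fid);
assert(length(rowIdx) == length(colIdx));
nonzeros = length(rowIdx);
% Repeated (row, column) pairs are summed, which yields the word counts.
wordSentenceMatrix = sparse(rowIdx, colIdx, ones(nonzeros, 1), vocabCount, lineCount);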

Of course, if you know the size of your text collection a priori, you should preallocate the memory for rowIdx and colIdx:

rowIdx = zeros(nonzeros, 1); 
colIdx = zeros(nonzeros, 1);
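With preallocated vectors, the loop writes into the next free slot instead of appending. A minimal sketch, assuming an upper bound on the total number of words is known; `maxEntries`, `n` and the `pairs` stand-in data are hypothetical names introduced only for this illustration.

```octave
maxEntries = 6;                  % assumed upper bound on the total word count
rowIdx = zeros(maxEntries, 1);
colIdx = zeros(maxEntries, 1);
n = 0;                           % number of slots filled so far

pairs = [2 1; 2 1; 3 2];         % stand-in for (word index, sentence index) pairs found while reading
for p = 1 : size(pairs, 1)
    n = n + 1;
    rowIdx(n) = pairs(p, 1);     % write in place, no reallocation
    colIdx(n) = pairs(p, 2);
end

% Drop the unused tail before building the matrix.
rowIdx = rowIdx(1:n);
colIdx = colIdx(1:n);
M = sparse(rowIdx, colIdx, 1, 4, 2);
```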

Please port it to Octave if you can.
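One possible Octave-only variant, sketched here under assumptions, replaces the Java HashMap with `ismember`, which looks up all words of a sentence against the vocabulary in a single call. It is a different technique from the HashMap above and may be slower for very large vocabularies. The vocabulary and the `sentences` cell array are made-up stand-ins for the data read from the file.

```octave
vocab = {'cat', 'dog', 'fish'};              % hypothetical vocabulary
sentences = {'cat dog cat', 'fish bird'};    % stand-in for the lines of the file

rowIdx = [];
colIdx = [];
for s = 1 : numel(sentences)
    words = strsplit(sentences{s}, ' ');
    % loc holds the vocabulary index of each word, found marks the known words
    [found, loc] = ismember(words, vocab);
    loc = loc(found);                        % ignore words outside the vocabulary
    rowIdx = [rowIdx; loc(:)];
    colIdx = [colIdx; repmat(s, numel(loc), 1)];
end

% Duplicate (word, sentence) pairs are summed into counts.
wordSentenceMatrix = sparse(rowIdx, colIdx, 1, numel(vocab), numel(sentences));
disp(full(wordSentenceMatrix));
```

Here 'cat' is counted twice in the first sentence and 'bird' is dropped because it is not in the vocabulary.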

Source

2013-06-26 18:49:07

Thanks, this works perfectly.
