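Searching a huge array faster in MATLAB

I am currently implementing an algorithm in MATLAB that searches through a database of customers who bought certain articles. The database looks like this: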
[ 0 1 2 3 4 5 NaN NaN;
4 6 7 8 NaN NaN NaN NaN;
...]
except that the real data has size(data) = [90810 30]. I now want to find the frequent itemsets in that database (without relying much on toolboxes). Here is a toy example:
toyset = [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9;
5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
1, 6, 7, 9, 10, 11,NaN,NaN,NaN,NaN;
2, 4, 8, 11, 12,NaN,NaN,NaN,NaN,NaN];
Imposing a minimum support of 0.5 on this data yields the following itemsets [support = (occurrences_of_set)/(all_sets)]:
frequent_itemsets = [
7,NaN,NaN;
6,NaN,NaN;
5,NaN,NaN;
6, 7,NaN;
5, 7,NaN;
5, 6,NaN;
5, 6, 7];
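To make the support definition concrete, here is a minimal sketch that recomputes the single-item supports for the toy data and checks one larger itemset. It assumes article IDs are non-negative integers and that an article appears at most once per row; the variable names other than `toyset` are illustrative and not part of the original code.

```matlab
% Toy data from the question; NaN pads the shorter rows.
toyset = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9;
    5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
    5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
    1, 6, 7, 9, 10, 11,NaN,NaN,NaN,NaN;
    2, 4, 8, 11, 12,NaN,NaN,NaN,NaN,NaN];
minSupport = 0.5;
numRows = size(toyset, 1);

% Single items: count every article ID in one pass.
items   = toyset(~isnan(toyset));        % all article IDs as one column
counts  = accumarray(items + 1, 1);      % counts(k) = occurrences of article k-1
support = counts / numRows;              % fraction of rows containing each article
frequentItems = find(support >= minSupport)' - 1;   % back to 0-based IDs

% Larger itemsets: a row supports the set when it contains every item of it.
itemset   = [5 6 7];
rowHasSet = arrayfun(@(r) all(ismember(itemset, toyset(r, :))), 1:numRows);
supportOfSet = mean(rowHasSet);
```

With the toy data, `frequentItems` comes out as `[5 6 7]` and `supportOfSet` as 0.6 (three of five rows), which matches the list above.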
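My problem now is determining how often each itemset occurs in the dataset. At the moment I use the following algorithm (which works perfectly, by the way):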
function list = preprocess(subjectArray, combinations, progressBar)
% =========================================================================
%
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
%
% =========================================================================
%
% preprocesses the array; Finds the frequency of articles
% subjectArray - Array that contains customer data
% combinations - The article combinations to be found
% progressBar  - The progress bar to indicate the progress of the
%                algorithm
%
% =========================================================================
[countCustomers,maxSizeCustomers] = size(subjectArray);
[countCombinations,sizeCombinations] = size(combinations);
list = zeros(1,countCombinations);
for i = 1:countCustomers
    waitbar(i/countCustomers,progressBar,sprintf('Preprocess: %.0f/%.0f\nSet size:%.0f',i,countCustomers,sizeCombinations));
    for k = 1:countCombinations
        helpArray = zeros(1,maxSizeCustomers);
        for j = 1:sizeCombinations
            % mark the positions in row i that hold item j of combination k
            helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));
        end
        % Count the row only if every item was matched. Testing any(helpArray)
        % inside the loop would already succeed once the first item is found.
        % Assumes an article occurs at most once per customer row.
        list(k) = list(k) + (sum(helpArray) == sizeCombinations);
    end
end
end
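My only problem is that this takes ages! Literally! Is there any simple way to make it faster (apart from sets of length 1, which I know can be sped up by simple counting)?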
I suspect that this line:
helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));
is the bottleneck? But I am not sure, because I don't know how fast MATLAB performs certain operations.
Thanks for looking into it, mind_
Here is what I ended up doing:
function list = preprocess(subjectArray, combinations)
% =========================================================================
%
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
%
% =========================================================================
%
% preprocesses the array; Finds the frequency of articles
% subjectArray - Array that contains customer data
% combinations - The article combinations to be found
%
% =========================================================================
[countCustomers,maxSizeCustomers] = size(subjectArray);
[countCombinations,sizeCombinations] = size(combinations);
list = zeros(1,countCombinations);
if sizeCombinations == 1
    % Single articles: count occurrences directly. Assumes article IDs are
    % integers starting at 0 and that combinations(k) == k-1.
    for i = 1:countCustomers
        for j = 1:maxSizeCustomers
            x = subjectArray(i,j) + 1;     % 1-based index of the article
            if isnan(x), break; end        % NaN padding marks the end of the row
            list(x) = list(x) + 1;         % x is already shifted; no second +1
        end
    end
else
    for i = 1:countCombinations
        % renamed from "logical", which shadows the built-in function
        matches = zeros(size(subjectArray));
        for j = 1:sizeCombinations
            matches = matches + (subjectArray == combinations(i,j));
        end
        % customers whose row contains all items of combination i
        list(i) = sum(sum(matches,2) == sizeCombinations);
    end
end
end
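The multi-item branch above compares the whole array against one item at a time, so the data is scanned once per item per combination. One possible alternative, shown here only as a sketch and not as the approach used in this post, is to build a customer-by-article incidence matrix once and then test each combination with column indexing. It assumes non-negative integer article IDs and that every article in `combinations` occurs somewhere in the data; `B`, `rows`, and `items` are illustrative names.

```matlab
% Toy data from the question; NaN pads the shorter rows.
toyset = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9;
    5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
    5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
    1, 6, 7, 9, 10, 11,NaN,NaN,NaN,NaN;
    2, 4, 8, 11, 12,NaN,NaN,NaN,NaN,NaN];
combinations = [5 6; 5 7; 6 7];

% Build the incidence matrix once: B(i, a+1) is true when customer i bought article a.
[rows, ~] = find(~isnan(toyset));        % row index of every real entry
items     = toyset(~isnan(toyset));      % matching article IDs (same column-major order)
B = sparse(rows, items + 1, 1, size(toyset, 1), max(items) + 1) > 0;

% Count the customers that bought every article of each combination.
list = zeros(1, size(combinations, 1));
for k = 1:size(combinations, 1)
    list(k) = nnz(all(B(:, combinations(k, :) + 1), 2));
end
```

For the toy data this gives `list = [3 3 4]`. The sparse matrix keeps memory proportional to the number of purchases rather than customers times articles, which may matter for a 90810-row database with many distinct articles.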
Thanks for all the support!
Can you explain conceptually how you get 'frequent_itemsets'? – Oleg
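I use the algorithm above to determine how often each itemset occurs in the data, then remove all itemsets that are not frequent. In this example, that removes 0-4 and 8-12. Then I build all possible combinations of the remaining items and run the algorithm again. In this example: 5,6 5,7 6,7 –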
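The candidate-building step described in this comment can be sketched with `nchoosek`, which lists all k-element combinations of a vector. This follows the comment's description (all combinations of the surviving items); the classic Apriori algorithm instead joins frequent (k-1)-itemsets and prunes candidates with an infrequent subset, which usually yields fewer candidates. The variable names are illustrative.

```matlab
% Items that survived the first pass in the toy example.
frequentItems = [5 6 7];

% Candidates for the next passes: every pair, then every triple.
pairCandidates   = nchoosek(frequentItems, 2);   % one candidate per row
tripleCandidates = nchoosek(frequentItems, 3);
```

Here `pairCandidates` is `[5 6; 5 7; 6 7]`, the same three pairs listed in the comment. Note that the number of candidates grows combinatorially with the number of frequent items, so this is only practical when few items survive each pass.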
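Oh, and one more thing I did was to shrink the data after each pass, so that I don't have to look again and again at customer rows that contain no frequent itemset. –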
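One way this shrinking step might look is sketched below. It is slightly stricter than the comment: it drops every row with fewer than `k` frequent items, since such a row cannot contain any itemset of size `k`. The names `k`, `keepRows`, and `shrunk` are assumptions made for illustration.

```matlab
% Toy data from the question; NaN pads the shorter rows.
toyset = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9;
    5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
    5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
    1, 6, 7, 9, 10, 11,NaN,NaN,NaN,NaN;
    2, 4, 8, 11, 12,NaN,NaN,NaN,NaN,NaN];
frequentItems = [5 6 7];   % survivors of the previous pass
k = 2;                     % size of the itemsets searched in the next pass

isFrequent = ismember(toyset, frequentItems);   % NaN padding never matches
keepRows   = sum(isFrequent, 2) >= k;           % rows that can still hold a k-itemset
shrunk     = toyset(keepRows, :);

% Support must still be divided by the original number of customers.
numCustomers = size(toyset, 1);
```

For the toy data only the last row is dropped. The counts from the shrunken array stay correct, but the support denominator has to remain the original row count, otherwise the support values would be inflated.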