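You can use a Map, which gives you an efficient dictionary-based structure.
For each word, store a vector holding the number of times it occurs in each string: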
    A = {'life is wonderful', 'matlab makes your dreams come true'};
    B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

    mapA = containers.Map();
    sizeA = size(A,2);
    for i = 1:sizeA % for each string
        a = regexpi(A(i),'\w+','match');
        for w = a{:} % for each word extracted
            str = cell2mat(w);
            if(mapA.isKey(str)) % if word already indexed
                occ = mapA(str);
            else % new key
                occ = zeros(1,sizeA);
            end
            occ(i) = occ(i)+1;
            mapA(str) = occ;
        end
    end

    % same for B
    mapB = containers.Map();
    sizeB = size(B,2);
    for i = 1:sizeB
        a = regexpi(B(i),'\w+','match');
        for w = a{:}
            str = cell2mat(w);
            if(mapB.isKey(str))
                occ = mapB(str);
            else
                occ = zeros(1,sizeB);
            end
            occ(i) = occ(i)+1;
            mapB(str) = occ;
        end
    end
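The two loops above differ only in their input, so they could be folded into one helper. Here is a minimal sketch, not part of the original answer: `indexWords` is a hypothetical name, the sample strings are shortened, and a local function at the end of a script assumes R2016b or later.

```matlab
A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'what a wonderful world', 'rock and roll baby'};

mapA = indexWords(A);
mapB = indexWords(B);
occWonderfulA = mapA('wonderful');   % occurrence vector of 'wonderful' across A
disp(occWonderfulA)

% indexWords (hypothetical helper): maps each word to a 1-by-n vector
% counting how often it occurs in each of the n input strings.
function m = indexWords(strs)
    m = containers.Map();                          % char keys, values of any type
    n = numel(strs);
    for i = 1:n
        words = regexp(strs{i}, '\w+', 'match');   % cell array of the words in string i
        for k = 1:numel(words)
            key = words{k};
            if isKey(m, key)
                occ = m(key);                      % word already indexed
            else
                occ = zeros(1, n);                 % first time this word is seen
            end
            occ(i) = occ(i) + 1;
            m(key) = occ;
        end
    end
end
```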
Then, for each unique word found in A, compute its matches with B:
    match = zeros(size(A,2),size(B,2));
    for w = mapA.keys
        str = cell2mat(w);
        if (mapB.isKey(str))
            match = match + diag(mapA(str))*ones(size(match))*diag(mapB(str));
        end
    end
Result:

    match =
         2     1     0     0
         1     0     1     0
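The update term `diag(a)*ones(...)*diag(b)` used in the loop is just the outer product of the two occurrence vectors, so it could arguably be written more directly as `a' * b`. A small check under that reading, using the occurrence vectors of 'wonderful' from the example (the variable names here are only illustrative):

```matlab
a = [1 0];        % occurrences of 'wonderful' in the two strings of A
b = [1 1 0 0];    % occurrences of 'wonderful' in the four strings of B

viaDiag  = diag(a) * ones(numel(a), numel(b)) * diag(b);   % form used in the loop
viaOuter = a' * b;                                          % plain outer product

sameResult = isequal(viaDiag, viaOuter);
disp(sameResult)
```

Entry (i,j) of either matrix is `a(i)*b(j)`, the number of times the word pairs up between string i of A and string j of B.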
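This way the complexity is #wordsA + #wordsB + #singleWordsA instead of #wordsA * #wordsB.

EDIT: Alternatively, if you don't like Map, you can store the occurrence vectors in a vector of words kept in alphabetical order. You can then look for matches by walking through both vectors at the same time (assuming we use a struct array in which the w field is the word string and occ is the occurrence vector):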
    i = 1; j = 1;
    while (i <= size(wordsA,2) && j <= size(wordsB,2))
        if (strcmp(wordsA(i).w, wordsB(j).w))
            % update match, then move past the word in both vectors
            i = i+1;
            j = j+1;
        else
            % before: fancy function returning 1 if the first argument comes
            % (alphabetically) before the second one (no builtin function comes to my mind)
            if (before(wordsA(i).w, wordsB(j).w))
                i = i+1;
            else
                j = j+1;
            end
        end
    end
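This way, if you are looking for 'matlab' and you know that 'life' is stored at the 10th position, there is no point in checking the positions before it, because the vector is in alphabetical order. So we have #wordsA + #wordsB iterations, versus #wordsA * #wordsB for the nested-loop solution.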
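The simultaneous walk described above can be sketched end to end as follows. This is only an illustration under some assumptions: the word lists are abbreviated, hand-sorted versions of the example data, and `before` is implemented with `issorted` on a two-element cell array, which is one possible choice rather than the author's.

```matlab
% Abbreviated, alphabetically sorted word lists (w = word, occ = counts per string)
wordsA = struct('w',   {'is', 'life', 'makes', 'matlab', 'wonderful', 'your'}, ...
                'occ', {[1 0], [1 0], [0 1], [0 1], [1 0], [0 1]});
wordsB = struct('w',   {'a', 'life', 'makes', 'matlab', 'rock', 'wonderful', 'world'}, ...
                'occ', {[0 1 0 0], [1 0 0 0], [0 0 1 0], [1 0 0 0], ...
                        [0 0 0 1], [1 1 0 0], [0 1 0 0]});

% Assumed helper: true if s sorts strictly before t
before = @(s, t) ~strcmp(s, t) && issorted({s, t});

match = zeros(numel(wordsA(1).occ), numel(wordsB(1).occ));
i = 1; j = 1;
while i <= numel(wordsA) && j <= numel(wordsB)
    if strcmp(wordsA(i).w, wordsB(j).w)
        % shared word: add the outer product of its occurrence vectors
        match = match + wordsA(i).occ' * wordsB(j).occ;
        i = i + 1;
        j = j + 1;
    elseif before(wordsA(i).w, wordsB(j).w)
        i = i + 1;    % word of A cannot appear later in B
    else
        j = j + 1;    % word of B cannot appear later in A
    end
end
disp(match)
```

Each iteration advances at least one index, so the loop runs at most #wordsA + #wordsB times. Both lists must be sorted with the same ordering that `before` uses, otherwise matches can be skipped.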
Do you have any objection to @Amro's solution? – Jonas
It seems to clock roughly the same, even on larger datasets – Batsu