Matlab histc with vector bin

如果我有兩個矩陣A和B，大小爲[mxn]和[pxn]，我想要找出B中每行出現在A中的次數，例如：Matlab histc with vector bin

>> A = rand(5,3) 

A = 

    0.1419 0.6557 0.7577 
    0.4218 0.0357 0.7431 
    0.9157 0.8491 0.3922 
    0.7922 0.9340 0.6555 
    0.9595 0.6787 0.1712 

>> B = [A(2,:); A(1,:); A(2,:); A(3,:); A(3,:); A(4,:); A(5,:)] 

B = 

    0.4218 0.0357 0.7431 
    0.1419 0.6557 0.7577 
    0.4218 0.0357 0.7431 
    0.9157 0.8491 0.3922 
    0.9157 0.8491 0.3922 
    0.7922 0.9340 0.6555 
    0.9595 0.6787 0.1712

在這種情況下，作爲答案

ans = 

    1  2  2  1  1

雖然不像這個例子中，在廣義M >> p

如果A和B是向量MATLAB的histc會做的工作，但有似乎並不等同alent如果垃圾箱是矢量。

目前我做的：

for i=1:length(B) 
    indices(i) = find(abs(A/B(i,:)-1) < 1e-15); 
    % division requires a tolerance due to numerical issues 
end 
histc(indices, 1:size(A,1)) 

ans = 

    1  2  2  1  1

但因爲我有很多這樣的矩陣B，A和B都大，這是可怕的慢。任何想法如何改善呢？

編輯：在方法

展望到目前爲止，我有以下數據：

A     7871139x3    188907336 double      
B      902x3     21648 double

爲了使事情更快，我只是要使用第10排B的

B = B(1:10,:);

請注意，對於完整的應用程序，我（當前）有> 10^4的這種矩陣（這最終將> 10^6 ....）

我的第一種方法：

tic, C = get_vector_index(A,B); toc 
Elapsed time is 36.630107 seconds.

bdecaf的方法

>> tic, C1 = get_vector_index(A,B); toc 
Elapsed time is 28.957243 seconds. 
>> isequal(C, C1) 

ans = 

    1

OLI的pdist2方法（可以通過去除if語句和使用L1距離，而不是L2距離被減小到〜25秒），

>> tic, C2 = get_vector_index(A,B); toc 
Elapsed time is 7.244965 seconds. 

>> isequal(C,C2) 

ans = 

    1

oli的標準化方法

>> tic, C3 = get_vector_index(A,B); toc 
Elapsed time is 3.756682 seconds. 

>> isequal(C,C3) 

ans = 

    1

最後，我想出了另一種方法，我搜索第一列，然後搜索第一列的命中內的第二列，遞歸直到列耗盡。這是迄今爲止最快的....

N = size(A,2); 
loc = zeros(size(B,1),1); 
for i=1:size(B,1) 
    idx{1} = find(A(:,1)==B(i,1)); 
    for k=2:N, 
     idx{k} = idx{k-1}(find(A(idx{k-1},k)==B(i,k))); 
    end 
    loc(i) = idx{end}; 
end 
C = histc(loc, 1:size(A,1));

導致：

>> tic, C4 = get_vector_index(A,B); toc 
Elapsed time is 1.314370 seconds. 

>> isequal(C, C4) 

ans = 

    1

另外請注意，使用intersect慢得多：

>> tic, [~,IA] = intersect(A,B,'rows'); C5 = histc(IA,1:size(A,1)); toc 
Elapsed time is 44.392593 seconds. 

>> isequal(C,C5) 

ans = 

    1

來源

2011-11-30 tdc

順便說一下，我認識到我當前方法中的矩陣分解也是可能的 - 它可能（幾乎）可能得到假的誤報，但它會產生一個錯誤，因爲每個B的向量只能有一個匹配在所有的情況下，我有。 – tdc

只是一個技巧 - 只要B中的所有行都在A中有一行，它就會起作用。另外，如果A中有兩個幾乎相同的行，則會產生問題。行爲也與histc不同。 'histc'分類到*最近* bin中 - 當它剛好在垃圾箱中時，你做出了它。 – bdecaf

也要小心使用長度 - 它將始終返回最大尺寸。所以如果'p bdecaf

我居然把這個解決方案作爲這個問題的編輯，而是接受一個答案的緣故，我會把解決方案在這裏也：

N = size(A,2); 
loc = zeros(size(B,1),1); 
for i=1:size(B,1) 
    idx{1} = find(A(:,1)==B(i,1)); 
    for k=2:N, 
     idx{k} = idx{k-1}(find(A(idx{k-1},k)==B(i,k))); 
    end 
    loc(i) = idx{end}; 
end 
C = histc(loc, 1:size(A,1));

導致：

>> tic, C4 = get_vector_index(A,B); toc 
Elapsed time is 1.314370 seconds. 

>> isequal(C, C4) 

ans = 

    1

來源

2011-12-15 17:47:03 tdc

也許你可以規範他們，這樣你檢查他們的點積爲1

A = rand(5,3); 
B = [A(2,:); A(1,:); A(2,:); A(3,:); A(3,:); A(4,:); A(5,:)]; 
A2=bsxfun(@times,A,1./sqrt(sum(A.^2,2))); %%% normalize A 
B2=bsxfun(@times,B,1./sqrt(sum(B.^2,2))) %%% normalize B 
sum(A2*B2'>1-10e-9,2) %%% check that the dotproduct is close to 1 

ans = 

    1 
    2 
    2 
    1 
    1

如果你需要的東西更快，但近似的，我推薦你使用FLANN庫，它是用於快速近似近鄰：

http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN

來源

2011-11-30 12:14:34 Oli

我很喜歡這種方法，但矩陣乘法不適合內存（例如A是7871139x3和B是903x3） – tdc

我也看過FLANN，但我不認爲我可以安裝它，因爲我沒有sudoers帳戶 - 從源代碼編譯需要cmake（這是沒有安裝，並安裝，只是炸彈爲我，大概是因爲我不是一個sudoer），目前他們的預建庫只是Ubuntu/Debian（服務器是紅帽） – tdc

沒有sudo權限就可以安裝它。 – Oli

我會解決這個問題是這樣的：

indices = zeros(size(A,1),1); 
for i=1:size(B,1) 
    distances = sum((repmat(B(i,:),size(A,1),1)-A).^2 ,2); 
    [md,im]=min(distances); 
    if md < 1e-9 
     indices(im) = indices(im)+1; 
    end 
end

如果您刪除它，只是將它排入最近的垃圾箱。

來源

2011-11-30 12:27:19 bdecaf

稍微比@oli的方法慢 – tdc

確實是一個優雅。 – bdecaf

其實，這樣做的一個簡單的方法是：

sum(10e-9>pdist2(A',B'),2)

它計算所有成對距離和門限和計數。

來源

2011-11-30 12:31:29 Oli

這對我的方法有點改進（快5倍）... – tdc

Matlab histc with vector bin

回答

相關問題