Faster grpstats on a large dataset

I have a large dataset in MATLAB (1924014 by 5; ~73.4 MB):

Date   id   a   b   c 
... 
733234  1467   1.2656  1.2718  51.16  
733235  1467   1.2732  1.2794  51.16  
733236  1467   1.2781  1.2844  51.5  
733236  1467   1.26   NaN  NaN  
733237  1467   1.3084   NaN  NaN  
733237  1467   1.3205   NaN  NaN  
733238  1467   1.3125  1.3188  53.85  
733238  1467    1.3   NaN  NaN  
...

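Date is a date in datenum form.
I need to average (ignoring NaNs) the last three columns over each unique Date + id pair, because a given Date + id pair sometimes has more than one row.

The output I want is: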

Date   id   mean_a  mean_b  mean_c 
... 
733234  1467   1.2656  1.2718  51.16  
733235  1467   1.2732  1.2794  51.16  
733236  1467   1.2691  1.2844  51.5  
733237  1467   1.3144   NaN  NaN  
733238  1467   1.3062  1.3188  53.85  
...

I was hoping to be able to use

grpstats(myDataset, {'Date', 'id'}, 'mean')

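but it is far too slow. I expect this task to be doable in under 60 seconds. I think grpstats is adding a GroupCount column and a name for every observation, neither of which I need.

How can I do this quickly? I'm open to any answer, whether or not it uses grpstats.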

Source

2013-06-19 Prashant Kumar

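Group by date and id with unique(...,'rows'), then build the accumulation subs for the multiple columns with meshgrid() (or explicitly with repmat()), and finally apply @nanmean with accumarray():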

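% Group rows by unique Date + id pair (assumes db is a numeric matrix)
[un,~,pos] = unique(db(:,1:2),'rows');

% Build row (group index) and column (1..3) subs for every value
[col,row] = meshgrid(1:3,pos);

% Accumulate columns 3:5 per group, ignoring NaNs, and prepend the keys
[un accumarray([row(:), col(:)], reshape(db(:,3:5),[],1),[],@nanmean)]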
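
To see what the subs look like in practice, here is a minimal sketch that runs the same steps on the eight sample rows from the question. The variable names `meanIgnoringNaN`, `vals` and `result` are illustrative and not part of the answer; the anonymous function is a stand-in for `nanmean` so that the snippet runs without any toolbox.

```matlab
% Sample rows from the question: [Date, id, a, b, c]
db = [733234 1467 1.2656 1.2718 51.16
      733235 1467 1.2732 1.2794 51.16
      733236 1467 1.2781 1.2844 51.5
      733236 1467 1.26   NaN    NaN
      733237 1467 1.3084 NaN    NaN
      733237 1467 1.3205 NaN    NaN
      733238 1467 1.3125 1.3188 53.85
      733238 1467 1.3    NaN    NaN];

% NaN-ignoring mean (hypothetical stand-in for nanmean);
% returns NaN when every value in the group is NaN
meanIgnoringNaN = @(x) mean(x(~isnan(x)));

[un,~,pos] = unique(db(:,1:2),'rows');  % un: one row per Date + id pair
[col,row]  = meshgrid(1:3,pos);         % row: group index, col: 1..3
vals       = reshape(db(:,3:5),[],1);   % column-major, same order as row(:), col(:)

result = [un accumarray([row(:), col(:)], vals, [], meanIgnoringNaN)];
disp(result)
```

The key point is that `row(:)`, `col(:)` and `vals` are all unrolled in the same column-major order, so each value lands in the cell for its own group and its own column.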

Source

2013-06-19 16:42:00 Oleg

Very promising! Under 30 seconds on my machine. I really need to learn how to use meshgrid/reshape. Checking the output now...

Time the meshgrid call; if it takes long enough, say 1/3 of the total time, I'll post the repmat approach to creating the subs. – Oleg
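
The repmat variant is mentioned but not shown in the thread. A minimal sketch of what it could look like, assuming the same `pos` vector returned by `unique`; the names `nCols`, `n`, `rowM` and `colM` are illustrative, and the four-row `db` is a subset of the sample data:

```matlab
% Four sample rows from the question: [Date, id, a, b, c]
db = [733236 1467 1.2781 1.2844 51.5
      733236 1467 1.26   NaN    NaN
      733237 1467 1.3084 NaN    NaN
      733238 1467 1.3125 1.3188 53.85];

[un,~,pos] = unique(db(:,1:2),'rows');
nCols = 3;              % number of value columns (a, b, c)
n     = numel(pos);     % number of input rows

% Row subs: the group index vector stacked once per value column
row = repmat(pos(:), nCols, 1);

% Column subs: n ones, then n twos, then n threes
col = reshape(repmat(1:nCols, n, 1), [], 1);

% Sanity check against the meshgrid construction
[colM,rowM] = meshgrid(1:nCols,pos);
sameSubs = isequal([row col], [rowM(:) colM(:)]);
disp(sameSubs)
```

Both constructions yield identical subs, so either can be passed to accumarray; whether repmat is measurably faster than meshgrid here would have to be timed on the real data.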

The output looks good to me! This is fast enough for my purposes. As for timing, 99% of the time is spent in the accumulation.
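
Since the accumulation dominates the run time, one option that may be worth trying is to drop the function handle and compute the NaN-ignoring mean as a sum divided by a count, using accumarray's default summation. This is a different technique from the one in the answer, not something proposed in the thread, and no timings were measured for it; `valid`, `sums`, `counts` and `result` are illustrative names.

```matlab
% Sample rows from the question: [Date, id, a, b, c]
db = [733234 1467 1.2656 1.2718 51.16
      733235 1467 1.2732 1.2794 51.16
      733236 1467 1.2781 1.2844 51.5
      733236 1467 1.26   NaN    NaN
      733237 1467 1.3084 NaN    NaN
      733237 1467 1.3205 NaN    NaN
      733238 1467 1.3125 1.3188 53.85
      733238 1467 1.3    NaN    NaN];

[un,~,pos] = unique(db(:,1:2),'rows');
[col,row]  = meshgrid(1:3,pos);
subs = [row(:), col(:)];
vals = reshape(db(:,3:5),[],1);

% Keep only non-NaN values, then sum and count them per (group, column) cell
valid  = ~isnan(vals);
sz     = [size(un,1) 3];
sums   = accumarray(subs(valid,:), vals(valid), sz);
counts = accumarray(subs(valid,:), 1, sz);

% Cells with no valid values give 0/0, which is NaN, matching nanmean
result = [un sums./counts];
disp(result)
```

Passing the output size explicitly keeps cells with no valid values in the result, so groups whose b or c entries are all NaN still appear with NaN means.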
