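Efficient way to compute a row-wise correlation/covariance matrix

I have about 3000 files. Each file has about 55000 rows (identifiers) and about 100 columns. For each file I need to compute the row-wise correlation or a weighted covariance (which one depends on the number of columns in the file). The number of rows is the same in all files. What is the most efficient way to compute the correlation matrix for each file? I have already tried Perl and C++, but processing a single file takes far too long: Perl needs 6 days and C++ needs more than a day. Ideally I would not want to spend more than 15-20 minutes per file.
Now I would like to know whether there are any tricks I could use to process the files faster. Here is my pseudocode: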
while (reading from the file handle)
    read the file line by line
    store the column values in hash1, keyed by the identifier
    store the mean and ssxx (sum of squared deviations of x from its mean) in hash2 and hash3 respectively, by calling the mean and ssxx functions (I used a hash of hashes in Perl)
end
close the file handle
for each pair of identifiers in the hash (a nested for loop, since I need the values of 2 different identifiers to calculate a correlation coefficient)
    calculate ssxy by calling the ssxy function, i.e. the sum of products of the deviations of x and y from their means
    calculate the correlation coefficient
end
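Note that I compute the correlation coefficient for each pair only once, and I do not compute the correlation of an identifier with itself; my nested for loop already takes care of both. Do you think there is a way to compute the correlation coefficients faster? Any hints or suggestions would be great. Thanks!
EDIT1: My input file looks like this (first 10 identifiers):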
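A minimal sketch of one possible speed-up, in C++ and under stated assumptions: if every row is centered on its mean and scaled to unit length once, the Pearson coefficient of any pair reduces to a single dot product, so the means and ssxx values never have to be touched inside the pair loop. The function names `standardize` and `correlation` are hypothetical and not taken from the code in this question, and each row is assumed to fit in a `std::vector<double>`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: center a row on its mean and scale it to unit length.
// Returns false for an empty or constant row, whose correlation is undefined.
bool standardize(std::vector<double>& row) {
    if (row.empty()) return false;

    double mean = 0.0;
    for (double v : row) mean += v;
    mean /= static_cast<double>(row.size());

    // Center in place and accumulate the sum of squared deviations (ssxx).
    double ssxx = 0.0;
    for (double& v : row) {
        v -= mean;
        ssxx += v * v;
    }
    if (ssxx == 0.0) return false;

    const double inv_norm = 1.0 / std::sqrt(ssxx);
    for (double& v : row) v *= inv_norm;
    return true;
}

// Pearson correlation of two rows that were both passed through standardize():
// for mean-centered, unit-length rows it is simply their dot product.
double correlation(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
    return dot;
}
```

With 55000 rows there are still roughly 1.5 billion distinct pairs, so this only trims the work done per pair; it does not reduce the number of pairs.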
"Ident_01" 6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02" 449.803 367.757 302.633 318.037 331.55
"Ident_03" 16.4878 198.937 220.376 91.352 237.983
"Ident_04" 26.4878 398.937 130.376 92.352 177.983
"Ident_05" 36.4878 298.937 430.376 93.352 167.983
"Ident_06" 46.4878 498.937 560.376 94.352 157.983
"Ident_07" 56.4878 598.937 700.376 95.352 147.983
"Ident_08" 66.4878 698.937 990.376 96.352 137.983
"Ident_09" 76.4878 798.937 120.376 97.352 117.983
"Ident_10" 86.4878 898.937 450.376 98.352 127.983
EDIT2: Here are the subroutines (functions) I wrote in Perl:
use strict;
use warnings;
use List::Util qw(sum);

## Pearson correlation coefficient of two records; each record is a hash ref
## holding the raw values (array ref under "string"), their mean and their ssxx
sub correlation {
    my ($arr1, $arr2) = @_;
    my $ssxy = ssxy($arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean});
    my $cor  = $ssxy / sqrt($arr1->{ssxx} * $arr2->{ssxx});
    return $cor;
}

## Mean
sub mean {
    my $arr1 = shift;
    my $mu_x = sum(@$arr1) / scalar(@$arr1);
    return $mu_x;
}

## Sum of squared deviations of x from its mean, i.e. ssxx
sub ssxx {
    my ($arr1, $mean_x) = @_;
    my $ssxx = 0;
    ## looping over all the samples
    for (my $i = 0; $i < @$arr1; $i++) {
        $ssxx += ($arr1->[$i] - $mean_x) ** 2;
    }
    return $ssxx;
}

## Sum of products of the deviations of x and y from their means, i.e. ssxy
sub ssxy {
    my ($arr1, $arr2, $mean_x, $mean_y) = @_;
    my $ssxy = 0;
    ## looping over all the samples
    for (my $i = 0; $i < @$arr1; $i++) {
        $ssxy += ($arr1->[$i] - $mean_x) * ($arr2->[$i] - $mean_y);
    }
    return $ssxy;
}
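For reference, the subroutines above appear to correspond to the usual Pearson formula. Written out, with n the number of columns (samples) per row and x, y two rows (the symbols are notation chosen here, not names from the code):

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
r_{xy} = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}
```

In `correlation`, the value stored under `ssxx` in the second record plays the role of SS_yy.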
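Could you provide an excerpt of a typical input file? – MBo 2014-09-28 05:44:44
I have added the first 10 lines of the file. – snape 2014-09-28 06:34:18
Performance issues aside, your calculation may be incorrect. – 2014-09-28 12:43:40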