2014-04-16 81 views

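How do I convert sparse matrix data to WEKA's ARFF file format?

I have a sparse-matrix .txt file that contains many numbers, most of which are 0. Here is a sample from the file: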

0 0 0.271178 0 0 0 0.538776 0 -0.631228 0 0 -0.501485 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.438643 0 0 0 0 0 0.287754 0 0 0 0 0 0 0 0 0 0 0 0 -0.498918 0 0 0 0 0 0.475561 0 0 0 0 0 0 0 0 0 0 0 0.370479 0 -0.300765 0 -1.10987 0 0 0.163637 0 0 0 0 0 0 0.304006 0 0.181697 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.261758 0 0 0 0.153415 0.17412 0 -0.129725 0 0.17598 0 0 0 0 0 0 0 0.56053 0 0 -0.211302 0 0 0 0 0 0 0 0.213277 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0698458 0 0 0 0.661972 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.376654 0 0 -0.604655 0 0 0 0 0 0.301025 0 0 -0.431324 0 0 -0.139445 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.404718 0.610013 -0.286108 0 0.46045 0 0 0 0 0 0 -0.193543 0 0 0 0 0 0 0 0.315063 0 0 -0.285598 0 0 0.206607 0

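What I need is to go through this file and convert the data into the file format that WEKA recognizes, .arff, so that I can run data mining algorithms on the data set. Does anyone know how to do this? I found some questions on StackOverflow similar to mine, but they use scripts written in AWK. I have never used .awk programs before, so I could not get them to run properly. Here are the links to those posts: 1, 2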


What should the output file look like?

Answers


I took a look at the ARFF file format and at the questions you linked.

You can embed an awk program inside a shell script quite easily, so you could do this:

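#!/bin/sh
file=$1
awk '
    BEGIN {print "@RELATION something\n"}
    {
        data = sep = ""
        # Keep only the non-zero fields; the field number goes into the attribute name.
        for (i=1; i<=NF; i++) {
            if ($i != 0) {
                printf "@ATTRIBUTE a_%d NUMERIC\n", i
                data = data sep $i
                sep = ","
            }
        }
        # The attribute list and @DATA section are printed once per input line.
        print ""
        print "@DATA"
        print data
    }
' "$file"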

Running the script with bash script.sh data_file gives:

@RELATION something 

@ATTRIBUTE a_3 NUMERIC 
@ATTRIBUTE a_7 NUMERIC 
@ATTRIBUTE a_9 NUMERIC 
@ATTRIBUTE a_12 NUMERIC 
@ATTRIBUTE a_33 NUMERIC 
@ATTRIBUTE a_39 NUMERIC 
@ATTRIBUTE a_52 NUMERIC 
@ATTRIBUTE a_58 NUMERIC 
@ATTRIBUTE a_70 NUMERIC 
@ATTRIBUTE a_72 NUMERIC 
@ATTRIBUTE a_74 NUMERIC 
@ATTRIBUTE a_77 NUMERIC 
@ATTRIBUTE a_84 NUMERIC 
@ATTRIBUTE a_86 NUMERIC 
@ATTRIBUTE a_124 NUMERIC 
@ATTRIBUTE a_128 NUMERIC 
@ATTRIBUTE a_129 NUMERIC 
@ATTRIBUTE a_131 NUMERIC 
@ATTRIBUTE a_133 NUMERIC 
@ATTRIBUTE a_141 NUMERIC 
@ATTRIBUTE a_144 NUMERIC 
@ATTRIBUTE a_152 NUMERIC 
@ATTRIBUTE a_185 NUMERIC 
@ATTRIBUTE a_189 NUMERIC 
@ATTRIBUTE a_207 NUMERIC 
@ATTRIBUTE a_210 NUMERIC 
@ATTRIBUTE a_216 NUMERIC 
@ATTRIBUTE a_219 NUMERIC 
@ATTRIBUTE a_222 NUMERIC 
@ATTRIBUTE a_238 NUMERIC 
@ATTRIBUTE a_239 NUMERIC 
@ATTRIBUTE a_240 NUMERIC 
@ATTRIBUTE a_242 NUMERIC 
@ATTRIBUTE a_249 NUMERIC 
@ATTRIBUTE a_257 NUMERIC 
@ATTRIBUTE a_260 NUMERIC 
@ATTRIBUTE a_263 NUMERIC 

@DATA 
0.271178,0.538776,-0.631228,-0.501485,0.438643,0.287754,-0.498918,0.475561,0.370479,-0.300765,-1.10987,0.163637,0.304006,0.181697,-0.261758,0.153415,0.17412,-0.129725,0.17598,0.56053,-0.211302,0.213277,0.0698458,0.661972,0.376654,-0.604655,0.301025,-0.431324,-0.139445,0.404718,0.610013,-0.286108,0.46045,-0.193543,0.315063,-0.285598,0.206607 

But you will have to let us know whether this is the output you want.
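The script above derives its attribute list from the non-zero fields of a single row. For a file with several rows, one option is to declare every column once and write each row as comma-separated values (dense ARFF). The following is only a sketch under assumptions: every row has the same number of columns, all values are numeric, and the function name to_dense_arff, the relation name "something" and the a_N attribute names are placeholders.

```shell
#!/bin/sh
# Sketch: whitespace-separated numeric rows -> dense ARFF.
# Assumes all rows have the same column count; names are placeholders.
to_dense_arff() {
    awk '
        NR == 1 {
            # Declare one numeric attribute per column, based on the first row.
            print "@RELATION something"
            print ""
            for (i = 1; i <= NF; i++)
                printf "@ATTRIBUTE a_%d NUMERIC\n", i
            print ""
            print "@DATA"
        }
        NF {
            # Join the fields of each non-empty row with commas.
            row = $1
            for (i = 2; i <= NF; i++)
                row = row "," $i
            print row
        }
    ' "$1"
}

# Run the conversion only when a file name is given on the command line.
if [ "$#" -gt 0 ]; then
    to_dense_arff "$1"
fi
```

Saved as, say, dense.sh, it could be run as sh dense.sh data_file > data.arff. Zeros are kept in the data section, so each value stays in its original column.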


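+1 for going the extra 26 miles to investigate the question! I was just going to tell the OP to abbreviate their input and post the expected output!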


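Great script, Glenn, but it leaves the indices out of the data and puts them in the attribute names instead. The output should look like this, for example: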

@attribute a_0 numeric 
@attribute a_1 numeric 
@attribute a_2 numeric 
@attribute a_3 numeric 
... 

@data 
{2 0.271178,...}

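This is the sparse ARFF format, which is needed to avoid losing the attribute position information. My suggestion is to produce the dense format (just CSV) and then apply the WEKA filter NonSparseToSparse.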
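As an alternative to the filter, the sparse data lines can be written directly from awk. This is a minimal sketch, not a definitive implementation: it assumes every row has the same number of columns and that all values are numeric, and the function name to_sparse_arff, the relation name "something" and the a_N attribute names are placeholders. Sparse ARFF indices start at 0, so the third column gets index 2.

```shell
#!/bin/sh
# Sketch: whitespace-separated numeric rows -> sparse ARFF.
# Assumes all rows have the same column count; names are placeholders.
to_sparse_arff() {
    awk '
        NR == 1 {
            # Declare every column once, named with its 0-based index.
            print "@RELATION something"
            print ""
            for (i = 1; i <= NF; i++)
                printf "@ATTRIBUTE a_%d NUMERIC\n", i - 1
            print ""
            print "@DATA"
        }
        NF {
            # Emit "index value" pairs for the non-zero fields only.
            row = sep = ""
            for (i = 1; i <= NF; i++) {
                if ($i != 0) {
                    idx = i - 1
                    row = row sep idx " " $i
                    sep = ","
                }
            }
            print "{" row "}"
        }
    ' "$1"
}

# Run the conversion only when a file name is given on the command line.
if [ "$#" -gt 0 ]; then
    to_sparse_arff "$1"
fi
```

Saved as, say, sparse.sh, it could be run as sh sparse.sh data_file > data.arff. All attributes are declared even when a column is zero in every row, which keeps each value tied to its original position.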
