2015-05-18
0

I have a non-trivial task: using bash utilities to extract some relevant text data from a large CSV log that looks like this:

Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL 
1,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688 
1,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392 
1,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336 
1,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488 
1,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472 
... 
2,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688 
2,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392 
2,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336 
2,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488 
2,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472 
... 
n,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688 
n,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392 
n,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336 
n,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488 
n,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472 

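Here I'd like to pick a specified value in the 2nd column (#residue) and write out how its last column (#total energy) evolves as a function of the 1st column (#frame number, i.e. the snapshot number). In other words, I need to 1) sort all the data by the 2nd column, i.e. select every line where that column equals a specified value (e.g. n = 27):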

#Frame, #Residue 

1,27, ... , # last column value which is interested for me! 
2,27, ... , # last column value which is interested for me! 
3,27, ... , # last column value which is interested for me! 
3,27, ... , # last column value which is interested for me! 

2) for that number in the second column, extract the corresponding value of the last column, so the resulting log will have only 3 columns:

#Frame, #Residue, # Total energy 

1,27, # last column value which is interested for me! 
2,27, # last column value which is interested for me! 
3,27, # last column value which is interested for me! 
3,27, # last column value which is interested for me! 

I'd appreciate any implementation using awk and sed!

Thanks!

Gleb

Answers

2

To extract the lines that have 27 in the second column, you can use grep:

grep '^[^,]\+,27,' input.csv 
      | |  |
      | |  repeated one or more times
      | not a comma
      beginning of line

To output only the first, second and eighth columns, use cut:

grep '^[^,]\+,27,' input.csv | cut -d, -f1,2,8
                                    |   |
                                    |   fields to keep
                                    delimiter

To sort the file by the second column, you can use sort:

sort -t, -nk2,2 input.csv 
      |   | |
      |   | sort key: the second field only
      |   numeric comparison
      delimiter
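
The log starts with a header line, and the frames of one residue should stay in order. The following is a minimal sketch, not part of the original answer, that prints the header unchanged and then sorts the data rows by residue and, within a residue, by frame. The sample rows and the function name sort_by_residue are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample; the values are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,TOTAL
2,27,-1.5
1,3,-25.1
1,27,-2.0
2,3,-24.9
EOF

# Print the header as-is, then sort the remaining rows numerically by
# residue (field 2) and, for equal residues, by frame (field 1).
sort_by_residue() {
    head -n 1 "$1"
    tail -n +2 "$1" | sort -t, -k2,2n -k1,1n
}

sort_by_residue "$sample"
```

With this sample, the two rows for residue 3 come first (frame 1, then frame 2), followed by the two rows for residue 27.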
+1

You should add a comma after "27", otherwise it can also match larger numbers such as 270, 271, 271337 ...: grep '^[^,]\+,27,' input.csv | cut -d, -f1,2,8
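
A small check of that point, as a sketch on two invented rows (residue 27 and residue 270). For brevity it uses '[^,]*' for the first field instead of the answer's '\+'; the rows and the variable names loose and strict are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical rows: residue 27 and residue 270 in the same frame.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,27,0.1,-2.0
1,270,0.2,-9.9
EOF

# Without the trailing comma the pattern also matches residue 270.
loose=$(grep -c '^[^,]*,27' "$sample")
# With the trailing comma the second field has to be exactly 27.
strict=$(grep -c '^[^,]*,27,' "$sample")

echo "loose: $loose, strict: $strict"
```

The loose pattern counts both rows, the strict one only the row for residue 27.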

+0

'\+' is undefined in POSIX basic regular expressions, so you are relying on a grep that treats '\+' as "1 or more". That said, it should be '*' instead. – geirha
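
For reference, a sketch of spellings that do not depend on '\+': the POSIX BRE interval '\{1,\}', the '+' of extended regular expressions via grep -E, and '*', which additionally accepts an empty first field. The sample rows (including the one with an empty frame field) are invented for illustration.

```shell
#!/bin/sh
# Hypothetical rows; the second one has an empty first field on purpose.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,27,0.1,-2.0
,27,0.3,-7.7
2,27,0.2,-3.0
EOF

# POSIX BRE: the interval \{1,\} means "one or more".
bre=$(grep '^[^,]\{1,\},27,' "$sample")
# POSIX ERE (grep -E): + is defined as "one or more".
ere=$(grep -E '^[^,]+,27,' "$sample")
# [^,]* means "zero or more", so the empty first field matches too.
star=$(grep '^[^,]*,27,' "$sample")

printf '%s\n' "$bre"
```

The first two variants select the same two rows; the '*' variant also selects the row whose frame field is empty.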

+0

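thx! One question: what should be added to the script to stop extracting lines after the i-th match in the initial data.csv? E.g. to extract only n lines with this command. – user3470313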
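
One possible way to do that with the grep pipeline, as a sketch rather than the answerer's own solution, is to put head between grep and cut so that only the first few matching rows get through. The function name first_matches, its argument order and the sample values are assumptions; the first field is matched with '[^,]*'.

```shell
#!/bin/sh
# Hypothetical 8-column rows; only columns 1, 2 and 8 matter here.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,27,1,2,3,4,5,-2.0
1,28,1,2,3,4,5,-4.0
2,27,1,2,3,4,5,-3.0
3,27,1,2,3,4,5,-3.5
EOF

# first_matches RESIDUE COUNT FILE
# Print frame, residue and the eighth column for at most COUNT rows
# whose second field equals RESIDUE.
first_matches() {
    grep "^[^,]*,$1," "$3" | head -n "$2" | cut -d, -f1,2,8
}

first_matches 27 2 "$sample"
```

Here only frames 1 and 2 of residue 27 are printed. GNU grep also offers -m NUM, which stops reading after NUM matching lines and would make the head step unnecessary.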

2

Here is an awk solution:

awk -v n=27 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF }' input.csv 
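  • -v n=27 - assigns the value 27 to the awk variable n.
  • BEGIN { OFS = FS = "," } - the BEGIN block runs before awk starts parsing any data. Here we set both FS (field separator) and OFS (output field separator) to "," so that input lines are split on commas and output fields are joined with commas.
  • $2 == n { print $1, $2, $NF } - for any record (line) whose second field ($2) equals n, print the first, second and last fields.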
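
To see the three parts working together, here is a minimal sketch that wraps the same awk program in a shell function and runs it on a tiny invented sample. The function name residue_total and all data values are hypothetical; only the header line is copied from the log in the question.

```shell
#!/bin/sh
# Hypothetical sample with the header from the question and made-up rows.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL
1,26,1,2,3,4,5,-1.0
1,27,1,2,3,4,5,-2.0
2,26,1,2,3,4,5,-1.5
2,27,1,2,3,4,5,-3.0
EOF

# residue_total RESIDUE FILE
# Print "frame,residue,total" for every row whose second field is RESIDUE.
residue_total() {
    awk -v n="$1" 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF }' "$2"
}

residue_total 27 "$sample"
```

The header line is skipped because its second field, "Residue", does not equal 27, so the result contains only the two rows for residue 27.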

To stop after m matches:

awk -v n=27 -v m=3 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF; if (++count == m) exit}' input.csv 
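
The question also asks about sed. The following is only a sketch of how the same selection might look there, not something from the original answer: a substitution keeps the first field, the residue and the last field, and -n with the p flag prints only the lines where it matched. It assumes at least four comma-separated fields per row; the function name residue_total_sed and the sample values are hypothetical.

```shell
#!/bin/sh
# Hypothetical 8-column rows, including residue 270 as a near miss.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,27,1,2,3,4,5,-2.0
1,270,1,2,3,4,5,-9.9
2,27,1,2,3,4,5,-3.0
EOF

# residue_total_sed RESIDUE FILE
# \1 captures the first field, \2 the last one; everything between the
# residue and the last comma is dropped. Only matching lines are printed.
residue_total_sed() {
    sed -n "s/^\([^,]*\),$1,.*,\([^,]*\)\$/\1,$1,\2/p" "$2"
}

residue_total_sed 27 "$sample"
```

Unlike the awk version, this has no simple counter for stopping after m matches; piping the result through head -n m is one way to get the same effect.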
+0

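thx! The same question again: what should be added to the script to stop extracting lines after the i-th match in the initial data.csv? E.g. to extract only n lines with this command. – user3470313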

+0

@user3470313 I've updated the answer. – geirha