2012-03-13 48 views
3

Searching the lines of a file for a specific field

I have a file containing data like this:

0000380000000101 
0000650000000201 
0000650000000301 
0000650000000401 
0001000000000101 
0001000000000201 

...and so on. I want to process this data so that I get output like this:

000065 0000000201 0000000301 0000000401 
000100 0000000101 0000000201 

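Because 000065 is repeated three times in the input, I want 000065 to appear only once in the output, followed by the bytes from each entry in which 000065 occurs. Because 000038 occurs only once, I do not want it in the output. In this example the leading data (000065 or 000038) happens to be 3 bytes, although it can be any length, while the trailing bytes such as 0000000401 have a fixed length of 5 bytes. I would prefer a solution in shell script or C. Please let me know how I can do this. Can awk help here? Any help would be greatly appreciated. Below is data taken from the actual file that I want to process: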

0000000000000101 
0000000000000201 
0000000000000301 
0000000000000401 
0000380000000101 
0000650000000201 
0000650000000301 
0000650000000401 
0001000000000101 
0001000000000201 
0001000000000301 
0001000000000401 
0038d30000000101 
00652e0000000201 
00652e0000000301 
00652e0000000401 
008d750000000101 
008d750000000201 
008d750000000301 
008d750000000401 
0100010000000101 
0100010000000201 
0100010000000301 
0100010000000401 
01008d0000000101 
01008d0000000201 
01008d0000000301 
01008d0000000401 
01a8c00000000101 
01a8c00000000201 
01a8c00000000301 
01a8c00000000401 
0264010000000101 
0264010000000201 
0264010000000301 
0264010000000401 
0615df0000000101 
0615df0000000201 
0615df0000000301 
0615df0000000401 
07dd940000000101 
07dd940000000201 
07dd940000000301 
07dd940000000401 
0900000000000101 
0900000000000201 
0900000000000301 
0900000000000401 
15dfc70000000101 
15dfc70000000201 
15dfc70000000301 
15dfc70000000401 
1ecf090000000101 

Answers

1

This might work for you (is sed OK?):

sed ':a;$!N;s/^\(.*\)\(\( *[^ ]\{10\}\)*\)\n\1/\1\2 /;ta;/ /!D;s/.\{10\} / &/;P;D' file
000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201
4

Your data is fixed width, so you can use gawk:

$ gawk -v FIELDWIDTHS='6 10' 'NR!=1 && x==$1""{printf(" %s", $2); next}; {x=$1""; printf("%s%s %s", NR==1?"":"\n", $1, $2)}; END{print ""}' input.txt | sed '/^[0-9a-f]* [0-9a-f]*$/d' 
000000 0000000101 0000000201 0000000301 0000000401 
000065 0000000201 0000000301 0000000401 
000100 0000000101 0000000201 0000000301 0000000401 
00652e 0000000201 0000000301 0000000401 
008d75 0000000101 0000000201 0000000301 0000000401 
010001 0000000101 0000000201 0000000301 0000000401 
01008d 0000000101 0000000201 0000000301 0000000401 
01a8c0 0000000101 0000000201 0000000301 0000000401 
026401 0000000101 0000000201 0000000301 0000000401 
0615df 0000000101 0000000201 0000000301 0000000401 
07dd94 0000000101 0000000201 0000000301 0000000401 
090000 0000000101 0000000201 0000000301 0000000401 
15dfc7 0000000101 0000000201 0000000301 0000000401 

FIELDWIDTHS A white-space separated list of fieldwidths. When set, gawk parses the input into fields of fixed width, instead of using the value 
       of the FS variable as the field separator. 
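
Because FIELDWIDTHS is specific to gawk, the same fixed-width split can also be sketched with `substr`, which POSIX awk provides. This is only an illustration under stated assumptions: the function name `group_fixed` is made up here, the key is taken to be 6 characters and the suffix 10, and lines with equal keys are assumed to be adjacent, as they are in the sample data. Keys that occur only once are dropped inside awk, so no trailing sed filter is needed.

```shell
#!/bin/sh
# Sketch: group adjacent lines that share the same 6-character key.
# Assumes fixed widths (6 + 10) and input in which equal keys are adjacent.
# group_fixed is a hypothetical helper name, not part of the answer above.
group_fixed() {
    awk '
    {
        key = substr($0, 1, 6)       # first 6 characters
        val = substr($0, 7, 10)      # next 10 characters
        if (NR > 1 && key == prev) {
            line = line " " val      # same key as previous line: append value
            n++
        } else {
            if (n > 1) print line    # emit the previous group only if repeated
            prev = key
            line = key " " val
            n = 1
        }
    }
    END { if (n > 1) print line }    # flush the last group
    ' "$@"
}
```

Usage would be `group_fixed input.txt`, or the data can be piped in on standard input.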
+0

[UUOC](https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat) alert! – 2012-03-13 12:50:08

+0

You are an awk master! – 2012-03-13 12:51:39

+0

This did not work for me on a Mac – anubhava 2012-03-13 12:55:52

1

awk with FIELDWIDTHS is one way, as kev showed.

Here is another way (a one-liner) using only awk:

awk 'BEGIN{FS=""} 
    {for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y; x="";} 
    END{for(t in a)print t" "a[t]}' yourFile 

Tested on your small data sample:

kent$ echo "0000380000000101 
0000650000000201 
0000650000000301 
0000650000000401 
0001000000000101 
0001000000000201"|awk 'BEGIN{FS=""} {for(i=1;i<=6;i++) x=x$i; y=$0; gsub("^"x,"",y);a[x]=a[x]?a[x]" "y:y; x="";}END{for(t in a)print t" "a[t]}' 

000100 0000000101 0000000201 
000065 0000000201 0000000301 0000000401 
000038 0000000101 
2

You can use the following awk command (tested on Linux and Mac):

awk '{key=substr($0, 1, 6); val=substr($0, 7); arr[key]=(key in arr) ? arr[key] " " val : val; cnt[key]++}
END{for (a in arr) if (cnt[a]>1) print a, arr[a]}' file

Output (the order in which for-in visits the keys is unspecified):

000065 0000000201 0000000301 0000000401
000100 0000000101 0000000201
2

First, pipe your data through this:

awk '{suffixLen = 10; print substr($0, 1, length($0) - suffixLen)" "substr($0, length($0) - suffixLen + 1, length($0))}' 

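The suffixLen variable is the (fixed) number of trailing characters: 5 bytes at two characters per byte = 10. This splits each input line into two fields separated by a space. Then pipe the result through this:

awk '{if ($1 in values) {values[$1] = values[$1]" "$2} else {values[$1] = $1" "$2}}END{for (v in values) print values[v]}'

Sorting the output correctly is left as an exercise for the reader.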
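
The two stages above can also be folded into a single pass. The following is a minimal sketch rather than a definitive implementation: the function name `group_by_suffix` and the `SUFFIX_LEN` environment variable are hypothetical names introduced for illustration. It assumes that only the trailing part has a fixed length (10 characters by default), keeps keys in the order they first appear, and skips keys that occur only once, as the question asks.

```shell
#!/bin/sh
# Sketch: split each line into key + fixed-length suffix, group by key,
# and print only keys seen more than once, in first-seen order.
# SUFFIX_LEN is an assumed knob (default 10 = 5 bytes x 2 hex digits).
group_by_suffix() {
    awk -v suffix_len="${SUFFIX_LEN:-10}" '
    {
        sub(/[ \t\r]+$/, "")                  # drop trailing whitespace, if any
        if (length($0) <= suffix_len) next    # no room for a key: skip the line
        cut = length($0) - suffix_len
        key = substr($0, 1, cut)              # everything before the suffix
        val = substr($0, cut + 1)             # the fixed-length suffix
        if (!(key in count)) order[++nkeys] = key   # remember first appearance
        count[key]++
        values[key] = (count[key] == 1) ? val : values[key] " " val
    }
    END {
        for (i = 1; i <= nkeys; i++) {
            key = order[i]
            if (count[key] > 1) print key, values[key]
        }
    }
    ' "$@"
}
```

Because the groups are collected in arrays, equal keys do not have to be adjacent in the input, at the cost of holding all values in memory until the end.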