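Format specific data using awk or sed

2016-10-04

I am currently working with a large dataset that contains file information formatted as blocks of data. I am trying to take a piece of data from the file path line and append it as a new column to specific lines. The dataset contains file information formatted like this: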

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10 
7a:8b:8e:20:7b:38    1982     10 
b9:45:3d:f4:97:88    1849     10 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10 
19:c5:b2:aa:b3:60    613      10 
11:7c:7e:76:4b:d5    1272     10 
36:e0:59:49:b6:4a    581      10 
9c:31:bc:8a:39:94    3296     10 
01:f0:56:3a:e1:a9    1140     10 
Whole File Hash: 4b28b44ae03d 

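What I want to do is take the file type (.jar and .c in this example) and append it to each of the corresponding chunk hash lines, so that the final format looks like this: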

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth)  
45:97:2a:60:e3:69    3208     10        .jar 
7a:8b:8e:20:7b:38    1982     10        .jar 
b9:45:3d:f4:97:88    1849     10        .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth)  
e8:b0:cb:6f:76:ff    1344     10        .c 
19:c5:b2:aa:b3:60    613      10        .c 
11:7c:7e:76:4b:d5    1272     10        .c 
36:e0:59:49:b6:4a    581      10        .c 
9c:31:bc:8a:39:94    3296     10        .c 
01:f0:56:3a:e1:a9    1140     10        .c 
Whole File Hash: 4b28b44ae03d 

I already have awk code that pulls out the file type and the chunk hash lines:

awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}' 

awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag' 

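I just don't know how to combine these pieces with awk (or sed) so that the file type is appended as a new column to every line of its respective chunk block. Another thing to note is that I am trying to do this inside a bash script, if that makes a difference.

Answers

2

Here is a (GNU) sed solution: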

/File path:/ {   # If line matches "File path:" 
    h     # Copy pattern space to hold space 
    s/.*(\.[^.]*)$/\1/ # Remove everything but extension from pattern space 
    x     # Swap pattern space and hold space 
}      # Hold space now contains extension 
/Chunk Hash/ {   # If line matches "Chunk Hash" 
    n     # Get next line into pattern space 
    :loop    # Anchor for loop 
    /Whole File Hash/b # If line matches "Whole File Hash", jump out of loop 
    G     # Append extension from hold space to pattern space 
    s/\n/\t\t\t\t/  # Substitute newline with a bunch of tabs 
    n     # Get next line 
    b loop    # Jump back to ":loop" label 
} 

This can be stored in a separate file (say, so.sed) and has to be called like

sed -r -f so.sed infile 

resulting in

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10        .jar 
7a:8b:8e:20:7b:38    1982     10        .jar 
b9:45:3d:f4:97:88    1849     10        .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10        .c 
19:c5:b2:aa:b3:60    613      10        .c 
11:7c:7e:76:4b:d5    1272     10        .c 
36:e0:59:49:b6:4a    581      10        .c 
9c:31:bc:8a:39:94    3296     10        .c 
01:f0:56:3a:e1:a9    1140     10        .c 
Whole File Hash: 4b28b44ae03d 

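Non-GNU seds have to jump through the usual hoops to insert tabs, and they cannot use the -r option (but probably -E, which should be equivalent here; -r is only there for convenience, to avoid having to escape the ()).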
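Since the question mentions a bash script, here is a minimal sketch of the same sed logic wrapped in a shell function that avoids the GNU-only `\t` escape. The function name `append_ext_sed` is made up for illustration, and it assumes a sed that accepts `-E` (GNU sed and the BSD/macOS sed do); the literal tab is produced by `printf` and spliced into the script.

```shell
#!/usr/bin/env bash
# Sketch only: append_ext_sed is a hypothetical helper name.
# Assumes a sed that supports -E for extended regular expressions.
append_ext_sed() {
    local tab
    tab=$(printf '\t')      # literal tab, instead of the GNU-only \t escape
    sed -E '
/File path:/ {
    h
    s/.*(\.[^.]*)$/\1/
    x
}
/Chunk Hash/ {
    n
    :loop
    /Whole File Hash/b
    G
    s/\n/'"$tab$tab"'/
    n
    b loop
}' "$@"
}
```

It could be called as `append_ext_sed infile > outfile`, or fed from a pipe. Two tabs are used here only to keep the lines short; adjust to taste.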

+0

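Some lines are doubled; the 'p' command should be removed from the address range block. – SLePort

+1

@Kenavoz Ugh, yes, 'n' prints when the '-n' option is not given... thanks! –

+0

This works great, thank you! –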

2

A solution in the TXR language:

@(repeat) 
@ (cases) 
File path: @*[email protected] 
Inode Num: @inode 
@header 
@ (collect) 
@hashline 
@ (last) 
Whole File Hash: @wfh 
@ (end) 
@ (output) 
File path: @[email protected] 
Inode Num: @inode 
@header 
@  (repeat) 
@{hashline 88}[email protected] 
@  (end) 
Whole File Hash: @wfh 
@ (end) 
@ (or) 
@other 
@ (do (put-line other)) 
@ (end) 
@(end) 

Run:

$ txr suffixes.txr data 
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10        .jar 
7a:8b:8e:20:7b:38    1982     10        .jar 
b9:45:3d:f4:97:88    1849     10        .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10        .c 
19:c5:b2:aa:b3:60    613      10        .c 
11:7c:7e:76:4b:d5    1272     10        .c 
36:e0:59:49:b6:4a    581      10        .c 
9c:31:bc:8a:39:94    3296     10        .c 
01:f0:56:3a:e1:a9    1140     10        .c 
Whole File Hash: 4b28b44ae03d 

0

In awk:

$ cat script.awk 
/File path/ { 
    match($0,/\..+/) 
    ext=substr($0,RSTART,RLENGTH) 
} 
/Chunk Hash/ { 
    flag=1   # flag on 
    print    # print here to... 
    next    # avoid printing ext 
} 
/Whole File Hash:/ { 
    flag=0   # flag off 
} 
flag==1 { 
    print $0, ext  # add space here to your liking, left it short... 
    next    # ... to show output on screen without sidescrolling 
} 1     # print non-flagged records 

Run:

$ awk -f script.awk data.txt 
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar 
Inode Num: 22525898 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
45:97:2a:60:e3:69    3208     10 .jar 
7a:8b:8e:20:7b:38    1982     10 .jar 
b9:45:3d:f4:97:88    1849     10 .jar 
Whole File Hash: 865999b40fd9 

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c 
Inode Num: 31881221 
Chunk Hash      Chunk Size (bytes)  Compression Ratio (tenth) 
e8:b0:cb:6f:76:ff    1344     10 .c 
19:c5:b2:aa:b3:60    613      10 .c 
11:7c:7e:76:4b:d5    1272     10 .c 
36:e0:59:49:b6:4a    581      10 .c 
9c:31:bc:8a:39:94    3296     10 .c 
01:f0:56:3a:e1:a9    1140     10 .c 
Whole File Hash: 4b28b44ae03d 
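
The `match($0,/\..+/)` call above grabs everything from the first dot on the line, which is fine for the sample data but would pick up too much if a directory name contained a dot. A hedged variant, wrapped in a bash function because the question asks for a bash script, might look like the following. The function name `add_ext_column` and the single tab separator are assumptions for illustration, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch only: add_ext_column is a hypothetical helper name.
add_ext_column() {
    awk '
    /^File path:/ {
        ext = ""
        name = $0
        sub(/[ \t]+$/, "", name)          # drop trailing blanks, if any
        sub(/.*\//, "", name)             # keep only the part after the last "/"
        if (match(name, /\.[^.]+$/))      # last dot of the file name to its end
            ext = substr(name, RSTART)
    }
    /^Chunk Hash/       { in_chunks = 1; print; next }   # header: print unchanged
    /^Whole File Hash:/ { in_chunks = 0 }                # block ends here
    in_chunks {
        sub(/[ \t]+$/, "")                # tidy the chunk line before appending
        printf "%s\t%s\n", $0, ext
        next
    }
    { print }                             # every other line is passed through
    ' "$@"
}
```

Usage would be `add_ext_column data.txt`, or `some_command | add_ext_column` when reading from a pipe. Files without an extension get an empty last column.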

0

awk --re-interval '
/^File/ {                                    # Line starts with "File"
    s = gensub("[^.]+(.*)", "\\1", "1", $0)  # Keep everything from the first "." on (e.g. ".c", ".jar"); gensub is a gawk extension
}
/(..:){3,}/ {                                # Line contains "xx:" three or more times in a row (a chunk hash line)
    gsub("[0-9]+$", "&\t\t\t\t\t" s, $0)     # After the trailing number, append tabs and s
}
1' file                                      # Print every line
+0

Sorry, my English is not good and it is hard for me to express myself. I will try to write some explanation. – zxy