2013-11-09 33 views

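I can't get the result I need using join. How can I improve this GNUwin32 join command?

I am running GNUwin32 on Windows 7 64-bit, with join version 5.3.0.1936 and gawk version 3.1.6.2962.

The following two tables are the input: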

TABLE_1

UID_C CID 
C000002 31799 
C000002 31800 
C000386 14950 
C000386 9807916 
C000386 10255083 
C008114 5318432 
C008117 799 
C008117 444150 
C008117 46878464 

TABLE_2

UID_C CID name 
C000002 31799 bevonium 
C000002 31800 bevonium 
C002284 24832095 hypromellose 
C008117 799 indoleglycerol phosphate 
C008117 444150 indoleglycerol phosphate 
C008117 46878464 indoleglycerol phosphate 

I am using the following command in a .bat file:

C:\gnuwin32\bin\join -t"|" -1 1 -2 1 -a1 -a2 -e "NULL" -o "0,1.2,2.2,2.3" C:\directory\Table_1.txt C:\directory\Table_2.txt > C:\directory\Table_3.txt 

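For this illustration on Stack Overflow, the tables are formatted with tabs for readability, but I actually use the pipe character as both the input and the output delimiter.

The following table is the output: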

TABLE_3

UID_C CID CID name 
C000002 31800 31799 bevonium 
C000002 31800 31800 bevonium 
C000002 31799 31799 bevonium 
C000002 31799 31800 bevonium 
C000386 10255083 NULL NULL 
C000386 9807916 NULL NULL 
C000386 14950 NULL NULL 
C002284 NULL 24832095 hypromellose 
C008114 5318432 NULL NULL 
C008117 46878464 799 indoleglycerol phosphate 
C008117 46878464 444150 indoleglycerol phosphate 
C008117 46878464 46878464 indoleglycerol phosphate 
C008117 444150 799 indoleglycerol phosphate 
C008117 444150 444150 indoleglycerol phosphate 
C008117 444150 46878464 indoleglycerol phosphate 
C008117 799 799 indoleglycerol phosphate 
C008117 799 444150 indoleglycerol phosphate 
C008117 799 46878464 indoleglycerol phosphate 

The desired output is:

TABLE_4

UID_C CID name 
C000002 31799 bevonium 
C000002 31800 bevonium 
C000386 14950 NULL 
C000386 9807916 NULL 
C000386 10255083 NULL 
C002284 24832095 hypromellose 
C008114 5318432 NULL 
C008117 799 indoleglycerol phosphate 
C008117 444150 indoleglycerol phosphate 
C008117 46878464 indoleglycerol phosphate 

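How can I change the join command to produce the desired output?

Alternatively, how should I use awk to post-process Table_3 and generate Table_4?

Thanks in advance for your suggestions.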
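
For the post-processing route asked about above, one possible filter is sketched here. It assumes a POSIX-style shell, a pipe-delimited Table_3 laid out as UID|CID|CID|name, and the rule that a row is a true match only when the two CID columns agree or one of them is NULL. That rule is inferred from the tables, and the temporary directory and sample rows are for illustration only.

```shell
# Hypothetical working directory; the rows are a subset of TABLE_3.
work=$(mktemp -d)

cat > "$work/Table_3.txt" <<'EOF'
C000002|31800|31799|bevonium
C000002|31800|31800|bevonium
C000002|31799|31799|bevonium
C000002|31799|31800|bevonium
C000386|14950|NULL|NULL
C002284|NULL|24832095|hypromellose
C008117|799|799|indoleglycerol phosphate
C008117|799|444150|indoleglycerol phosphate
EOF

awk -F'|' -v OFS='|' '
    $2 == "NULL" { print $1, $3, $4; next }   # row came only from Table_2
    $3 == "NULL" { print $1, $2, $4; next }   # row came only from Table_1
    $2 == $3     { print $1, $2, $4 }         # both CIDs agree: keep one copy
' "$work/Table_3.txt" | sort -t'|' -k1,1 -k2,2n > "$work/Table_4.txt"

cat "$work/Table_4.txt"
```

Rows whose two CID values differ are the cross-product pairs that join produced for a repeated UID, so the filter simply drops them.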

Answers

I think you need more logic than join provides:

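awk -F"|" -v "OFS=|" '
    # First file (Table_1): remember every UID|CID pair.
    NR==FNR {uid_cid[$1 OFS $2]=1; next}
    # Second file (Table_2): print each row and mark its pair as matched.
    {
        key = $1 OFS $2
        if (key in uid_cid) {
            delete uid_cid[key]
        }
        print
    }
    # Pairs still stored exist only in Table_1: print them with a NULL name.
    END {
        for (key in uid_cid) {
            print key, "NULL"
        }
    }
' Table_1 Table_2 | sort -k1,1 -k2,2n -t "|"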
C000002|31799|bevonium 
C000002|31800|bevonium 
C000386|14950|NULL 
C000386|9807916|NULL 
C000386|10255083|NULL 
C002284|24832095|hypromellose 
C008114|5318432|NULL 
C008117|799|indoleglycerol phosphate 
C008117|444150|indoleglycerol phosphate 
C008117|46878464|indoleglycerol phosphate 
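
If join itself is preferred, one workaround is to fuse UID and CID into a single key before joining. join matches on one field per file, so every Table_1 row with a given UID is paired with every Table_2 row with that UID, which is where the extra rows in TABLE_3 come from. The sketch below assumes a POSIX-style shell, headerless pipe-delimited inputs, and that `:` never occurs in the data. The `keys_1.txt` and `keys_2.txt` names are hypothetical.

```shell
work=$(mktemp -d)
export LC_ALL=C   # make sort and join agree on collation order

cat > "$work/Table_1.txt" <<'EOF'
C000002|31799
C000002|31800
C000386|14950
C000386|9807916
C000386|10255083
C008114|5318432
C008117|799
C008117|444150
C008117|46878464
EOF

cat > "$work/Table_2.txt" <<'EOF'
C000002|31799|bevonium
C000002|31800|bevonium
C002284|24832095|hypromellose
C008117|799|indoleglycerol phosphate
C008117|444150|indoleglycerol phosphate
C008117|46878464|indoleglycerol phosphate
EOF

# Fuse UID and CID into one key field ("UID:CID"), then sort on that key.
awk -F'|' '{ print $1 ":" $2 }' "$work/Table_1.txt" |
    sort -t'|' -k1,1 > "$work/keys_1.txt"
awk -F'|' -v OFS='|' '{ print $1 ":" $2, $3 }' "$work/Table_2.txt" |
    sort -t'|' -k1,1 > "$work/keys_2.txt"

# Full outer join on the fused key; -e fills a missing name with NULL.
# Afterwards the key is split back into its UID and CID parts.
join -t'|' -a1 -a2 -e NULL -o 0,2.2 "$work/keys_1.txt" "$work/keys_2.txt" |
    awk -F'|' -v OFS='|' '{ split($1, k, ":"); print k[1], k[2], $2 }' |
    sort -t'|' -k1,1 -k2,2n > "$work/Table_4.txt"

cat "$work/Table_4.txt"
```

With these inputs the result should be the same ten rows as the awk output above. The final sort is only there to restore numeric CID order, because the fused key is sorted as text.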

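I need a little help with your suggestion. Column_1 is UID (for simplicity, let's drop the trailing underscore). Column_2 is CID. Column_3 is name. Help me translate that into the awk phrases {uid_cid[$1 OFS $2]=1; next} and key in uid_cid. I'm not making the connection yet.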
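
The two phrases asked about here can be tried in isolation. The following is a small sketch, assuming a POSIX-style shell; the file names `first.txt` and `second.txt` and the array name `seen` are hypothetical, and the rows are taken from the tables in the question.

```shell
work=$(mktemp -d)
printf 'C000386|14950\nC008117|799\n' > "$work/first.txt"
printf 'C002284|24832095|hypromellose\nC008117|799|indoleglycerol phosphate\n' > "$work/second.txt"

result=$(awk -F'|' -v OFS='|' '
    # NR counts lines across all files, FNR restarts at 1 for each file,
    # so NR == FNR holds only while the first file is being read.
    NR == FNR {
        seen[$1 OFS $2] = 1   # store the pair "UID|CID" as an array index
        next                  # skip the remaining rules for this line
    }
    # From here on, every line comes from the second file.
    {
        key = $1 OFS $2
        # "key in seen" is true when that index was stored above.
        print key, ((key in seen) ? "also in first file" : "only in second file")
    }
' "$work/first.txt" "$work/second.txt")

printf '%s\n' "$result"
```

Here `$1 OFS $2` is plain string concatenation: column 1 (UID), the pipe, and column 2 (CID) become one string such as `C008117|799`, and that string is used as the array index.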

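I am also having trouble running it under Windows. I ran the awk code with table_1 and table_2 as input and redirected the result to a new table_3, to make sure I understood the awk processing. I get "errcount: 1".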

I used double quotes (for the Windows .bat file) instead of single quotes to delimit the awk program.
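
Since cmd.exe does not treat single quotes as quoting characters, one way to avoid the quoting problem altogether is to keep the awk program in its own file and run it with `-f`. The sketch below uses a POSIX-style shell only to create the files; the script name `merge.awk` and the shortened sample tables are hypothetical. In a .bat file the equivalent call would presumably be `gawk -f merge.awk Table_1.txt Table_2.txt`, which is an assumption about the local setup.

```shell
work=$(mktemp -d)

# merge.awk: the same logic as the answer above, stored in a file
# so that no shell quoting is needed around the program text.
cat > "$work/merge.awk" <<'EOF'
BEGIN { FS = OFS = "|" }
NR == FNR { uid_cid[$1 OFS $2] = 1; next }
{
    delete uid_cid[$1 OFS $2]
    print
}
END {
    for (key in uid_cid)
        print key, "NULL"
}
EOF

printf 'C000386|14950\nC008117|799\n' > "$work/Table_1.txt"
printf 'C002284|24832095|hypromellose\nC008117|799|indoleglycerol phosphate\n' > "$work/Table_2.txt"

awk -f "$work/merge.awk" "$work/Table_1.txt" "$work/Table_2.txt" |
    sort -t'|' -k1,1 -k2,2n > "$work/Table_3.txt"

cat "$work/Table_3.txt"
```

Setting FS and OFS in a BEGIN block replaces the `-F` and `-v` options, so the command line contains nothing that needs quoting apart from the file paths.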