2011-07-06 53 views

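Unix file join: how to preserve line order?

Joining two files on a common column with Unix join requires (as far as I know) first sorting both files on that column. If that is correct, the original line order is lost. I want to keep the line order of the first file. To do that, I add a ghost column to the first file containing each line's line number. Then I sort both files on the common column, join them, re-sort the output on the ghost column, and delete that column. I include a script that does this. Is there a better or faster way?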

#!/bin/bash 

input_file1= 
input_file2= 
output_file='/dev/stdout' 
key1= 
key2= 
ifs_tab=0 
rand=$$ 
key_field_type="" # sort flag for the join key: -n (numeric), -g (general numeric), or empty for plain text
appname=`basename "$0"` 

function print_help_and_exit { 
    echo "Usage : $appname -1 key1 -2 key2 [-t] [-n|-g] file1 file2 [>output]" 
    echo "key1: the join column from the first input file (column numbers start from 1)" 
    echo "key2: the join column from the second input file" 
    echo "optional flag -t uses a single tab as a field separator as opposed to a sequence of white space (which is the default)" 
    echo "-n or -g : flags to be passed to sort: -n sort in numeric order, -g sort in general numeric order, default: text, leave empty" 
    echo "script by Andreas Hadjiprocopis/Institute of Cancer Research, 2011" 
    exit 1 
} 
# "g" must be listed here, otherwise the -g case below is never reached
while getopts "1:2:o:tngh" OPTION; do
    case $OPTION in 
      1) 
        key1="${OPTARG}" 
        ;; 
      2) 
        key2="${OPTARG}" 
        ;; 
      o) 
        output_file="${OPTARG}" 
        ;; 
      t) 
        ifs_tab=1 
        ;; 
      n) 
        key_field_type="-n" 
        ;; 
      g) 
        key_field_type="-g" 
        ;; 
      h) 
        print_help_and_exit 
        ;; 
    esac 
done 
shift $(($OPTIND - 1)) 
input_file1=$1; shift 
input_file2=$1; shift 

if [ "$key1" == "" ] || [ "$key2" == "" ] || [ "$input_file1" == "" ] || [ "$input_file2" == "" ]; then 
    echo "$appname : incorrect number of parameters" >&2
    print_help_and_exit 
fi 
if [ ${ifs_tab} -eq 1 ]; then ifs1="-t$'\t'"; ifs2="-F $'\t'"; else ifs1=""; ifs2=""; fi 
# note: when you do a join the output file contains the common column first, then all the columns of the first file, then all from second file 

# add a new column to the beginning of the input_file1 and increment its join-column number (key1) 
# then we will sort the two input files as required by join 
# then we will join the two input files on the specified column numbers (key1 and key2) 
# then we will sort the output according to the new column we added 
# and then delete that column, output to STDOUT 

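let key1++
# the here-document is piped to bash (not sh) because the separator options use $'\t'
# sort keys are limited to the join field (-k N,N), as join expects
# join prints the join field first, so the ghost line-number column is field 2 of its output
cat << EOC | bash
awk ${ifs2} '{print NR"\t"\$0}' "${input_file1}" | sort ${ifs1} -k ${key1},${key1} ${key_field_type} > /tmp/${rand}.1
sort ${ifs1} -k ${key2},${key2} ${key_field_type} "${input_file2}" > /tmp/${rand}.2
join ${ifs1} -1 ${key1} -2 ${key2} /tmp/${rand}.1 /tmp/${rand}.2 | sort ${ifs1} -k 2,2n | awk ${ifs2} '{str=\$1;for(i=3;i<=NF;i++) str=str"\t"\$i; print str}' > "${output_file}"
EOC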

rm -f /tmp/${rand}.* 
exit 0 

Answer


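Here are some suggestions:

  • You don't need to create temporary files. Use process substitution instead.
  • You don't need cat << EOC | sh.

Your three commands can be combined into a single pipeline: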

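# Caveat: run directly (not through cat << EOC | bash), an ifs1 holding the literal
# text -t$'\t' is not re-parsed by a shell, so the -t case needs separate handling.
# join prints the join field first, so the ghost line-number column is field 2.
join ${ifs1} -1 ${key1} -2 ${key2} \
    <(awk ${ifs2} '{print NR"\t"$0}' "${input_file1}" | sort ${ifs1} -k ${key1},${key1} ${key_field_type}) \
    <(sort ${ifs1} -k ${key2},${key2} ${key_field_type} "${input_file2}") \
| sort ${ifs1} -k 2,2n \
| awk ${ifs2} '{str=$1;for(i=3;i<=NF;i++) str=str"\t"$i; print str}' > "${output_file}"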
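If sorting is the main cost, one alternative worth considering is to skip sort and join entirely and do the lookup in awk. This is a different technique from the one in the question (an in-memory hash join), shown only as a rough sketch. It assumes tab-separated input, that the second file fits in memory and is not empty, and that its keys are unique (unlike join, only the first match per key is kept). The function name hash_join and its argument order are made up for this example.

```shell
#!/bin/sh
# hash_join FILE1 KEY1 FILE2 KEY2   (hypothetical helper, not from the thread)
# Prints: join key, remaining fields of FILE1, remaining fields of FILE2,
# in the original line order of FILE1. Lines of FILE1 without a match are dropped.
hash_join() {
    awk -F '\t' -v OFS='\t' -v k1="$2" -v k2="$4" '
        NR == FNR {
            # first pass (FILE2): remember the non-key fields for each key
            rest = ""
            for (i = 1; i <= NF; i++) if (i != k2) rest = rest OFS $i
            if (!($k2 in seen)) { seen[$k2] = 1; other[$k2] = rest }
            next
        }
        ($k1 in seen) {
            # second pass (FILE1): emit matches as they are read, so order is kept
            out = $k1
            for (i = 1; i <= NF; i++) if (i != k1) out = out OFS $i
            print out other[$k1]
        }
    ' "$3" "$1"
}
```

A call such as hash_join file1 2 file2 1 would join column 2 of file1 against column 1 of file2. Since nothing is sorted, there is no ghost column to add or remove.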

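Thanks for the tips. cat << EOC | sh is a cheap way of handling delimiters stored in variables, e.g. x="-t$'\t'"; sort $x < file. Any tips on that, i.e. passing delimiters stored in variables to sort and join? (By the way, with your suggestion it should be cat << EOC | bash.) – bliako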
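One possible way to pass such delimiters without a second shell, offered here as a sketch rather than as something from the thread, is to keep the optional arguments in bash arrays. Each array element stays one word when expanded as "${array[@]}", so a literal tab survives, and an empty array expands to nothing. The function name join_keep_order and its positional arguments are hypothetical. It needs bash (arrays and process substitution), and the numeric sort flags from the question are left out for brevity.

```shell
#!/bin/bash
# join_keep_order USE_TAB KEY1 KEY2 FILE1 FILE2   (hypothetical helper)
# USE_TAB=1 means fields are separated by a single tab, 0 means runs of whitespace.
join_keep_order() {
    local use_tab=$1 key1=$2 key2=$3 file1=$4 file2=$5
    local sep_opt=() awk_opt=()
    if [ "$use_tab" -eq 1 ]; then
        sep_opt=(-t $'\t')    # two words for sort and join: -t and a literal tab
        awk_opt=(-F '\t')     # awk itself interprets \t as a tab
    fi
    key1=$((key1 + 1))        # the ghost line-number column shifts FILE1's key by one

    join "${sep_opt[@]}" -1 "$key1" -2 "$key2" \
        <(awk "${awk_opt[@]}" '{print NR"\t"$0}' "$file1" |
              sort "${sep_opt[@]}" -k "$key1,$key1") \
        <(sort "${sep_opt[@]}" -k "$key2,$key2" "$file2") |
    sort "${sep_opt[@]}" -k 2,2n |    # join output: key, ghost column, remaining fields
    awk "${awk_opt[@]}" '{str=$1; for(i=3;i<=NF;i++) str=str"\t"$i; print str}'
}
```

For example, join_keep_order 1 2 1 file1 file2 would join column 2 of a tab-separated file1 against column 1 of file2 and print the result in file1's line order.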