使用AWK合併兩個數據集

我有兩個數據文件：file01和file02。在這兩個數據集中，字段是：（i）識別器; （ii）數字參考; （iii）經度;和（四）緯度。對於file01中的每一行，我想使用相同的數字參考搜索file02中的數據，然後在file02中找到最接近file01中的標識符的標識符。使用AWK合併兩個數據集

我能得到這個，如果我使用下面的代碼從file01通過手動值awk程序：

awk 'function acos(x) { return atan2(sqrt(1-x*x), x) } 
BEGIN {pi=3.14159; 
     ndist=999999999.1; 
     date=1001; 
     lo1=-1.20; lg1=lo1*(pi/180); 
     la1=30.31; lt1=la1*(pi/180) 
      } 
{if($2==date) {ws=$1; 
       lg2=$3*(pi/180); 
       lt2=$4*(pi/180); 
       dist= 6378.7 * acos(sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1)); 
       if(dist < ndist) {ndist=dist; ws0=ws}}} 
END {print(ws0,ndist)}' file02

正如你看到的，date，lo1並在BEGIN聲明la1是美國的價值觀第一行file01（數據文件見下文）。我的問題是，如果我可以一次做到這一點，所以每次我讀file01中的一行時，我會得到最近的標識符和距離，並追加到file01的行數據中。我不知道一些shell命令是否可以通過更簡單的方式實現這一點，也許可以使用管道。

這兩個數據文件和所希望的輸出的一個例子是：

=== file01 ===

A 1001 -1.2 30.31 
A 1002 -1.2 30.31 
B 1002 -1.8 30.82 
B 1003 -1.8 30.82 
C 1001 -2.1 28.55

=== file02 ===

ws1 1000 -1.3 29.01 
ws1 1001 -1.3 29.01 
ws1 1002 -1.3 29.01 
ws1 1003 -1.3 29.01 
ws1 1004 -1.3 29.01 
ws1 1005 -1.3 29.01 
ws2 1000 -1.5 30.12 
ws2 1002 -1.5 30.12 
ws2 1003 -1.5 30.12 
ws2 1004 -1.5 30.12 
ws2 1005 -1.5 30.12 
ws3 1000 -1.7 29.55 
ws3 1001 -1.7 29.55 
ws3 1002 -1.7 29.55 
ws3 1003 -1.7 29.55 
ws3 1004 -1.7 29.55 
ws3 1005 -1.7 29.55 
ws4 1000 -1.9 30.33 
ws4 1001 -1.9 30.33 
ws4 1002 -1.9 30.33 
ws4 1003 -1.9 30.33 
ws4 1004 -1.9 30.33 
ws4 1005 -1.9 30.33

= ==輸出文件===

A 1001 -1.2 30.31 ws4 67.308 
A 1002 -1.2 30.31 ws2 35.783 
B 1002 -1.8 30.82 ws4 55.387 
B 1003 -1.8 30.82 ws4 55.387 
C 1001 -2.1 28.55 ws1 85.369

編輯＃1：考慮通過@Eran的建議，我寫了下面的代碼：

join -j 2 < (sort -k 2,2 file01) < (sort -k 2,2 file02) | 
awk 'function acos(x) { return atan2(sqrt(1-x*x), x) } 
    BEGIN {pi=3.14159} 

    {if (last != $1 $2) 
     {print NR, id,r,lon,lat,ws0,ndist; 
      last = $1 $2; 
      ndist=999999999.1 

     } else { 

      lg1=$3*(pi/180); 
      lt1=$4*(pi/180); 
      lg2=$6*(pi/180); 
      lt2=$7*(pi/180); 
      dist= 6378.7 * acos(sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1)); 
      if(dist< ndist) {ndist=dist; ws0=$5} 
      id=$2;r=$1;lon=$3;lat=$4 

      } 
    }'

從這個腳本的輸出是：

1  
4 A 1001 -1.2 30.31 ws4 67.3078 
7 C 1001 -2.0 28.55 ws3 115.094 
11 A 1002 -1.2 30.31 ws2 35.7827 
15 B 1002 -1.8 30.82 ws4 55.387

編輯＃2：使用的athe建議@丹尼斯（有一些修改）我已經得到了期望的輸出。 awk腳本如下：


awk 'function acos(x) { return atan2(sqrt(1-x*x), x) } 
    BEGIN {pi=3.14159} 
    NR==FNR {c++; a1[c]=$1;a2[c]=$2;a3[c]=$3;a4[c]=$4; next} 
      {d++; b1[d]=$1;b2[d]=$2;b3[d]=$3;b4[d]=$4} 

    END { 
    for(k=1;k<=c;k++) { 
     lg1=a3[k]*(pi/180); 
     lt1=a4[k]*(pi/180); 
     ndist=999999999.1; 
     for(l=1;l<=d;l++) { 
      if(b2[l]==a2[k]) {kk=b2[l]; 
       lg2=b3[l]*(pi/180); 
       lt2=b4[l]*(pi/180); 
       dist= 6378.7 * acos(sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1)); 
       if(dist&ltndist) {ndist=dist; ws0=b1[l]} 
      } 
     } 
     print a1[k],a2[k],a3[k],a4[k],ws0,ndist 
    } 
    }' file01 file02

來源

2012-05-09 Manuel Ramón

將您的值從file01讀入一個或多個數組。您可以在BEGIN塊中使用getline，或者規範的方式是使用FNR == NR循環作爲主要塊之一。

FNR == NR {array[$1] = $1; ...; next } # read file01 into some arrays 
{ for (item in array) { ... }  # process each item in the array(s) against each line in file02

你的腳本將被調用爲awk '...' file01 file02

而是通過字段值索引的數組，你可以用計數器array1[c] = $1; array2[c] = $2; c++對其進行索引以及與計數器而不是使用迭代in：for (i=0; i<c; i++)。

當然，你應該選擇有意義的數組名。

來源

2012-05-09 16:04:23

+1謝謝@Dennis，你的回答非常有幫助！ –

有趣的挑戰。既然你必須首先在file02中讀取數據，並將數據存儲在數據結構中，那麼我首先會傾向於使用Perl。

#!/usr/bin/perl 
use strict; 
use warnings; 

# see http://perldoc.perl.org/Math/Trig.html 
use Math::Trig qw(great_circle_distance deg2rad); 
sub NESW {deg2rad($_[0]), deg2rad(90-$_[1])} 

# read file02 
my %data; 
my $file2 = 'file02'; 
open my $fid, '<', $file2 or die "can't read $file2: $!\n"; 
while (<$fid>) { 
    my ($id, $ref, $long, $lat) = split; 
    push @{$data{$ref}}, [$id, $long, $lat]; 
} 
close $fid; 

$, = " "; 

# process file01 
my $file1 = 'file01'; 
open $fid, '<', $file1 or die "can't read $file1: $!\n"; 
while (<$fid>) { 
    my ($id, $ref, $long, $lat) = split; 
    my @here = NESW($long, $lat); 
    my $min = 99_999_999; 
    my (@min_id, $dist); 

    while (my ($key, $listref) = each %data) { 
     next unless $key == $ref; 

     foreach my $trioref (@$listref) { 
      my ($other_id, $other_long, $other_lat) = @$trioref; 
      my @there = NESW($other_long, $other_lat); 
      $dist = great_circle_distance(@here, @there, 6378.7); 
      if ($dist < $min) { 
       $min = $dist; 
       @min_id = @$trioref; 
      } 
     } 
    } 

    printf "%s %d %s %s %s %6.3f\n", $id, $ref, $long, $lat, $min_id, $min; 
} 
close $fid;

此輸出

A 1001 -1.2 30.31 ws4 67.308 
A 1002 -1.2 30.31 ws2 35.783 
B 1002 -1.8 30.82 ws4 55.387 
B 1003 -1.8 30.82 ws4 55.387 
C 1001 -2.1 28.55 ws1 93.361

我注意到「C」距離是你有什麼建議它應該是不同的。

來源

2012-05-09 16:47:06

謝謝@glenn的幫助。我對Perl不熟悉，所以我需要更多時間來檢查你的建議，但它似乎是一個很好的幫助。 –

要馬上做，運行

join -j 2 <(sort -k 2,2 file01) <(sort -k 2,2 file02)

它管的AWK其在參考每一個變化做了計算：

gawk '{if (last != $1 $2) {calc_nearest_on_array; last=$1 $2; add_point_to_array} else {add_point_to_array}}'

來源

2012-05-10 05:13:14

你的嘗試看起來很有希望。然而，當我嘗試這樣做時，我遇到了一個問題：由於我每次都計算最近的'last！= $ 1 $ 2'，並且在讀取第一行'last'與'$ 1'和'$ 2'不同時，I不要在第一行獲取最近的點，並且在'file01'中沒有打印最後一行。請查看我編輯的問題以查看我的代碼。 –

你是對的。我寫了一個'概念'awk來展示這個想法。爲了打印最後一行，你需要使用END子句。 –

TXR：

@(do 
    (defvar pi 3.1415926535) 
    (defvar earth-radius 6378.7) 
    (defun rad (deg) (/ (* deg pi) 180)) 
    (defun sphere-distance (lat0 lon0 lat1 lon1) 
    (let ((lat0 (rad lat0)) (lat1 (rad lat1)) 
      (lon0 (rad lon0)) (lon1 (rad lon1))) 
     (* earth-radius (acos (+ (* (sin lat0) (sin lat1)) 
           (* (cos lat0) (cos lat1) (cos (- lon1 lon0))))))))) 
@(next "file01") 
@(collect) 
@id @ref @lon0 @lat0 
@ (filter :tonumber lon0 lat0) 
@ (next "file02") 
@ (bind (min-dist ws) (1e99 nil)) 
@ (collect) 
@ws1 @ref @lon1 @lat1 
@ (filter :tonumber lon1 lat1) 
@ (do (let ((d (sphere-distance lat0 lon0 lat1 lon1))) 
      (cond ((< d min-dist) 
        (set min-dist d) 
        (set ws ws1))))) 
@ (end) 
@ (do (format t "~a ~a ~0,2f ~0,2f ~a ~0,3f\n" id ref lon0 lat0 ws min-dist)) 
@(end)

運行：

$ txr dist.txr 
A 1001 -1.20 30.31 ws4 67.308 
A 1002 -1.20 30.31 ws2 35.783 
B 1002 -1.80 30.82 ws4 55.387 
B 1003 -1.80 30.82 ws4 55.387 
C 1001 -2.10 28.55 ws1 93.361

來源

2012-05-11 07:09:27 Kaz

什麼是TXR腳本？我的debian發行版沒有TXR。 –

TxR主頁是：http://www.nongnu.org/txr。沒有Debian軟件包維護者。 – Kaz

使用AWK合併兩個數據集

回答

相關問題