Comparing two CSV files and producing a third file

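Lately, every month at work I have been carrying out validation checks on patients. It takes a few days to compare the validations from previous months with those of the current month: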

SeptemberVal.CSV

Gender MRN  Operation  Consultant TCI Date ... ... ... 
    Male 738495  CIRC  Dr Yates 05.12.13 ... ... ... 
    Female 247586 Cystoscopy Dr Know  10.12.13 ... ... ... 
    Male 617284  Biopsy  Dr Yates 25.12.13 ... ... ...

OctoberVal.CSV

Gender MRN  Operation  Consultant TCI Date ... ... ... 
    Male 491854  Biopsy  Dr Yates 05.12.13 ... ... ... 
    Female 247586 Cystoscopy Dr Know  10.12.13 ... ... ... 
    Female 285769  Biopsy  Dr Yates 25.12.13 ... ... ... 
    ...  ...   ...   ...   ...  ... ... ...

Output.csv

Gender MRN  Operation  Consultant TCI Date ... ... ... 
    Female 247586 Cystoscopy Dr Know  10.12.13 ... ... ... 
    ...  ...  ...   ...   ...  ... ... ...

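I would like to create a Perl script that compares the "MRN" column of SeptemberVal.csv and OctoberVal.csv and, once it finds a match, copies the whole matching row from SeptemberVal.CSV and pastes it into a new file.

Each validation sheet may have 800 patients, and many can carry over from the previous month. So next month I might have, say, 900 patients to validate, of which 400 may already be on the previous sheet and the rest are new.

Is this possible in Perl, and if so, how would I go about it? If anyone has any example code showing how to do this, I would be grateful. I would like to go with Perl in the long run because it is widely used in the work community.

Source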

2013-11-27 Marshal

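Do the files contain tab-separated data? – Kenosis

There is a [DBD driver for CSV](http://search.cpan.org/~hmbrand/DBD-CSV-0.41/lib/DBD/CSV.pm) files, which also supports SQL joins. – ceving

@Kenosis I believe mainly commas. – Marshal

There is a Perl example at *nix: perform set union/intersection/difference of lists. You will have to adapt it so that it only looks at the MRN column for the test.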
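
The set-intersection idea can be sketched for the MRN column as follows, using standard shell tools (cut, sort, comm) rather than the Perl from the linked example. This is only a sketch: it assumes comma-separated files with the MRN in field 2 and a single header line, and the sample rows and the file names sep.mrn, oct.mrn and common.mrn are made up for illustration.

```shell
#!/bin/sh
# Sketch: intersect the MRN columns of two CSV files.
# Assumptions: comma-separated data, MRN in field 2, one header line.
# Run in a scratch directory: it writes sample files there.
set -eu

cat > september.csv <<'EOF'
Gender,MRN,Operation,Consultant,TCI Date
Male,738495,CIRC,Dr Yates,05.12.13
Female,247586,Cystoscopy,Dr Know,10.12.13
Male,617284,Biopsy,Dr Yates,25.12.13
EOF

cat > october.csv <<'EOF'
Gender,MRN,Operation,Consultant,TCI Date
Male,491854,Biopsy,Dr Yates,05.12.13
Female,247586,Cystoscopy,Dr Know,10.12.13
Female,285769,Biopsy,Dr Yates,25.12.13
EOF

# Pull out column 2 (skipping the header) and sort it; comm needs sorted input.
tail -n +2 september.csv | cut -d, -f2 | LC_ALL=C sort -u > sep.mrn
tail -n +2 october.csv   | cut -d, -f2 | LC_ALL=C sort -u > oct.mrn

# -12 suppresses the lines unique to either file, leaving the intersection.
LC_ALL=C comm -12 sep.mrn oct.mrn > common.mrn
cat common.mrn
```

common.mrn then lists the MRNs present in both months; the matching rows still have to be looked up in the September file.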

Source

2013-11-27 19:50:06 jez

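You should try the unix command join.

With join, you can:

choose the field separator (comma);
choose the field used for the join (2);
format the output (rows from SeptemberVal.CSV).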
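
The three points above might be put together roughly like this. It is a sketch under assumptions, not a tested recipe for the real files: it assumes plain comma-separated data with five columns, no quoted fields containing commas, and one header line. The sample rows and the intermediate names sep.sorted and oct.sorted are invented for the example.

```shell
#!/bin/sh
# Sketch: use join to keep the September rows whose MRN also occurs in October.
# Assumptions: comma-separated, MRN in field 2, five columns, one header line.
# Run in a scratch directory: it writes sample files there.
set -eu

cat > september.csv <<'EOF'
Gender,MRN,Operation,Consultant,TCI Date
Male,738495,CIRC,Dr Yates,05.12.13
Female,247586,Cystoscopy,Dr Know,10.12.13
Male,617284,Biopsy,Dr Yates,25.12.13
EOF

cat > october.csv <<'EOF'
Gender,MRN,Operation,Consultant,TCI Date
Male,491854,Biopsy,Dr Yates,05.12.13
Female,247586,Cystoscopy,Dr Know,10.12.13
Female,285769,Biopsy,Dr Yates,25.12.13
EOF

# join requires both inputs to be sorted on the join field,
# so drop the header line and sort each file on field 2 first.
tail -n +2 september.csv | LC_ALL=C sort -t, -k2,2 > sep.sorted
tail -n +2 october.csv   | LC_ALL=C sort -t, -k2,2 > oct.sorted

{
    # Put the header back on top of the result.
    head -n 1 september.csv
    # -t,        field separator is a comma
    # -1 2 -2 2  join on field 2 of each file
    # -o 1.x     print only the fields of the first (September) file
    LC_ALL=C join -t, -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5 sep.sorted oct.sorted
} > output.csv

cat output.csv
```

Because the inputs are sorted first, the output rows come out in MRN order rather than in the original file order.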

Source

2013-11-27 21:00:02 Pierre

Here you go - this should do the job for you very nicely, and it is easy to read and modify too.

#!/usr/bin/perl
################################################################################
# File: ProcessMRNs
# Author: Mark Setchell
# stackoverflow.com/questions/20251625/perl-comparing-two-csv-files-and-producing-a-third
################################################################################
use strict;
use warnings;
use Data::Dumper;

my $Debug = 0;    # Set to 1 for debug output

# Check user has supplied last month's and this month's CSV file
if (@ARGV != 2) {
    print "Usage: $0 <last_monthCSV> <this_monthCSV>\n";
    exit 1;
}

# Pick up CSV filenames from parameters
my ($lastmonth, $thismonth) = @ARGV;

# Hash to keep last month's records in, indexed by MRN
my %prevMRNs;
my $header;

# Open last month's file and read into hash indexed by MRN
open(my $last_fh, "<", $lastmonth) or die "Unable to open $lastmonth: $!";
while (<$last_fh>) {
    chomp;                            # Remove end of line junk
    my (undef, $MRN) = split(" ");    # Extract MRN (2nd whitespace-separated field)
    next unless defined $MRN;         # Skip blank or one-field lines
    # Save table header if this is it
    if ($MRN =~ /MRN/) {
        $header = $_;
        next;
    }
    print "DEBUG: Read last month MRN:$MRN\n" if $Debug;
    # Save this MRN into our hash of records, indexed by MRNs
    $prevMRNs{$MRN} = $_;
}
close $last_fh;

# Show user what we got from last month's CSV
print Dumper \%prevMRNs if $Debug;

# Now open this month's file
open(my $this_fh, "<", $thismonth) or die "Unable to open $thismonth: $!";
print "$header\n" if defined $header;
while (<$this_fh>) {
    chomp;                            # Remove end of line junk
    my (undef, $MRN) = split(" ");    # Extract MRN
    next unless defined $MRN;         # Skip blank or one-field lines
    next if $MRN =~ /MRN/;            # Ignore header line
    print "DEBUG: Read current month MRN:$MRN\n" if $Debug;
    # THIS IS THE CRITICAL LINE IN THE WHOLE SCRIPT
    # If we saw this MRN last month, print what we saw
    print "$prevMRNs{$MRN}\n" if exists $prevMRNs{$MRN};
}
close $this_fh;

Here is the output without debugging:

Gender MRN  Operation  Consultant TCI Date ... ... ... 
    Female 247586 Cystoscopy Dr Know  10.12.13 ... ... ...

Here is the output with debugging turned on:

DEBUG: Read last month MRN:738495 
DEBUG: Read last month MRN:247586 
DEBUG: Read last month MRN:617284 
$VAR1 = { 
      '247586' => ' Female 247586 Cystoscopy Dr Know  10.12.13 ... ... ...', 
      '617284' => ' Male 617284  Biopsy  Dr Yates 25.12.13 ... ... ...', 
      '738495' => ' Male 738495  CIRC  Dr Yates 05.12.13 ... ... ...' 
     }; 
    Gender MRN  Operation  Consultant TCI Date ... ... ... 
DEBUG: Read current month MRN:491854 
DEBUG: Read current month MRN:247586 
    Female 247586 Cystoscopy Dr Know  10.12.13 ... ... ... 
DEBUG: Read current month MRN:285769

Assuming you save it as "ProcessMRNs", you run it like this:

chmod +x ProcessMRNs 
./ProcessMRNs september.csv october.csv

If you want the output to go to a file rather than the screen, add "> output.txt" to the end like this:

./ProcessMRNs september.csv october.csv > output.txt

Source

2013-11-28 12:35:00

Did this work for you? If so, could you accept my answer with a lovely big green tick? If not, please say what went wrong so that I/others can help you further. –

Just for fun, here is another (simpler) answer:

awk 'FNR==NR{a[$2]=$0;next}{if ($2 in a)print a[$2]}' september.csv october.csv

With the result:

Gender MRN  Operation  Consultant TCI Date ... ... ... 
Female 247586 Cystoscopy Dr Know  10.12.13 ... ... ...

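This works exactly like the Perl solution, but uses awk's associative arrays (like Perl's hashes), together with a trick for processing two input files, namely september.csv and october.csv.

The "FNR==NR" part (up to the "next") applies while the first file is being processed: for every record it finds in that file, it saves the whole record ($0) in the associative array ("a"), indexed by the MRN (field 2, or $2). Then (starting at the "if") it processes the second file (october.csv) and says "if this MRN (field 2, or $2) is in the array "a" (from the first pass through september.csv), then print whatever line we stored for this MRN at that point".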
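
The one-liner above splits fields on whitespace. If the real files are comma-separated, as the asker believes, the same two-pass idea might look like the sketch below. It assumes simple CSV with no quoted fields containing commas; the sample rows are invented and the array is renamed "seen" purely for readability.

```shell
#!/bin/sh
# Sketch: the same two-file awk technique with a comma as field separator.
# Assumption: simple CSV, no quoted fields with embedded commas.
# Run in a scratch directory: it writes sample files there.
set -eu

cat > september.csv <<'EOF'
Gender,MRN,Operation,Consultant,TCI Date
Male,738495,CIRC,Dr Yates,05.12.13
Female,247586,Cystoscopy,Dr Know,10.12.13
Male,617284,Biopsy,Dr Yates,25.12.13
EOF

cat > october.csv <<'EOF'
Gender,MRN,Operation,Consultant,TCI Date
Male,491854,Biopsy,Dr Yates,05.12.13
Female,247586,Cystoscopy,Dr Know,10.12.13
Female,285769,Biopsy,Dr Yates,25.12.13
EOF

# First file (FNR==NR): remember every line, indexed by field 2.
# Second file: print the remembered September line for each MRN seen before.
# The header is printed too, because "MRN" is field 2 of both header lines.
awk -F, '
    FNR == NR  { seen[$2] = $0; next }
    $2 in seen { print seen[$2] }
' september.csv october.csv > output.csv

cat output.csv
```

The output rows follow the order of october.csv, since that is the file driving the second pass.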

Source

2013-11-28 13:37:24

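How is your Perl?

First, you should use something like Text::ParseWords or Text::CSV to read in your files. Both handle column-delimited files and cope with quoted fields. Text::CSV is the most popular, but Text::ParseWords comes with Perl, so it is always available.

Is the MRN a unique number within each file? If so, you may want to use it as the key to your data structure. You will need to know how to use references in Perl, so if you are not familiar with Perl references, read the tutorial.

Think of each line of your file as keyed by its MRN number, with each line being a reference to another hash in which each column is keyed by the column's name: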

$october{738495}->{gender}  = "M"; 
$october{738495}->{operation} = "CIRC"; 
$october{738495}->{consultant} = "Dr Yates"; 
$october{738495}->{tci_date} = "05.12.13";

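Now you can go through this structure for September and pull out a record if the same MRN also appears in October:

for my $mrn (sort keys %september) {
    next unless exists $october{$mrn};    # Only MRNs found in both September and October
    if ($september{$mrn}->{gender} eq $october{$mrn}->{gender}
        and $september{$mrn}->{consultant} eq $october{$mrn}->{consultant}
        # ... compare any further columns here
    ) {
        # ... the compared columns match
    }
    else {
        # ... at least one compared column differs
    }
}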

If you know object-oriented Perl, you should use it; it also helps to normalize things like gender, consultant names, dates, and so on.

Source

2014-12-25 21:16:34
