2013-11-25 191 views
4

我有一個逗號分隔的文件,包含12列。從逗號分隔的文件中刪除額外的逗號

第5列和第6列有問題(第5列和第6列的文字完全相同,但它們之間可能有額外的逗號),其中包含額外的逗號。

2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 

所以在上面的例子「嘿,你好,」不應該有一個逗號。

我需要刪除第5和第6列中的額外逗號。

+0

第4列和第7列總是包含數字? –

+2

如果可能的話,最好在包含逗號的列上使用封裝來正確地重新請求或重新生成csv文件。 例如'2011,123456,1234567,12345678,「你好,你好嗎」,「你好,你好嗎」,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two' – AeroX

回答

4

如果你總是想刪除第五逗號,儘量

sed 's/,//5' input.txt 

但你說,它可能額外的逗號。你必須提供一個邏輯如何找出額外的逗號是否存在。

如果您知道逗號的數量,則可以使用。這已被證明是相當的鍛鍊,我相信別人會拿出一個更優雅的解決方案,但無論如何,我會分享我:

awk -f script.awk input.txt 

與script.awk:

BEGIN{ 
    FS="," 
} 
NF<=12{ 
    print $0 
} 
NF>12{ 
    for (i=1; i<=4; i++) printf $i FS 
    for (j=0; j<2; j++){ 
     for (i=0; i<=(NF-12)/2; i++){ 
      printf $(i+5) 
      if (i<(NF-12)/2) printf "_" 
      else printf FS 
     } 
    } 
    for (i=NF-5; i<=NF; i++) printf $i FS 
    printf "n" 
} 

首先,我們將字段分隔符設置爲,。如果我們計數小於或等於12字段,一切都很好,我們只需打印整行。如果有超過12個字段,我們首先打印前4個字段(再次使用字段分隔符),然後打印兩次字段5(和字段6),但不打印,,我們將其與_進行交換。最後我們打印剩下的字段。

正如我所說,這可能是一個更優雅的解決方案。我想知道其他人出現了什麼。

+0

每行應該有11個逗號(有12列),第5和第6列有額外的逗號。 – Stu

2

如果所有其他字段都是數字字段,則可以嘗試按照該條件保存有用的逗號。

sed -r 's/(,)[0-9]/;/g' a | sed -r 's/[0-9](,)/;/g' | sed -r 's/,//g' | awk -F\; '{ print $1 "," $2 "," $3 "," $4 "," substr($5, 0, length($5)/2) "," substr($5, length($5)/2 +1, length($5)/2) "," $6 "," $7}' 
2011,23456,234567,234567,Hey ThereHow are you,Hey ThereHow are you,8286430903, 
1

可以與及其Text::CSV_XS模塊嘗試:

#!/usr/bin/env perl 

use warnings; 
use strict; 
use Text::CSV_XS; 

my (@columns); 

open my $fh, '<', shift or die; 

my $csv = Text::CSV_XS->new or die; 
while (my $row = $csv->getline($fh)) { 
    undef @columns; 
    if (@$row <= 12) { 
     @columns = @$row; 
     next; 
    } 

    my $extra_columns = (@$row - 12)/2; 
    my $post_columns_index = 4 + 2 * $extra_columns * 2; 
    @columns = ( 
     @$row[0..3], 
     (join('', @$row[4..(4+$extra_columns)])) x 2, 
     @$row[$post_columns_index..$#$row] 
    ); 
} 
continue { 
    $csv->print(\*STDOUT, \@columns); 
    printf "\n"; 
} 

假設與三根線,其中所述第一個具有一個額外的逗號一個輸入文件(infile),第二個具有兩個附加逗號,第三個是正確的:

2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 
2011,123456,1234567,12345678,Hey There,How are you,now,Hey There,How are you,now,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 
2011,123456,1234567,12345678,Hey There:How are you,Hey There:How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 

運行腳本,如:

perl script.pl infile 

國債收益率:

2011,123456,1234567,12345678,"Hey ThereHow are you","Hey ThereHow are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two" 
2011,123456,1234567,12345678,"Hey ThereHow are younow","Hey ThereHow are younow",LABACD,1.00000000,80.2500000,"One Two" 
2011,123456,1234567,12345678,"Hey There:How are you","Hey There:How are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two" 

需要注意的是它增加了一些報價,但它是正確的總部設在csv規範,更容易處理了以前的狀態。