2015-03-13 67 views
-2

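Perl script: remove self-duplicated rows

I want to process some Twitter datasets with a Perl script. The files are in CSV format.

I want to remove the rows where a user mentions themselves.

The CSV columns and data look like this, for example: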

user, mention(user), message 
vims789, vnjuei234, yea this is good 
dfion, youwen12, this is win 
don234, don234, this is green 
wen123, tileas, this is blue 

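The duplicate is "don234, don234", where the user mentions themselves; that row should be removed. The expected result is:

user, mention(user), message
vims789, vnjuei234, yea this is good
dfion, youwen12, this is win
wen123, tileas, this is blue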

+3

What do you have so far, and what problem are you having with it? – Sobrique 2015-03-13 18:03:52

+0

I initially tried a simple `sort FILE | uniq -c`, but it gave the wrong result; – Richardsop 2015-03-13 23:13:52

Answers

0

Maybe something like this:

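#!/usr/bin/perl
use strict;
use warnings;

use Text::CSV;

# allow_whitespace strips the spaces around the separators, so that
# "don234" and " don234" compare as equal.
my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

while (my $row = $csv->getline(\*DATA)) {
    my ($user, $mention, $message) = @$row;
    # Print the whole row unless the user mentions themselves.
    print join(', ', @$row), "\n" unless $user eq $mention;
}
__DATA__
user, mention(user), Message
vims789, vnjuei234, yea this is good
dfion, youwen12, this is win
don234, don234, this is green
wen123, tileas, this is blue

+0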

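If that really is the data the OP has, then `Text::CSV` isn't much use, because it restricts the field separator to a single ASCII character. I would suggest `split /\s*,\s*/`, since I doubt there is always exactly one space after each comma and none before it. – Borodin 2015-03-13 18:32:34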
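
A minimal sketch of the `split` approach suggested in the comment above. It assumes that no field contains an embedded comma or quoting (otherwise a real CSV parser is the safer choice); the helper name `is_self_mention` and the inline sample lines are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when the first two fields of a line are identical.
# Assumes plain comma-separated fields with no quoting.
sub is_self_mention {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim both ends, including the newline
    # Limit of 3 keeps any commas inside the message in the last field.
    my ($user, $mention) = split /\s*,\s*/, $line, 3;
    return defined($mention) && $user eq $mention;
}

# Sample rows from the question; in practice these would be read from the file.
my @lines = (
    "user, mention(user), message\n",
    "vims789, vnjuei234, yea this is good\n",
    "dfion, youwen12, this is win\n",
    "don234, don234, this is green\n",
    "wen123, tileas, this is blue\n",
);

my @kept = grep { !is_self_mention($_) } @lines;
print @kept;
```

Because the pattern is `\s*,\s*`, the comparison does not depend on how many spaces surround each comma.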

0

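You can do this very quickly with a backreference. You want to find something, then a comma, some whitespace, and then the same thing again. Assuming the user names consist entirely of word characters, this should work: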

my $regex 
    = qr{^ # beginning of the line 
      (\w+) # A "word" 
      ,  # A comma 
      \s+ # space 
      \1 # a back reference to the first capture. 
      \b # demand that it end the sequence of word characters. 
     }x; 

my @filtered_lines = grep { !m/$regex/ } @lines;
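
As a rough end-to-end sketch, the same pattern can be tried against the sample rows from the question. The inline `@lines` array stands in for the file contents, and the row with `don2345` is made up here to show why the trailing `\b` matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A word, a comma, whitespace, then the same word again, ending at a word boundary.
my $regex = qr{^ (\w+) , \s+ \1 \b}x;

my @lines = (
    "user, mention(user), message\n",
    "vims789, vnjuei234, yea this is good\n",
    "dfion, youwen12, this is win\n",
    "don234, don234, this is green\n",
    "don234, don2345, made-up row where the mention only shares a prefix\n",
    "wen123, tileas, this is blue\n",
);

# Keep every line that is not a self-mention.
my @filtered_lines = grep { !m/$regex/ } @lines;
print @filtered_lines;
```

Only the `don234, don234` row is dropped; the `don2345` row survives because `\b` cannot match between `4` and `5`. If the file might contain rows with no space after the comma, `\s*` in place of `\s+` would cover that case as well.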