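Perl function to extract data given a starting column and length

I want to write code that extracts data from file A and pastes the column data between a specified start and end point into file B. So far I have only managed to copy all the data from A to B, but I cannot filter out a column anywhere. I tried looking into splice and grep, to no avail. I have no experience in Perl. The data has no column headers. Sample (the real data is thousands of lines long, so I cannot insert the data into the function):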

AAA 565 u8y 221
ABC 454 9u8 352
ADH 115 i98 544
AKS 352 87y 454
GJS 154 i9k 141

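I want all the unique values of the third column (start: 8, length: 3) copied to file B. I tried the solution offered in "How to extract a particular column of data in Perl?", to no avail.

Thanks for any tips or help!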

#!/usr/bin/perl 
use strict; 
use warnings; 

#use Cwd qw(abs_path); 

#my $dir = '/home/ 
#$dir = abd_path($dir); 
my $filename = "filea.txt"; 
my $newfilename = "fileb.txt"; 

#Open file to read raw data 
open (DATA1, "<$filename") or die "Couldn't open $filename: $!"; 

#Open new file to copy desired columns 
open (DATA2, ">$newfilename") or die "Couldn't open $newfilename: $!"; 

#Copy data from original to new file 

while (<DATA1>) { 
    #DATA2=splice(DATA1, 0,5); 
    print DATA2 $_; 
    my @fifth_column = map{(split)[1]} split /\n/, $newfilename;  
}

Source

2013-02-26 todayspresent

If I understand you correctly, you can use a fairly simple script.

use strict; 
use warnings; 

my %seen; 
while (<DATA>) { 
    my $str = substr($_, 8, 3);   # the string you seek
    unless ($seen{$str}++) {      # if it has not been seen before
        print "$str\n";           # ...print it
    }
} 

__DATA__ 
AAA 565 u8y 221 
AAA 565 u8y 221 
ABC 454 9u8 352 
ADH 115 i98 544 
AKS 352 87y 454 
GJS 154 i9k 141

Output:

u8y 
9u8 
i98 
87y 
i9k

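The DATA file handle is used here for demonstration. I also added a duplicate line to the data to demonstrate the deduplication. If you change <DATA> to <>, you can simply use the script like this: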

perl script.pl filea.txt > fileb.txt

Note that this relies on your data being fixed width; that is, if your fields do not line up, your output will be corrupted.
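To make the fixed-width caveat concrete, here is a minimal sketch. The second sample line is hypothetical (it is not from the question's data): its first field is one character shorter, so every later field shifts left by one position.

```perl
use strict;
use warnings;

# The first line is aligned like the question's sample; the second is a
# hypothetical misaligned line with a two-character first field.
my $aligned    = "AAA 565 u8y 221";
my $misaligned = "AB 565 u8y 221";

# substr counts from offset 0, so offset 8 is the ninth character.
my $from_aligned    = substr($aligned,    8, 3);
my $from_misaligned = substr($misaligned, 8, 3);

print "aligned:    '$from_aligned'\n";      # 'u8y'
print "misaligned: '$from_misaligned'\n";   # '8y ' (shifted by one)
```

The misaligned line silently yields part of the field plus a space, which is the kind of corrupted output to expect when the input is not strictly fixed width.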

Also note that this is just the full version of a simple one-liner, like this:

perl -nlwe '$x=substr($_,8,3); print $x unless $seen{$x}++' filea.txt > fileb.txt

Source

2013-02-26 16:53:16 TLP

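The one-liner works great! Thank you so much for telling me about this option! It did spit out the warning below, even though it gave me the correct output. @TLP Name "main::seen" used only once: possible typo at -e line 1. – todayspresent 2013-02-26 19:30:17

Wanted to add: I am not terribly worried about the warning... just an FYI @TLP – todayspresent 2013-02-26 19:36:20

@todayspresent That warning is not relevant. It only appears because we are actually doing two things at once: 'unless $seen{$x}++' is really short for 'unless $seen{$x}; $seen{$x}++;'. Add to that the fact that the one-liner does not use strict, so we do not declare the variable. If you feel your question has been answered, you can click the check mark next to the answer to accept it. – TLP 2013-02-26 19:48:34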
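As a small illustration of the "two things at once" point, the sketch below runs the same idiom under strict with a declared hash, which is one way to avoid the "used only once" warning. The input list and the @first_time array are made up for the example.

```perl
use strict;
use warnings;

my %seen;         # declared, so the name is not "used only once"
my @first_time;   # hypothetical collector for values seen for the first time

for my $x (qw(u8y u8y 9u8)) {
    # Post-increment returns the old count (0, i.e. false, the first time)
    # and then bumps it, so the test and the bookkeeping are a single step.
    push @first_time, $x unless $seen{$x}++;
}

print "@first_time\n";   # u8y 9u8
```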

Take a look at the following Perl commands:

split: lets you split a line of data into an array.

For example:

while (my $line = <$input_fh>) { 
    my @items = split /\s+/, $line;   # Columns are separated by spaces or tabs
    my $third_column = $items[2];     # The column you want
    # blah...blah...blah (do something with $third_column here)
}

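substr: lets you pick out a piece of the string by column position. If your columns are separated by tabs, this may not be useful. For most non-Perl developers, this is the first approach they try. However, I recommend using split.

There is a Perl trick for making sure your data is unique: use a hash to store your information. Looking up data in a hash is fast, and you can use the exists function to quickly find out whether you have already seen that data. Combining this with split: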
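To illustrate the difference, here is a minimal sketch on a hypothetical tab-separated line whose first field is four characters wide, so the third field no longer starts at offset 8. The sample line is an assumption, not part of the question's data.

```perl
use strict;
use warnings;

# Hypothetical tab-separated line with a four-character first field.
my $line = "ADHX\t115\ti98\t544";

# split: the third whitespace-separated field, wherever it happens to start.
my $by_split = (split /\s+/, $line)[2];

# substr: whatever three characters sit at offset 8.
my $by_substr = substr($line, 8, 3);

print "split and substr ",
    ($by_split eq $by_substr ? "agree" : "disagree"),
    " on this line\n";
```

Here split returns i98, while substr returns a tab followed by the first two characters of the field, so the two approaches only agree when every line is aligned.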

use strict;
use warnings;
use autodie;

use constant {
    INPUT_FILE  => "filea.txt",
    OUTPUT_FILE => "fileb.txt",
};

open my $input_fh,  "<", INPUT_FILE;
open my $output_fh, ">", OUTPUT_FILE;

my %unique_columns;
while (my $line = <$input_fh>) {
    my @items = split /\s+/, $line;   # Columns are separated by spaces or tabs
    my $third_column = $items[2];     # The column you want
    if (not exists $unique_columns{$third_column}) {
        $unique_columns{$third_column} = 1;
        print {$output_fh} "$third_column\n";
    }
}
close $input_fh;
close $output_fh;

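The %unique_columns hash tracks whether you have already seen the data in the third column of your file. It does not matter what value you set each individual key to. However, I recommend setting it to a value that is neither zero nor blank, because then if you write:

if ($unique_columns{$data})

instead of

if (exists $unique_columns{$data})

the program will still work as long as the value of $unique_columns{$data} is not zero or blank, but it will fail otherwise.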
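A minimal sketch of that difference, using a hypothetical hash in which one key was deliberately stored with the value 0:

```perl
use strict;
use warnings;

# Hypothetical contents: 'u8y' stored with a true value, 'i98' stored with 0.
my %unique_columns = (u8y => 1, i98 => 0);

my %report;   # made-up structure that records both test results per key
for my $data (qw(u8y i98 zzz)) {
    $report{$data} = {
        exists => (exists $unique_columns{$data} ? "yes" : "no"),
        truth  => ($unique_columns{$data}        ? "yes" : "no"),
    };
    print "$data: exists=$report{$data}{exists} truth=$report{$data}{truth}\n";
}
```

For i98 the two tests disagree: the key exists, but its value is false, so the plain truth test would treat it as never seen.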

Source

2013-02-26 17:47:32

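This is very helpful. I will have to work out which lines replace the "blahs" and "yaddas". Thanks for the explanation! :-) @David W. – todayspresent 2013-02-26 19:41:21

The yaddas would be: print DATA2 $third_column; – sventechie 2013-02-26 20:40:13

@todayspresent - I have updated my answer to show the whole program. – 2013-02-26 22:17:23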

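When it comes to fixed-length data, nothing beats pack/unpack. Study this tutorial; it will make your life easier and turn this job into a piece of cake.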

http://linux.die.net/man/1/perlpacktut

Source

2013-02-26 20:29:16

The template '(A4)*' should unpack your strings. – fenway 2013-02-27 06:55:11
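For readers who have not used unpack, here is a minimal sketch of the suggested template applied to one line from the question. The second call, with the template 'x8 A3', is an assumption added for illustration: it skips eight bytes and takes the next three, mirroring the start and length given in the question.

```perl
use strict;
use warnings;

my $line = "AAA 565 u8y 221";

# '(A4)*' repeats a 4-character field until the input runs out;
# the 'A' format strips trailing spaces from each field.
my @fields = unpack '(A4)*', $line;

# 'x8 A3' skips 8 bytes, then takes 3 characters (start 8, length 3).
my ($third) = unpack 'x8 A3', $line;

print join('|', @fields), "\n";   # AAA|565|u8y|221
print "$third\n";                 # u8y
```

Like substr, both templates assume every line uses the same fixed field widths.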
