Efficiently splitting a file by column

I am looking for an efficient way to split a file by column. This is my file:

$ cat file.txt 
A:1:i 
B:2:ii 
X:9:iv

With a for loop I can print each field separately and redirect it to its own file:

$ for i in $(seq 1 3); do echo $i; awk -F ":" -v FL=$i '{print $FL}' file.txt > $i.out; done

So, for example:

$ cat 1.out 
A 
B 
X 

$ cat 2.out 
1 
2 
9 

$ cat 3.out 
i 
ii 
iv

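Question: I run this on nearly 70 columns, and the file size is nearly 10 GB. It works, but it is slow, because every pass of the loop reads the whole input again. Can anyone suggest a better, more efficient way to split this large data set? Thanks.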

$ for i in $(seq 1 70); do echo $i; awk -F ":" -v FL=$i '{print $FL}' *.data > $i.out; done

Source

2012-09-12 Jadu Saikia

Are you limited to shell scripts? – jedwards

Thanks, jedwards. Even Python would be fine. –

This should be fairly fast, considering what you are trying to do:

awk -F: '{ for (i=1; i<=NF; i++) print $i > i".out" }' file.txt

Source

2012-09-12 04:15:05 Steve

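Simple and elegant. Well done. The only catch is that it runs in gawk but not in BSD awk (the default awk on FreeBSD, OS X, etc.). I believe that awk doesn't like your redirection target. It does work with 'out = i".out"; print $i > out;' though. – ghoti

Thanks @ghoti :-) – Steve

In Perl, you could do this: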

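#!/usr/bin/perl -w
my $n = 3;    # number of columns
my @FILES;
# Open one output file per column
for my $i (1..$n) {
    open(my $f, '>', "$i.out") or die "Cannot open $i.out: $!";
    push @FILES, $f;
}
while (<>) {
    chomp;
    my @a = split(/:/);
    for my $i (0..$#a) {
        # A filehandle stored in an array must be wrapped in a block for print
        print {$FILES[$i]} $a[$i], "\n";
    }
}
close($_) for @FILES;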

Source

2012-09-12 04:22:05 epsalon

Python version

#!/usr/bin/env python

with open('file.txt', 'r') as ih:
    for line in ih:
        # Write each field to the output file for its column
        for i, element in enumerate(line.strip().split(':')):
            outfile = "%d.out" % (i + 1)
            # The output file is reopened in append mode on every write
            with open(outfile, 'a') as oh:
                oh.write("%s\n" % element)

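This may be a bit faster, since it only goes through the original file once. Note that it could be optimized further by keeping the output files open (as it stands, I close each of them and reopen it on every write).

Edit

For example, something like: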

#!/usr/bin/env python

handles = dict()  # output filename -> open file handle

with open('file.txt', 'r') as ih:
    for line in ih:
        for i, element in enumerate(line.strip().split(':')):
            outfile = "%d.out" % (i + 1)

            # Open each output file once, the first time its column is seen
            if outfile not in handles:
                handles[outfile] = open(outfile, 'a')

            handles[outfile].write("%s\n" % element)

# Close every output file
for k in handles:
    handles[k].close()

This keeps the handles open for the duration of the run, then closes them all before finishing.
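
A variant of the same idea, as a minimal sketch: `contextlib.ExitStack` keeps every output file open for the whole run and closes them all on exit, even if an exception is raised midway. The function name `split_columns` and its `sep` and `out_dir` parameters are hypothetical, introduced only for illustration. Unlike the scripts above, it opens the outputs in `'w'` mode, so a rerun overwrites the files instead of appending to them.

```python
import os
from contextlib import ExitStack


def split_columns(path, sep=":", out_dir="."):
    """Write field k of every line in `path` to `<out_dir>/<k>.out`.

    Hypothetical helper for illustration; returns the number of column files.
    """
    handles = []  # handles[k] is the open output file for column k + 1
    with ExitStack() as stack, open(path, "r") as src:
        for line in src:
            fields = line.rstrip("\n").split(sep)
            # Open output files lazily, the first time a column index is seen.
            while len(handles) < len(fields):
                name = os.path.join(out_dir, "%d.out" % (len(handles) + 1))
                handles.append(stack.enter_context(open(name, "w")))
            # zip stops at the shorter sequence, so short rows are fine.
            for handle, field in zip(handles, fields):
                handle.write(field + "\n")
    return len(handles)
```

Calling `split_columns("file.txt")` on the sample input would produce `1.out`, `2.out` and `3.out` in the current directory. With around 70 columns, the number of simultaneously open files is usually well below the per-process limit.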

Source

2012-09-12 04:25:57 jedwards

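Opening and closing files like this can be very expensive, especially on a network file system. – epsalon

@epsalon, I agree - I edited the answer to include a version that keeps the files open. – jedwards

With coreutils, if you know there are three columns: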

< file.txt tee >(cut -d: -f1 > 1.out) >(cut -d: -f2 > 2.out) >(cut -d: -f3 > 3.out) > /dev/null

To make it more generic, here is one way to generate the command line automatically:

# Determine number of fields and generate tee argument 
arg="" 
i=1 
while read; do 
    arg="$arg >(cut -d: -f$i > $((i++)).out)" 
done < <(head -n1 file.txt | tr ':' '\n')

arg is now:

>(cut -d: -f1 > 1.out) >(cut -d: -f2 > 2.out) >(cut -d: -f3 > 3.out)

Save it to a script file:

echo "< file.txt tee $arg > /dev/null" > script

and execute it:

. ./script
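
The same command line could also be generated from Python, for instance when the split is driven from a larger script. This is only a sketch: `build_tee_command` and its `path` and `sep` parameters are hypothetical names, and the resulting string still has to be run by bash, since `>(...)` process substitution is a bash feature.

```python
import shlex


def build_tee_command(first_line, path="file.txt", sep=":"):
    """Return the tee/cut command line for a file whose first line is given.

    Hypothetical helper for illustration; the field count is taken from
    `first_line`, exactly as the shell loop above does with `head -n1`.
    """
    n_fields = len(first_line.rstrip("\n").split(sep))
    # One process substitution per column: >(cut -d: -fK > K.out)
    subs = " ".join(
        ">(cut -d%s -f%d > %d.out)" % (shlex.quote(sep), k, k)
        for k in range(1, n_fields + 1)
    )
    return "< %s tee %s > /dev/null" % (shlex.quote(path), subs)


print(build_tee_command("A:1:i"))
# → < file.txt tee >(cut -d: -f1 > 1.out) >(cut -d: -f2 > 2.out) >(cut -d: -f3 > 3.out) > /dev/null
```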

Source

2012-09-12 08:49:05 Thor

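Here is a bash script that uses a feature I don't see very often: asking bash to allocate a file descriptor for a file and store the descriptor in a variable: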

# Read the first line to get a count of the columns
IFS=: read -r -a columns < file.txt

# Open an output file for each column, saving the file descriptor in an array
i=0
for c in "${columns[@]}"; do
    exec {a}>"$((++i)).txt"
    fds+=("$a")
done

# Iterate through the input, writing each column to the file opened for it
while IFS=: read -r -a fields; do
    i=0    # restart at the first descriptor for every line
    for f in "${fields[@]}"; do
        printf '%s\n' "$f" >&"${fds[i]}"
        i=$((i + 1))
    done
done < file.txt

# Close the file descriptors
for fd in "${fds[@]}"; do
    exec {fd}>&-
done

Source

2012-09-12 13:11:29 chepner
