Selecting rows according to certain criteria

I have a ten-column data list, as shown below. It has a few thousand lines.

$1 $2 $3 $4 $5  $6  $7 $8 $9 $10 

| 8455 [email protected] | 8132 [email protected] 8131 [email protected] | 68.43 
| 7490 [email protected] | 8868 [email protected] 8867 [email protected] | 68.30 
| 7561 [email protected] | 9185 [email protected] 9184 [email protected] | 66.83 
| 8776 [email protected] | 7481 [email protected] 7480 [email protected] | 65.55 
| 8867 [email protected] | 8432 [email protected] 8431 [email protected] | 64.48 
| 9832 [email protected] | 6357 [email protected] 6356 [email protected] | 64.44 
| 9194 [email protected] | 5699 [email protected] 5698 [email protected] | 64.06 
| 8849 [email protected] | 5780 [email protected] 5779 [email protected] | 63.99

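I want to select the rows in which columns $3 and $6 match a certain expression. The criterion I want to express as a regular expression is that the number before the '@' symbol is the same in both columns. If this criterion matches, I want to output those rows to a new file.

I tried this in AWK: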

awk '$3~/[[email protected]]/ {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' hhHB_inSameLayer_065_128-maltoLyo12per.tbl

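but it did not give me what I wanted.

I would appreciate it if someone could give me some help.

Please note: help in Perl or Python would be just as welcome.

Thanks a lot in advance.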

Source

2013-07-06 Vijay

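Are the first two lines (the one with '$1', '$2' ... and the blank line) actually present in the file, or did you put them there just for illustration? – doubleDown

Sorry for the late reply. Actually, the original has only a few such lines (about 8 lines of text). – Vijay

Hello. Did you see my answer? – eyquem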

Try the following in awk. Split $3 and $6 into arrays on the @ separator and print the line if the first elements of the two arrays match:

awk '{split($3, a, "@"); split($6, b, "@");if (a[1] == b[1]) print}'

Or, more idiomatically:

awk '{split($3, a, "@"); split($6, b, "@")}; a[1] == b[1]'

Or a quick Python 2.6+ solution:

from __future__ import print_function
with open('testfile.txt') as f:
    for line in f:
        fields = line.split()
        fields3 = fields[2].split('@')
        fields6 = fields[5].split('@')
        if fields3[0] == fields6[0]:
            print(line, end='')

Source

2013-07-06 04:28:27 iruvar

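I tried the awk command. It works fine. Thanks, 1_CR. – Vijay

@Vijay, good to know! Please [accept the answer](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) if it works for you. – iruvar

Also, in column $3, if I want to select on the characters after the '@' symbol (e.g. O13 or O25), what changes should I make? I tried [awk '{split($3, a, "@"); if (a[1] == O22) print}'], but it doesn't seem to work. – Vijay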
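One possible reading of the follow-up above, sketched in Python rather than awk: after splitting on '@', the text that follows the symbol is the second piece, not the first, and it has to be compared against a quoted string (in awk, an unquoted O22 is an uninitialized variable, not the text "O22"). The sample values and the helper name `suffix_is` are made up for illustration, since the real field contents are not legible in the question.

```python
def suffix_is(line, wanted, column=3):
    """Return True if the text after '@' in the given 1-based column equals wanted."""
    fields = line.split()
    if len(fields) < column or '@' not in fields[column - 1]:
        return False  # header, blank, or malformed line
    # partition() returns (before, '@', after); keep the part after '@'
    return fields[column - 1].partition('@')[2] == wanted

# Hypothetical sample lines (values invented for illustration).
lines = [
    "| 8455 12@O13 | 8132 12@O22 8131 12@O25 | 68.43",
    "| 7490 12@O22 | 8868 47@O22 8867 47@O25 | 68.30",
]

matching = [line for line in lines if suffix_is(line, "O22")]
print(matching)
```

Only column 3 is inspected here, so the first line is not selected even though its column 6 ends in O22.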

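Here is a Python solution using the built-in csv module. It stores all lines that meet the criterion in the list stored_lines.

** Edited to skip the header and to not treat multiple spaces as multiple delimiters. **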

import csv

def is_good(line):
    # Compare the text before '@' in the third and sixth fields
    return line[2][:line[2].find('@')] == line[5][:line[5].find('@')]

# we'll put the lines that match the criteria here.
stored_lines = []

with open('stack.txt') as fr:
    csv_reader = csv.reader(fr, delimiter=' ', skipinitialspace=True)

    # Skip the header line and the blank line after it;
    # next(csv_reader) works on both Python 2.6+ and Python 3
    next(csv_reader)
    next(csv_reader)
    for line in csv_reader:
        if is_good(line):
            stored_lines.append(line)

print(stored_lines)

Source

2013-07-06 04:40:54

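You need to make some changes for this to work with the input file the OP describes. First, you need to skip the first two lines. I believe there is a way to tell the csv reader that the first line is a header, but your code would fail on the blank line anyway. Also, because the delimiter is ' ', your indices need to be 3 and 9, not 2 and 5. – sberry

@sberry Good point about needing to skip the first two lines; I have made the correction. I fixed the index problem by ignoring multiple spaces (by setting "skipinitialspace=True"). If I have messed up somewhere else, I will have to fix it tomorrow; it is bedtime here. –

From the documentation of the ``csv`` module: _"If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference."_ – eyquem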
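The index shift discussed in these two comments can be seen on a single line. This is only a small sketch: the line below is hypothetical (its values are invented, and it contains one double space on purpose), and it feeds the reader a list of strings instead of a file.

```python
import csv

# Hypothetical line with a double space between two of its fields.
row_text = "| 8455 12@O13 |  8132 12@O22"

# With a plain single-space delimiter, the double space yields an empty field.
plain = next(csv.reader([row_text], delimiter=' '))

# skipinitialspace=True ignores whitespace that directly follows a delimiter,
# so the empty field disappears and the indices stay at 2 and 5.
skipped = next(csv.reader([row_text], delimiter=' ', skipinitialspace=True))

print(plain)
print(skipped)
```

With the extra empty field in `plain`, the sixth column is no longer at index 5, which is the kind of mismatch sberry describes.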

Sigh, three solutions posted before I could even whip this one up...

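import re

with open("sorted data.txt", "w") as write_file:
    with open("data.txt", "r") as read_file:
        for line in read_file:
            # Split on runs of whitespace, '|' and '@'
            data_list = re.split(r"[\s|@]+", line)
            # Skip short lines (e.g. blank ones) instead of raising IndexError
            if len(data_list) > 5 and data_list[2] == data_list[5]:
                write_file.write(line)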

I'm afraid I only dabble in Perl and awk, but re.split saves the day here, and it is nice and readable.
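To see why indices 2 and 5 are the right ones after the re.split call above, it may help to look at what the split returns for one line. The line below is hypothetical; the values around '@' are invented for illustration.

```python
import re

# Hypothetical line; the values around '@' are invented for illustration.
line = "| 8455 12@O13 | 8132 12@O22 8131 12@O25 | 68.43\n"

# Runs of whitespace, '|' and '@' all act as a single separator.
parts = re.split(r"[\s|@]+", line)
print(parts)

# The leading "| " produces an empty first item, so the number before '@'
# in column 3 ends up at parts[2] and the one in column 6 at parts[5].
print(parts[2], parts[5])
```

Because '@' is itself a separator, each `number@text` field contributes two items to the list, which is why the list indices do not line up with the awk column numbers.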

Source

2013-07-06 04:54:39

Code for GNU sed:

sed -r '/^\|\s+\S+\s+([0-9][email protected]).*\|.*\1/!d' file

Assuming there are two header lines:

sed -r '1,2p;/^\|\s+\S+\s+([0-9][email protected]).*\|.*\1/!d' file

Source

2013-07-06 06:12:48 captcha

In Perl:

while(<DATA>){ 

    # split the line by whitespace 
    my @columns = split; 

    # get number from column 3 
    my ($value_col_3) = $columns[2] =~ m{ \A (\d+) \@ }msx; 

    # get number from column 6 
    my ($value_col_6) = $columns[5] =~ m{ \A (\d+) \@ }msx; 

    if($value_col_3 == $value_col_6){
        print;
    }
} 

__DATA__ 
| 8455 [email protected] | 8132 [email protected] 8131 [email protected] | 68.43 
| 7490 [email protected] | 8868 [email protected] 8867 [email protected] | 68.30 
| 7561 [email protected] | 9185 [email protected] 9184 [email protected] | 66.83 
| 8776 [email protected] | 7481 [email protected] 7480 [email protected] | 65.55 
| 8867 [email protected] | 8432 [email protected] 8431 [email protected] | 64.48 
| 9832 [email protected] | 6357 [email protected] 6356 [email protected] | 64.44 
| 9194 [email protected] | 5699 [email protected] 5698 [email protected] | 64.06 
| 8849 [email protected] | 5780 [email protected] 5779 [email protected] | 63.99

Source

2013-07-06 12:20:09 shawnhcorey

Here is a Perl one-liner, using a single regular expression pattern with a backreference:

perl -ne 'print if m/^\S+\s+\S+\s+(\d+\@)\S+\s+\S+\s+\S+\s+\1/' hhHB_inSameLayer_065_128-maltoLyo12per.tbl > hhHB_inSameLayer_065_128-maltoLyo12per_reduced.tbl

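(I'm surprised that nobody has yet pointed out the obvious flaw in Vijay's original problem statement: none of the records in the given example meets the criterion.)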
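For readers more at home in Python, the backreference idea from the one-liner above can be sketched roughly as follows. The sample lines are hypothetical (their values are invented so that one line matches and one does not), and `same_prefix` is just an illustrative helper name.

```python
import re

# Same idea as the Perl pattern: capture "digits@" at the start of column 3,
# then require the identical text (\1) at the start of column 6.
PATTERN = re.compile(r'^\S+\s+\S+\s+(\d+@)\S+\s+\S+\s+\S+\s+\1')

def same_prefix(line):
    """Return True if the number before '@' is the same in columns 3 and 6."""
    return PATTERN.match(line) is not None

# Hypothetical sample lines (values invented for illustration).
lines = [
    "| 8455 12@O13 | 8132 12@O22 8131 12@O25 | 68.43",  # 12 and 12
    "| 7490 12@O13 | 8868 47@O22 8867 47@O25 | 68.30",  # 12 and 47
]

for line in lines:
    if same_prefix(line):
        print(line)
```

Keeping the '@' inside the captured group means that a column 6 value with a longer number sharing the same leading digits cannot match by accident.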

Source

2013-07-07 01:18:26

+1 for the solution and the remark – eyquem

import re 

su = ''' 
$1 $2 $3 $4 $5  $6  $7 $8 $9 $10 

| 8455 [email protected] | 8132 [email protected] 8131 [email protected] | 68.43 
| 7490 [email protected] | 8868 [email protected] 8867 [email protected] | 68.30 
| 7561 [email protected] | 9185 [email protected] 9184 [email protected] | 66.83 
| 8776 [email protected] | 7481 [email protected] 7480 [email protected] | 65.55 
| 8867 [email protected] | 8432 [email protected] 8431 [email protected] | 64.48 
| 9832 [email protected] | 6357 [email protected] 6356 [email protected] | 64.44 
| 9194 [email protected] | 5699 [email protected] 5698 [email protected] | 64.06 
| 8849 [email protected] | 5780 [email protected] 5779 [email protected] | 63.99''' 

f = re.compile(
    r'(^\|[^|]+?[ \t](\S+?)@\S+[ \t]+?'
    r'\|[^|]+?[ \t](\2)@\S+.+)',
    re.MULTILINE).finditer

# print() with a single argument works on both Python 2 and Python 3
print([m.group(1) for m in f(su)])

Source

2013-07-07 09:13:35 eyquem
