2014-07-19

Counting duplicate pairs with a constraint

I have the following file (the file is sorted on all three columns):

D000001 D000001 1975 
D000001 D000001 1976 
D000001 D002413 1976 
D000001 D002413 1979 
D000001 D002413 1987 
D000001 D004298 1976 
D000002 D000002 1985 
D000003 D000900 1975 
D000003 D000900 1990 
D000003 D004134 1983 
D000003 D004134 1986 

I need to count the duplicate pairs (on the first and second columns) and assign each such pair the lowest value from the third column. For my toy file the output should be:

D000001 D000001 2 1975 
D000001 D002413 3 1976 
D000001 D004298 1 1976 
D000002 D000002 1 1985 
D000003 D000900 2 1975 
D000003 D004134 2 1983 

My questions

  1. The files are huge (1 GB to 5 GB). What is the most suitable programming structure for implementing this?
  2. How do I print the last (third) column correctly? In the current setup (see the code below), the program prints the last (i.e. highest) value.

My initial attempt, which produces the current output, is below.

my_dict = {} 

with open('test.srt') as infile: 
    for line in infile: 
        line = line.rstrip() 
        word1, word2, year = line.split('|') 
        year = int(year) 
        my_tuple = (word1, word2) 
        if my_tuple in my_dict: 
            freq += 1 
            my_dict[my_tuple] = (freq, year)  # year is overwritten, so the highest one wins 
        else: 
            freq = 1 
            my_dict[my_tuple] = (freq, year) 

for key, value in my_dict.items(): 
    print key[0], key[1], value 

Current output:

D000001 D000001 (2, 1976) ## Should be 1975 etc. 
D000001 D002413 (3, 1987) 
D000001 D004298 (1, 1976) 
D000002 D000002 (1, 1985) 
D000003 D000900 (2, 1990) 
D000003 D004134 (2, 1986) 
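A minimal fix for question 2 is to keep the minimum of the stored year and the incoming one instead of overwriting it. A sketch (Python 3 syntax, inline toy data; swap `line.split()` for `line.split('|')` if the real file is pipe-delimited):

```python
from io import StringIO

# Toy stand-in for the real file; fields are space-separated here.
data = StringIO(
    "D000001 D000001 1975\n"
    "D000001 D000001 1976\n"
    "D000001 D002413 1976\n"
)

my_dict = {}
for line in data:
    word1, word2, year = line.split()
    year = int(year)
    key = (word1, word2)
    if key in my_dict:
        freq, low = my_dict[key]
        my_dict[key] = (freq + 1, min(low, year))  # keep the lowest year
    else:
        my_dict[key] = (1, year)

for (w1, w2), (freq, low) in sorted(my_dict.items()):
    print(w1, w2, freq, low)
```

Note that this still keeps one dictionary entry per distinct pair in memory, so for 1–5 GB files the streaming answers below scale better.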

Why is the file huge *and* sorted? It seems time has already been spent preparing it this way, although that greatly simplifies the processing at this stage, since duplicate pairs are consecutive. – Kaz

Answers


Solution:

#!/usr/bin/env python 


def readdata(filename): 
    last = [] 
    count = 0 

    with open(filename, "r") as fd: 
        for line in fd: 
            tokens = line.strip().split() 
            tokens[2] = int(tokens[2]) 

            if not last: 
                last = tokens 

            if tokens[:2] != last[:2]: 
                # the pair changed: emit the finished group; because the 
                # file is sorted on all three columns, last[2] (the first 
                # year seen for the group) is already the minimum 
                yield last[:2], count, last[2] 
                last = tokens 
                count = 1 
            else: 
                count += 1 

        if last:  # emit the final group (guards against an empty file) 
            yield last[:2], count, last[2] 


with open("output.txt", "w") as fd: 
    for words, count, year in readdata("data.txt"): 
        fd.write( 
            "{0:s} {1:s} ({2:d} {3:d})\n".format( 
                words[0], words[1], count, year 
            ) 
        ) 

Output:

D000001 D000001 (2 1975) 
D000001 D002413 (3 1976) 
D000001 D004298 (1 1976) 
D000002 D000002 (1 1985) 
D000003 D000900 (2 1975) 
D000003 D004134 (2 1983) 

Discussion:

  • This reads and processes the data iteratively (Python 2.x), so it never has to hold everything in memory and can therefore handle very large data files.
  • As long as the input data is sorted, no complex data structure is needed either. We only have to keep track of the last set of tokens and record the minimum year for each group of "duplicates".

The algorithm is essentially very similar to the other answer that uses itertools.groupby (but that one assumes Python 3.x).

It is perhaps worth noting that this implementation is also O(n) (Big O).


About 60 seconds for 1.5 GB of lines :) – Andrej


@James: well done. A little explanation would be welcome, especially stressing that no dictionary is needed. –


Since the file is large, you should not manage the data with an in-memory dictionary. Start reading the source file and write the results directly to the destination file; all you really need is three variables:

one to hold the current pair, a second for the count, and a third for the lowest value. Whenever the pair changes, write the values to the output file and carry on.

This approach has a tiny memory footprint and can handle insanely large files. But of course it only works because your pairs are sorted.


groupby and generators are the way to go:

import csv 
from itertools import groupby 

def count_duplicate(it): 
    # group by first two fields 
    groups = groupby(it, lambda line: line[:2]) 
    # this will produce (key, group) pairs, where each group is an iterator 
    # over ['field0', 'field1', year] values whose field0 and field1 
    # strings are respectively equal 
    # the min_and_count function reduces such a group to a (count, min) pair 
    def min_and_count(group): 
     i, min_year = 0, 99999 
     for _, _, year in group: 
      i += 1 
      min_year = year if year < min_year else min_year 
     return (i, min_year) 

    yield from map(lambda x: x[0] + [min_and_count(x[1])], groups) 


with open("test.srt") as fp: 
    # this reads the lines in a lazy fashion and filter empty lines out 
    lines = filter(bool, csv.reader(fp, delimiter=' ')) 
    # convert the last value to integer (still in a lazy fashion) 
    lines = map(lambda line: [line[0], line[1], int(line[2])], lines) 
    # write result to another file 
    with open("result_file", "w") as rf: 
     for record in count_duplicate(lines): 
      rf.write(str(record) + '\n') 

Note: this is a Python 3.x solution, in which filter and map return iterators rather than list(s) as they do in Python 2.x.


TXR

@(repeat) 
@dleft @dright @num 
@ (collect :gap 0 :vars (dupenum)) 
@dleft @dright @dupenum 
@ (end) 
@ (output) 
@dleft @dright @(+ 1 (length dupenum)) @num 
@ (end) 
@(end) 

Run:

$ txr data.txr data 
D000001 D000001 2 1975 
D000001 D002413 3 1976 
D000001 D004298 1 1976 
D000002 D000002 1 1985 
D000003 D000900 2 1975 
D000003 D004134 2 1983 

AWK:

$1 != old1 || $2 != old2 { printf("%s", out); 
          count = 0 
          old1 = $1 
          old2 = $2 
          old3 = $3 } 

         { out = $1 " " $2 " " ++count " " old3 "\n" } 

END      { printf("%s", out); } 

$ awk -f data.awk data 
D000001 D000001 2 1975 
D000001 D002413 3 1976 
D000001 D004298 1 1976 
D000002 D000002 1 1985 
D000003 D000900 2 1975 
D000003 D004134 2 1983 

A functional TXR Lisp one-liner:

$ txr -t '[(opip (mapcar* (op split-str @1 " ")) 
       (partition-by [callf list first second]) 
       (mapcar* (aret `@[@1 0..2] @(+ 1 (length @rest)) @[@1 2]`))) 
      (get-lines)]' < data 
D000001 D000001 2 1975 
D000001 D002413 3 1976 
D000001 D004298 1 1976 
D000002 D000002 1 1985 
D000003 D000900 2 1975 
D000003 D004134 2 1983