2016-03-10 29 views
0

The function needs to be able to check for duplicates in each row and each column of a file.

Example:

A B C 
A A B 
B C A 

As you can see, row 2 contains two A's, and column 1 also contains two A's. Code:

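def duplication_char(dc):
    # Read the file and split every line into its whitespace-separated fields.
    with open(dc, "r") as duplicatechars:
        linecheck = duplicatechars.readlines()
    linecheck = [line.split() for line in linecheck]

    # A row has duplicates when its set is shorter than the row itself.
    for row in linecheck:
        if len(set(row)) != len(row):
            print("duplicates", " ".join(row))

    # zip(*linecheck) transposes the rows into columns.
    for column in zip(*linecheck):
        if len(set(column)) != len(column):
            print("duplicates", " ".join(column))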

Answers

4

Well, here is how I would do it.

First, read your file and build a 2D numpy array from its contents:

import numpy 
with open('test.txt', 'r') as fil: 
    lines = fil.readlines() 
lines = [line.strip().split() for line in lines] 
arr = numpy.array(lines) 

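Then check whether each row has duplicates using a set (a set cannot hold duplicates, so if the length of the set differs from the length of the row, the row contains duplicates):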

for row in arr:
    if len(set(row)) != len(row):
        print('Duplicates in row:', row)
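
To make the set test concrete, here is a minimal sketch applied to the rows of the example grid. The helper name `has_duplicates` is made up for illustration and is not part of the original answer.

```python
def has_duplicates(items):
    # A set keeps only distinct values, so it is shorter than
    # the input exactly when some value occurs more than once.
    return len(set(items)) != len(items)

rows = [["A", "B", "C"], ["A", "A", "B"], ["B", "C", "A"]]
flags = [has_duplicates(row) for row in rows]
print(flags)  # [False, True, False]
```

Only the second row is flagged, which matches the example in the question.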

Then check whether each column has duplicates, again using sets, by transposing your numpy array:

for col in arr.T:
    if len(set(col)) != len(col):
        print('Duplicates in column:', col)

If you wrap all of this in a function:

def check_for_duplicates(filename):
    import numpy
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    arr = numpy.array(lines)

    for row in arr:
        if len(set(row)) != len(row):
            print('Duplicates in row:', row)

    # arr.T is the transposed array, so iterating over it yields the columns.
    for col in arr.T:
        if len(set(col)) != len(col):
            print('Duplicates in column:', col)

As Apero suggested, you can also use zip (https://docs.python.org/3/library/functions.html#zip) to do this without numpy:

def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]

    for row in lines:
        if len(set(row)) != len(row):
            print('Duplicates in row:', row)

    # zip(*lines) groups the n-th field of every row, i.e. it yields the columns.
    for col in zip(*lines):
        if len(set(col)) != len(col):
            print('Duplicates in column:', col)

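On your example, the numpy version prints the following (the zip version shows a plain list for the row and a tuple for the column instead):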

# Duplicates in row: ['A' 'A' 'B'] 
# Duplicates in column: ['A' 'A' 'B'] 
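
If you would rather get the positions back than print the contents, a variant along the following lines might work. This is only a sketch: the name `find_duplicate_lines`, the 1-based numbering, and the temporary file used for the demonstration are assumptions for illustration, not part of the answer above.

```python
import os
import tempfile

def find_duplicate_lines(filename):
    """Return (rows, cols): 1-based indices of rows and columns with repeats."""
    with open(filename) as fil:
        # Skip blank lines so a trailing newline does not add an empty row.
        grid = [line.split() for line in fil if line.strip()]
    rows = [i for i, row in enumerate(grid, 1) if len(set(row)) != len(row)]
    # zip(*grid) yields the columns; it stops at the shortest row,
    # so this assumes every row has the same number of fields.
    cols = [j for j, col in enumerate(zip(*grid), 1) if len(set(col)) != len(col)]
    return rows, cols

# Write the example grid to a temporary file and check it.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("A B C\nA A B\nB C A\n")
result = find_duplicate_lines(tmp.name)
os.remove(tmp.name)
print(result)  # ([2], [1])
```

Returning the indices keeps the check separate from the reporting, so the caller can decide how to display or act on the result.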
'cols = zip(*rows)' is enough; there is no need for numpy here.

@Apero You are absolutely right. I edited my answer. Thanks.

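@JohnPal Check the zip documentation (https://docs.python.org/3/library/functions.html#zip). 'zip' aggregates the elements of the given iterables into tuples. For example, with 'x = [1,2,3]; y = [4,5,6]', 'list(zip(x,y))' returns '[(1,4),(2,5),(3,6)]' (in Python 3, 'zip' itself returns a lazy iterator, hence the 'list' call). To understand what '*lines' means, see this link (http://agiliq.com/blog/2012/06/understanding-args-and-kwargs/).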
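
As a rough illustration of the two points in this comment, the sketch below first pairs two lists with zip and then uses the star operator to transpose the grid from the question. The variable names `pairs` and `columns` are chosen here for illustration.

```python
x = [1, 2, 3]
y = [4, 5, 6]
# In Python 3, zip returns a lazy iterator, so wrap it in list() to see the tuples.
pairs = list(zip(x, y))
print(pairs)  # [(1, 4), (2, 5), (3, 6)]

lines = [["A", "B", "C"], ["A", "A", "B"], ["B", "C", "A"]]
# zip(*lines) is the same call as zip(lines[0], lines[1], lines[2]),
# so the n-th tuple collects the n-th field of every row: a column.
columns = list(zip(*lines))
print(columns)  # [('A', 'A', 'B'), ('B', 'A', 'C'), ('C', 'B', 'A')]
```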

1

You can build a list of lists and use zip to transpose it.

Given your example, try:

from collections import Counter

# fn is the path to the input file
with open(fn) as fin:
    data = [line.split() for line in fin]

rowdups = {}
coldups = {}
# Run the same counting pass over the rows, then over the columns (zip(*data)).
for d, m in ((rowdups, data), (coldups, zip(*data))):
    for i, sl in enumerate(m):
        count = Counter(sl)
        for c in count.most_common():
            if c[1] > 1:
                d.setdefault(i, []).append(c)

>>> rowdups 
{1: [('A', 2)]} 
>>> coldups 
{0: [('A', 2)]}
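
The same Counter idea could also be wrapped in a small function so that rows and columns share one code path. The sketch below works on an in-memory grid instead of a file, and the name `duplicate_counts` is hypothetical.

```python
from collections import Counter

def duplicate_counts(lines):
    """Map each 0-based index to the (value, count) pairs seen more than once."""
    dups = {}
    for i, line in enumerate(lines):
        repeated = [(v, n) for v, n in Counter(line).most_common() if n > 1]
        if repeated:
            dups[i] = repeated
    return dups

data = [["A", "B", "C"], ["A", "A", "B"], ["B", "C", "A"]]
rowdups = duplicate_counts(data)
coldups = duplicate_counts(zip(*data))  # zip(*data) yields the columns
print(rowdups, coldups)  # {1: [('A', 2)]} {0: [('A', 2)]}
```

Unlike the set comparison, this reports which value is repeated and how many times, at the cost of counting every element.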