2013-03-29

Find the common entries between files

I have three text files:

fileA:

13 abc 
123 def 
234 ghi 
1234 jkl 
12 mno 

fileB:

12 abc 
12 def 
34 qwe 
43 rty 
45 mno 

fileC:

12 abc 
34 sdg 
43 yui 
54 poi 
54 def 

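I want to see which values in the second column match between the files. The following code works if the second column is already sorted. But if the second column is not sorted, how do I sort the second column and compare the files?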

fileA = open("A.txt",'r') 
fileB = open("B.txt",'r') 
fileC = open("C.txt",'r') 

listA1 = [] 
for line1 in fileA: 
    listA = line1.split('\t') 
    listA1.append(listA) 


listB1 = [] 
for line1 in fileB: 
    listB = line1.split('\t') 
    listB1.append(listB) 


listC1 = [] 
for line1 in fileC: 
    listC = line1.split('\t') 
    listC1.append(listC) 

for key1 in listA1:
    for key2 in listB1:
        for key3 in listC1:
            if key1[1] == key2[1] and key2[1] == key3[1] and key3[1] == key1[1]:
                print "Common between three files:",key1[1]

print "Common between file1 and file2 files:"
for key1 in listA1:
    for key2 in listB1:
        if key1[1] == key2[1]:
            print key1[1]

print "Common between file1 and file3 files:"
for key1 in listA1:
    for key2 in listC1:
        if key1[1] == key2[1]:
            print key1[1]

Answer

3

If you just want to sort listA1, listB1, and listC1 by the second column, that's easy:

import operator

listA1.sort(key=operator.itemgetter(1))

If you don't understand itemgetter, this is the same:

listA1.sort(key=lambda element: element[1]) 
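
As a quick check that the two key functions behave the same, here is a minimal sketch. It is written for Python 3 (where print is a function), and the `rows` list is hypothetical sample data shaped like the lists built from A.txt, not something read from the asker's files.

```python
import operator

# Hypothetical rows shaped like the [number, name] lists built from A.txt.
rows = [["13", "abc"], ["123", "def"], ["12", "mno"], ["1234", "jkl"], ["234", "ghi"]]

# Sort by the second column in two equivalent ways.
by_itemgetter = sorted(rows, key=operator.itemgetter(1))
by_lambda = sorted(rows, key=lambda element: element[1])

print(by_itemgetter == by_lambda)
print([row[1] for row in by_itemgetter])
```

`sorted` returns a new list, while `list.sort` sorts in place; either accepts the same `key` argument.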

However, I think a better solution is to just use a set:

setA1 = set(element[1] for element in listA1) 
setB1 = set(element[1] for element in listB1) 
setC1 = set(element[1] for element in listC1) 

Or, more simply, don't build the lists in the first place; do this:

setA1 = set() 
for line1 in fileA: 
    listA = line1.split('\t') 
    setA1.add(listA[1]) 

Either way:

print "Common between file1 and file2 files:"
for key in setA1 & setB1:
    print key
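
To see what the set approach gives for the question's sample data, here is a small Python 3 sketch. The three sets are typed in by hand from the second columns shown in the question (the names `set_a`, `set_b`, `set_c` are illustrative), so no files are needed and no sorting is involved.

```python
# Second-column values copied by hand from the question's three sample files.
set_a = {"abc", "def", "ghi", "jkl", "mno"}
set_b = {"abc", "def", "qwe", "rty", "mno"}
set_c = {"abc", "sdg", "yui", "poi", "def"}

# The & operator computes the intersection; it chains for three sets.
common_all = set_a & set_b & set_c
common_ab = set_a & set_b
common_ac = set_a & set_c

print("Common between three files:", sorted(common_all))
print("Common between file1 and file2:", sorted(common_ab))
print("Common between file1 and file3:", sorted(common_ac))
```

Sets are unordered, so the sketch sorts each result only to make the printed output stable.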

To simplify it further, you probably want to refactor the repeated stuff into a function first:

def read_file(path):
    with open(path) as f:
        result = set()
        for line in f:
            columns = line.split('\t')
            result.add(columns[1])
    return result

setA1 = read_file('A.txt') 
setB1 = read_file('B.txt') 
setC1 = read_file('C.txt') 

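Then you can find further opportunities. For example:

import csv

def read_file(path):
    with open(path) as f:
        # The columns are tab-separated, so csv.reader needs an explicit delimiter.
        return set(row[1] for row in csv.reader(f, delimiter='\t'))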
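
One detail worth checking with `csv.reader` is the delimiter: it defaults to a comma, so for tab-separated columns (which the `split('\t')` calls in the question suggest) it has to be passed explicitly. The Python 3 sketch below assumes tab-separated input and uses an in-memory `io.StringIO` in place of a real file; `read_second_column` is a hypothetical helper name.

```python
import csv
import io


def read_second_column(f):
    """Return the set of second-column values from a tab-separated file object."""
    reader = csv.reader(f, delimiter="\t")
    # Skip rows that have no second column (for example, blank lines).
    return set(row[1].strip() for row in reader if len(row) > 1)


# In-memory stand-in for a file such as A.txt, with tab-separated columns.
sample = io.StringIO("13\tabc\n123\tdef\n\n234\tghi\n")
values = read_second_column(sample)
print(sorted(values))
```

The `strip()` call is an extra precaution against trailing spaces in the data; it is not part of the answer's version.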

As Jon Clements pointed out, you don't really need all three of them to be sets, just A1, so you could do this instead:

import csv

def read_file(path):
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            yield row[1]

setA1 = set(read_file('A.txt'))
iterB1 = read_file('B.txt')
iterC1 = read_file('C.txt')

The only other change you need is that you have to call intersection instead of using the & operator, so:

for key in setA1.intersection(iterB1):
    print key

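I'm not sure this last change is actually an improvement. But in Python 3.3, where the only thing you need to do is change the return set(…) into yield from (…), I'd probably do it. (Even if the files were huge and had tons of duplicates, so that there was a performance cost, I'd just stick the itertools recipe unique_everseen around the read_file call.)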
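
A rough Python 3.3+ sketch of that last idea follows. `unique_everseen` is not importable from `itertools`; it is a recipe from the itertools documentation, so a simplified copy (without the `key` argument) is defined here. `read_column` is a hypothetical name, the input is assumed to be tab-separated, and `io.StringIO` objects stand in for the real files.

```python
import csv
import io
from itertools import filterfalse


def unique_everseen(iterable):
    """Yield each element once, in first-seen order (simplified itertools recipe)."""
    seen = set()
    for element in filterfalse(seen.__contains__, iterable):
        seen.add(element)
        yield element


def read_column(f):
    """Lazily yield second-column values from a tab-separated file object."""
    yield from (row[1] for row in csv.reader(f, delimiter="\t"))


# In-memory stand-ins for A.txt and B.txt; B contains a duplicate "abc".
file_a = io.StringIO("13\tabc\n123\tdef\n12\tmno\n")
file_b = io.StringIO("12\tabc\n12\tdef\n34\tabc\n45\tmno\n")

# Only the first file is materialized as a set; the second stays a lazy iterator.
set_a = set(read_column(file_a))
common = set_a.intersection(unique_everseen(read_column(file_b)))
print(sorted(common))
```

`set.intersection` accepts any iterable, which is why the second file never has to be turned into a set, whereas the `&` operator requires both operands to be sets.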

+1

Or... have `A1` and `A2` as generators, materialize the smallest one with `set`, then use its `intersection` method, and keep the other generators as generators... –

+0

@JonClements: Yes, A2 and A3 can each just be a `(row[1] for row in csv.reader(f))`; only A1 needs to be an explicit `set`. – abarnert