Finding common names between files

I want to find the names that three text files have in common:

fileA:

13 abc 
123 def 
234 ghi 
1234 jkl 
12 mno

fileB:

12 abc 
12 def 
34 qwe 
43 rty 
45 mno

fileC:

12 abc 
34 sdg 
43 yui 
54 poi 
54 def

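I want to see which values in the second column match across the files. The code below works if the second column is already sorted. But if the second column is not sorted, how do I sort the second column and compare the files?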

fileA = open("A.txt",'r') 
fileB = open("B.txt",'r') 
fileC = open("C.txt",'r') 

listA1 = [] 
for line1 in fileA: 
    listA = line1.split('\t') 
    listA1.append(listA) 


listB1 = [] 
for line1 in fileB: 
    listB = line1.split('\t') 
    listB1.append(listB) 


listC1 = [] 
for line1 in fileC: 
    listC = line1.split('\t') 
    listC1.append(listC) 

for key1 in listA1:
    for key2 in listB1:
        for key3 in listC1:
            if key1[1] == key2[1] and key2[1] == key3[1] and key3[1] == key1[1]:
                print "Common between three files:",key1[1]

print "Common between file1 and file2 files:"
for key1 in listA1:
    for key2 in listB1:
        if key1[1] == key2[1]:
            print key1[1]

print "Common between file1 and file3 files:"
for key1 in listA1:
    for key2 in listC1:
        if key1[1] == key2[1]:
            print key1[1]

Source

2013-03-29 gthm

If you just want to sort A1, B1, and C1 by their second column, that's easy:

import operator
listA1.sort(key=operator.itemgetter(1))

If you don't understand itemgetter, this is equivalent:

listA1.sort(key=lambda element: element[1])
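As a quick check, here is a minimal sketch that applies both forms to the fileA rows from the question. The rows are hard-coded and shuffled here as an assumption, rather than read from A.txt, and the snippet uses print() as a function so it runs on Python 3:

```python
import operator

# Rows from fileA in the question, already split into [number, name] columns
# and deliberately placed out of order.
shuffled = [["12", "mno"], ["13", "abc"], ["1234", "jkl"], ["123", "def"], ["234", "ghi"]]

by_itemgetter = sorted(shuffled, key=operator.itemgetter(1))
by_lambda = sorted(shuffled, key=lambda element: element[1])

names = [row[1] for row in by_itemgetter]
print(by_itemgetter == by_lambda)
print(names)
```

sorted() returns a new list and leaves the input untouched, whereas list.sort() sorts in place; the key argument behaves the same way in both.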

However, I think a better solution is to use a set:

setA1 = set(element[1] for element in listA1) 
setB1 = set(element[1] for element in listB1) 
setC1 = set(element[1] for element in listC1)

Or, more simply, don't build the lists in the first place; do this:

setA1 = set()
for line1 in fileA:
    # Strip the newline first; otherwise it stays attached to the last column.
    listA = line1.rstrip('\n').split('\t')
    setA1.add(listA[1])

Either way:

print "Common between file1 and file2 files:"
for key in setA1 & setB1:
    print key

To simplify it further, you probably want to first refactor the repeated code into a function:

def read_file(path):
    with open(path) as f:
        result = set()
        for line in f:
            # Strip the newline so the last column compares cleanly.
            columns = line.rstrip('\n').split('\t')
            result.add(columns[1])
    return result

setA1 = read_file('A.txt') 
setB1 = read_file('B.txt') 
setC1 = read_file('C.txt')
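To see that the ordering of the second column does not matter with this approach, here is a hedged end-to-end sketch. It assumes tab-separated files, writes the question's sample rows (in shuffled order) to a temporary directory, and strips the trailing newline before splitting; the temporary files and the `samples` dictionary are illustrative only. It is written for Python 3.

```python
import os
import tempfile

def read_file(path):
    """Return the set of second-column values in a tab-separated file."""
    with open(path) as f:
        result = set()
        for line in f:
            columns = line.rstrip("\n").split("\t")
            result.add(columns[1])
    return result

# Sample rows from the question, deliberately written in unsorted order.
samples = {
    "A.txt": [("12", "mno"), ("13", "abc"), ("1234", "jkl"), ("123", "def"), ("234", "ghi")],
    "B.txt": [("45", "mno"), ("12", "def"), ("34", "qwe"), ("12", "abc"), ("43", "rty")],
    "C.txt": [("54", "def"), ("34", "sdg"), ("12", "abc"), ("43", "yui"), ("54", "poi")],
}

with tempfile.TemporaryDirectory() as tmp:
    sets = {}
    for name, rows in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.writelines("%s\t%s\n" % row for row in rows)
        sets[name] = read_file(path)

common_all = sets["A.txt"] & sets["B.txt"] & sets["C.txt"]
print(sorted(common_all))
```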

Then you can find further opportunities to simplify. For example:

import csv

def read_file(path):
    with open(path) as f:
        # The files are tab-separated, so don't let csv.reader split on commas.
        return set(row[1] for row in csv.reader(f, delimiter='\t'))

As Jon Clements pointed out, you don't actually need all three of them to be sets, just A1, so you could do this instead:

import csv

def read_file(path):
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            yield row[1]

setA1 = set(read_file('A.txt'))
iterB1 = read_file('B.txt')
iterC1 = read_file('C.txt')

The only other change you need is that you have to call intersection instead of using the & operator, so:

for key in setA1.intersection(iterB1):
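The difference between the method and the operator can be sketched as follows. The helper `second_column` and the in-memory line lists are hypothetical stand-ins for read_file and the real files; the snippet is written for Python 3.

```python
def second_column(lines):
    """Yield the second tab-separated column of each line (hypothetical helper)."""
    for line in lines:
        yield line.rstrip("\n").split("\t")[1]

linesA = ["13\tabc\n", "123\tdef\n", "12\tmno\n"]
linesB = ["12\tabc\n", "34\tqwe\n", "45\tmno\n"]
linesC = ["12\tabc\n", "54\tdef\n"]

setA1 = set(second_column(linesA))

# intersection() accepts any iterable, and more than one of them.
common_ab = setA1.intersection(second_column(linesB))
common_all = setA1.intersection(second_column(linesB), second_column(linesC))

# The & operator insists on a set on both sides.
try:
    setA1 & second_column(linesB)
    operator_works = True
except TypeError:
    operator_works = False

print(sorted(common_ab), sorted(common_all), operator_works)
```

One caveat with this design: a generator can be consumed only once, so each comparison needs a fresh one, which is why `second_column(linesB)` is called again for the three-way case.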

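I'm not sure this last change is actually an improvement. But in Python 3.3, where the only thing you need to do is change the return set(…) to yield from (…), I'd probably do it this way. (Even if the files were large and had lots of duplicates, so that there was a performance cost, I'd just wrap the read_file calls in the unique_everseen recipe from the itertools documentation.)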
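A minimal sketch of that combination, assuming Python 3.3 or later. The `unique_everseen` below is a simplified version of the itertools recipe (without its key argument), and `read_column` is a hypothetical variant of read_file that takes an open file object so the example can run on an in-memory sample:

```python
import csv
import io

def unique_everseen(iterable):
    """Yield each element the first time it appears (simplified recipe, no key)."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

def read_column(f):
    # Python 3.3+: delegate to the generator expression with `yield from`.
    yield from (row[1] for row in csv.reader(f, delimiter="\t"))

# In-memory stand-in for a tab-separated file with a duplicated name.
sample = io.StringIO("12\tabc\n12\tdef\n34\tabc\n45\tmno\n")
names = list(unique_everseen(read_column(sample)))
print(names)
```

The wrapper drops repeated names before they reach intersection, at the cost of keeping a set of the names seen so far.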

Source

2013-03-29 20:22:55 abarnert

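Or... have 'A1' and 'A2' as generators, materialize the smallest one as a 'set', then use its 'intersection' method and keep the other generators as generators...

@JonClements: Yes, A2 and A3 can each just be a '(row[1] for row in csv.reader(f))'; only A1 needs to be an explicit 'set'. – abarnert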
