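zip() alternative for iterating through two iterables

I have two large (~100 GB) text files that must be iterated over simultaneously.

zip works well for smaller files, but I found that it actually builds a list of lines from my two files, which means every line is stored in memory. I don't need to do anything with the lines more than once.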

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
    pass

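Is there an alternative to zip() that acts as a generator, letting me iterate through these two files without using >200 GB of RAM?

Source

2010-02-24 Austin Richardson

...actually, I know of one way, but it doesn't seem pythonic: while line1: line1 = handle1.readline(); line2 = handle2.readline(); do something with line1 and line2... –

Speaking of memory-constrained environments, you may find this interesting: http://neopythonic.blogspot.com/2008/10/sorting-million-32-bit-integers-in-2mb.html –

itertools has a function, izip, that does exactly this: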

from itertools import izip 
for i, j in izip(handle1, handle2): 
    ...

If the files are of different sizes, you can use izip_longest, since izip will stop at the end of the smaller file.

Source

2010-02-24 03:11:49
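As a rough illustration of the lazy behaviour described above, here is a minimal sketch. It assumes Python 3, where the built-in zip is already lazy (itertools.izip exists only in Python 2). The helper counting_lines and the in-memory io.StringIO inputs are hypothetical stand-ins for the real files, used only to make the number of lines read observable.

```python
import io


def counting_lines(handle, counter):
    """Yield lines from handle, recording in counter[0] how many were read."""
    for line in handle:
        counter[0] += 1
        yield line


# Small in-memory stand-ins for the two large files (an assumption for the demo).
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\nb2\nb3\n")
read_a = [0]

pairs = zip(counting_lines(file_a, read_a), file_b)
lines_read_before = read_a[0]   # building the zip object reads nothing
first_pair = next(pairs)        # pulls exactly one line from each input
lines_read_after = read_a[0]
```

Only the current pair of lines is held at any time, so memory use should not grow with file size.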

-1

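Something like this? Wordy, but it seems to be what you're asking for.

It can be adapted to do things like a proper merge that matches keys between the two files, which is often what is needed rather than a naive zip function. Also, this does not truncate, which is how the SQL OUTER JOIN algorithm behaves; that differs from what zip does and is more typical for files.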

def parallel(file1, file2):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None  # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1 = next(file1)
            except StopIteration:
                if1_more = False
        if if2_more:
            try:
                line2 = next(file2)
            except StopIteration:
                if2_more = False
        if if1_more or if2_more:  # skip the empty pair once both files end
            yield line1, line2

with open("file1", "r") as file1:
    with open("file2", "r") as file2:
        for line1, line2 in parallel(file1, file2):
            pass  # process line1 and line2 here
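The key-matching variant mentioned in this answer could look something like the sketch below. It is not the answerer's code. It assumes Python 3, that both inputs are sorted ascending by key, that keys are unique within each file, and that the key is the text before the first tab. The name merge_by_key and the line format are hypothetical.

```python
import io


def merge_by_key(file1, file2, key=lambda line: line.split("\t", 1)[0]):
    """Outer-join two key-sorted line iterables.

    Yields (line1, line2); a side is None when the other input has
    no line with that key. Assumes unique, ascending keys per input.
    """
    it1, it2 = iter(file1), iter(file2)
    line1, line2 = next(it1, None), next(it2, None)
    while line1 is not None or line2 is not None:
        if line2 is None or (line1 is not None and key(line1) < key(line2)):
            # Key exists only in file1 (or file2 is exhausted).
            yield line1, None
            line1 = next(it1, None)
        elif line1 is None or key(line2) < key(line1):
            # Key exists only in file2 (or file1 is exhausted).
            yield None, line2
            line2 = next(it2, None)
        else:
            # Same key on both sides: emit the matched pair.
            yield line1, line2
            line1, line2 = next(it1, None), next(it2, None)


left = io.StringIO("a\t1\nb\t2\nd\t4\n")
right = io.StringIO("b\tx\nc\ty\nd\tz\n")
joined = list(merge_by_key(left, right))
```

Like parallel, this reads one line at a time from each input, so it keeps at most one pending line per file in memory.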

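Source

2010-02-24 03:13:27

Don't you mean 'while if1_more or if2_more:'? Why wrap file1 and file2 in iters, when files are already iters? Lastly, is this just an academic "if I had to do it myself, how would I do it?" exercise? Surely we would rather use izip or izip_longest from the itertools module in the std library than write 20 lines of homemade code that does the same thing but must be maintained and supported (and debugged!). – PaulMcG

@Paul McGuire: Yes, OR is correct. The explicit iter is needed to use next and get a proper StopIteration exception at EOF. No, this isn't "academic". It is an answer to the question. The question is vague, and itertools may not provide the required functionality. It may not, but this can be adapted. –

I'm running Py2.5.4, and calling 'next()' on a file object at end of file raises StopIteration for me. – PaulMcG

If you want to truncate to the shortest file: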

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

try:
    while True:
        i = next(handle1)
        j = next(handle2)

        # do something with i and j
        # write to an output file

except StopIteration:
    pass  # one of the files ran out of lines

finally:
    handle1.close()
    handle2.close()

Otherwise:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

i_ended = False
j_ended = False
while True:
    try:
        i = next(handle1)
    except StopIteration:
        i_ended = True
        i = None  # filea has no more lines
    try:
        j = next(handle2)
    except StopIteration:
        j_ended = True
        j = None  # fileb has no more lines
    if i_ended and j_ended:
        break

    # do something with i and j (either may be None)
    # write to an output file

handle1.close()
handle2.close()

Or:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

while True:
    i = handle1.readline()  # returns '' at end of file
    j = handle2.readline()

    if not i and not j:
        break  # both files are exhausted

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()

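Source

2010-02-24 03:15:14 voyager

What if the two files have different lengths? This will truncate at the shorter one. Hopefully that is the desired behavior. –

@S.Lott: Isn't that what 'zip' does? – voyager

@S.Lott - It only breaks out of the while-forever loop when both i_ended and j_ended are true, so it reads until the end of the longer file. But there is certainly room for improvement. If one file is much shorter than the other, the current code will call .next() and catch StopIteration *many* times when we already know that file has ended. A simple fix: 'if not i_ended: try: i = handle1.next() ...' (just as you do in your 'if if1_more:' code). (Ah! I see your comment was a response to the original code, not the edited version. My apologies!) – PaulMcG

You can use izip_longest like this to pad the shorter file with empty lines.

In Python 2.6: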

from itertools import izip_longest
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

Or in Python 3.1:

from itertools import zip_longest  # izip_longest was renamed in Python 3
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...

Source

2010-02-24 03:38:16

+1 for 'with' - I like the Py3.1 syntax for keeping the indentation level down. – PaulMcG

For Python 3, izip_longest is actually zip_longest.

from itertools import zip_longest

for i, j in zip_longest(handle1, handle2):
    ...
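Putting the pieces together, a complete Python 3 version might look like the sketch below. The function name interleave_files and the choice to simply write each pair of lines one after the other are assumptions made for illustration; the question does not say what processing is actually needed. The temporary files stand in for the real 100 GB inputs.

```python
import os
import tempfile
from itertools import zip_longest


def interleave_files(path_a, path_b, path_out):
    """Write lines of path_a and path_b alternately to path_out.

    Returns the number of line pairs processed. The shorter file is
    padded with empty strings, so no lines from the longer one are lost.
    """
    pairs = 0
    with open(path_a) as a, open(path_b) as b, open(path_out, "w") as out:
        for i, j in zip_longest(a, b, fillvalue=""):
            out.write(i)  # placeholder for "do something with i and j"
            out.write(j)
            pairs += 1
    return pairs


# Tiny demonstration with temporary files of different lengths.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "filea")
    path_b = os.path.join(tmp, "fileb")
    path_out = os.path.join(tmp, "out")
    with open(path_a, "w") as f:
        f.write("a1\na2\na3\n")
    with open(path_b, "w") as f:
        f.write("b1\n")
    pair_count = interleave_files(path_a, path_b, path_out)
    with open(path_out) as f:
        merged = f.read()
```

Opening all three files in a single with statement ensures they are closed even if the processing raises partway through.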

Source

2017-10-19 10:47:17 Flippym
