2013-10-10 47 views
0

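Python: a faster way to read and create files

I have a file with many fields separated by the "|" (pipe) character. I want to read this file and create as many files as there are distinct values of a particular field. Here is an example: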

L219| |791|P|PIPPO|PLUTO|1|18081926|I262|XYZXCV12D35F345S|| 
L219| |1241|P|PAPERINO|TOPOLINO|2|21041937|F335|FVGHWU54G56S456U|| 
L219| |437793|G|TOPOLANDIA SAS|L219|12345678910| 
L219| |437794|G|PAPERANDIA|L219|10987654321| 

If the fourth field equals "G", the record goes into "file_pg.txt"; otherwise, if it equals "P", it goes into "file_pf.txt".

I wrote the code below (I am new to Python), but it takes a very long time to run on a large file (300 MB). Do you have any suggestions for improving it?

file = open('D:\\mydirectory\\soggetti.txt','r')
file_pf = open("D:\\mydirectory\\file_pf.txt","w")
file_pg = open("D:\\mydirectory\\file_pg.txt","w")
file_pf.close()
file_pg.close()

i = 0
with file:
    for line in file:
        i = 0
        c = 0
        while i < len(line):
            carattere = line[i]
            if carattere == "|":
                c = c + 1
                if c == 4:
                    if line[i-1] == "P":
                        file_pf = open("D:\\mydirectory\\file_pf.txt","a")
                        file_pf.write(line)
                        file_pf.close()
                        break
                    elif line[i-1] == "G":
                        file_pg = open("D:\\mydirectory\\file_pg.txt","a")
                        file_pg.write(line)
                        file_pg.close()
                        break
            i = i + 1
file.close()

Thanks!

Alberto

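'line.split("|")[3]' should give you "P" or "G" for each line. Opening and closing the output file for every write is also very expensive. Open the files once at the start and close them at the end. If you are worried about exceptions, use the 'closing' context manager. – PaulMcG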
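
To illustrate the first point of the comment, here is a minimal sketch that applies split to two of the sample records from the question. The names sample and record_types are made up for this illustration.

```python
# Two records copied from the question's sample data.
sample = [
    "L219| |791|P|PIPPO|PLUTO|1|18081926|I262|XYZXCV12D35F345S||\n",
    "L219| |437793|G|TOPOLANDIA SAS|L219|12345678910|\n",
]

# split("|") returns the list of fields; index 3 is the fourth field.
record_types = [line.split("|")[3] for line in sample]
print(record_types)  # ['P', 'G']
```

This replaces the character-by-character loop that counts pipes with a single call per line.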

Answers

0

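Opening and closing files is relatively slow. If possible, open and close each file only once. In your case, you can store the P and G lines in lists and write them all out at once after the loop finishes.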

file = open('D:\\mydirectory\\soggetti.txt','r')
file_pf = open("D:\\mydirectory\\file_pf.txt","w")
file_pg = open("D:\\mydirectory\\file_pg.txt","w")
file_pf.close()
file_pg.close()


p_lines = []
g_lines = []
i = 0
with file:
    for line in file:
        i = 0
        c = 0
        while i < len(line):
            carattere = line[i]
            if carattere == "|":
                c = c + 1
                if c == 4:
                    if line[i-1] == "P":
                        p_lines.append(line)
                        break
                    elif line[i-1] == "G":
                        g_lines.append(line)
                        break
            i = i + 1
file.close()

file_pf = open("D:\\mydirectory\\file_pf.txt","w")
file_pf.writelines(p_lines)
file_pf.close()

file_pg = open("D:\\mydirectory\\file_pg.txt","w")
file_pg.writelines(g_lines)
file_pg.close()

You can also identify the contents of the fields in each line more easily by using split.

file = open('D:\\mydirectory\\soggetti.txt','r')
file_pf = open("D:\\mydirectory\\file_pf.txt","w")
file_pg = open("D:\\mydirectory\\file_pg.txt","w")
file_pf.close()
file_pg.close()


p_lines = []
g_lines = []
with file:
    for line in file:
        fields = line.split("|")
        if fields[3] == "P":
            p_lines.append(line)
        elif fields[3] == "G":
            g_lines.append(line)
file.close()

file_pf = open("D:\\mydirectory\\file_pf.txt","w")
file_pf.writelines(p_lines)
file_pf.close()

file_pg = open("D:\\mydirectory\\file_pg.txt","w")
file_pg.writelines(g_lines)
file_pg.close()

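By the way, strictly speaking, you don't need both with and an explicit close() for the file once you are done with it; either one is enough. There is also no need to open and immediately close file_pf and file_pg at the start of the script.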

p_lines = []
g_lines = []
with open('D:\\mydirectory\\soggetti.txt','r') as file:
    for line in file:
        fields = line.split("|")
        if fields[3] == "P":
            p_lines.append(line)
        elif fields[3] == "G":
            g_lines.append(line)

file_pf = open("D:\\mydirectory\\file_pf.txt","w")
file_pf.writelines(p_lines)
file_pf.close()

file_pg = open("D:\\mydirectory\\file_pg.txt","w")
file_pg.writelines(g_lines)
file_pg.close()

If you expect to have more line types than "P" and "G" in the future, it may save you some time to store the lines of each type in a dictionary:

from collections import defaultdict
lines_to_write = defaultdict(list)
with open('D:\\mydirectory\\soggetti.txt','r') as file:
    for line in file:
        fields = line.split("|")
        lineType = fields[3].lower()
        lines_to_write[lineType].append(line)

for lineType, lines in lines_to_write.items():
    # Produces file_pf.txt for "p" lines and file_gf.txt for "g" lines.
    filename = "D:\\mydirectory\\file_{}f.txt".format(lineType)
    with open(filename,"w") as file:
        file.writelines(lines)
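
One thing to keep in mind with the list and dictionary approaches is that every line of the 300 MB input stays in memory until the end. A possible variation, sketched here under assumptions, keeps one output file open per line type and writes each line as soon as it is read. The function name split_by_field, its parameters, and the file_<type>.txt naming scheme are all hypothetical, and contextlib.ExitStack requires Python 3.3 or later.

```python
import os
from contextlib import ExitStack


def split_by_field(source_path, out_dir, field_index=3):
    """Stream source_path, writing each line to a file chosen by one field.

    Returns a dict mapping each (lowercased) field value to the number of
    lines written for it.
    """
    handles = {}  # field value -> open output file
    counts = {}   # field value -> number of lines written
    # ExitStack closes every output file it was given, even on error.
    with ExitStack() as stack, open(source_path, "r") as source:
        for line in source:
            fields = line.split("|")
            if len(fields) <= field_index:
                continue  # assumption: silently skip lines that are too short
            key = fields[field_index].lower()
            if key not in handles:
                # Hypothetical naming scheme: file_<type>.txt
                path = os.path.join(out_dir, "file_{}.txt".format(key))
                handles[key] = stack.enter_context(open(path, "w"))
                counts[key] = 0
            handles[key].write(line)
            counts[key] += 1
    return counts
```

Each output file is still opened and closed exactly once, so the main cost identified above is avoided. The sketch assumes the field values are safe to use in a file name.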

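You can report to the user how many lines have been processed by keeping track of the current line number and printing a message periodically.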

how_often_to_report = 100 # prints a message every one hundred lines
with open('D:\\mydirectory\\soggetti.txt','r') as file:
    for line_number, line in enumerate(file):
        if line_number % how_often_to_report == 0:
            print("{} lines processed".format(line_number))
        # do the rest of the processing work here

Is it possible to add a counter so I can see how many records have been processed while the procedure runs? – user2867049

Yes, you can use 'enumerate' to keep track of the current line number and so determine how many records have been processed. Edited. – Kevin

0
P = empty list
G = empty list
for each line read from the file:
    split the line on |
    if splitted_line[index] is equal to P:
        add line to P
    elif splitted_line[index] is equal to G:
        add line to G
open file for P
write all lines in P
close file for P
open file for G
write all lines in G
close file for G
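
The pseudocode above could be turned into Python roughly as follows. This is only a sketch: the function name partition_lines and the in-memory io.StringIO objects standing in for the real files are assumptions made so that the example runs anywhere.

```python
import io


def partition_lines(lines, index=3):
    """Return (p_lines, g_lines), chosen by the field at position `index`."""
    p_lines, g_lines = [], []
    for line in lines:
        fields = line.split("|")
        if len(fields) <= index:
            continue  # assumption (not in the pseudocode): skip short lines
        if fields[index] == "P":
            p_lines.append(line)
        elif fields[index] == "G":
            g_lines.append(line)
    return p_lines, g_lines


# In-memory stand-ins for the source file and the two output files.
source = io.StringIO(
    "L219| |791|P|PIPPO|PLUTO|\n"
    "L219| |437793|G|TOPOLANDIA SAS|\n"
)
out_p, out_g = io.StringIO(), io.StringIO()

p_lines, g_lines = partition_lines(source)
out_p.writelines(p_lines)  # "write all lines in P"
out_g.writelines(g_lines)  # "write all lines in G"
```

With real files, the StringIO objects would be replaced by open(...) calls inside with blocks.
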
1

I would go with:

with open('D:\\mydirectory\\soggetti.txt','r') as source_file:
    with open("D:\\mydirectory\\file_pf.txt","w") as file_pf:
        with open("D:\\mydirectory\\file_pg.txt","w") as file_pg:

            for line in source_file:
                if line.split("|")[3] == "P":
                    file_pf.write(line)
                elif line.split("|")[3] == "G":
                    file_pg.write(line)

If speed is what you care about, it may be better to do:

with open('D:\\mydirectory\\soggetti.txt','r') as source_file:
    listP = []
    listG = []
    for line in source_file:
        char = line.split("|")[3]
        # Only collect the lines here; the output files are opened below.
        if char == "P":
            listP.append(line)
        elif char == "G":
            listG.append(line)

with open("D:\\mydirectory\\file_pf.txt","w") as file_pf:
    for line in listP:
        file_pf.write(line)

with open("D:\\mydirectory\\file_pg.txt","w") as file_pg:
    for line in listG:
        file_pg.write(line)
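
Which of the two approaches is actually faster depends on the machine and the data, so it may be worth measuring rather than guessing. The following is a rough timing sketch, not a rigorous benchmark: the function names, the synthetic records, and the temporary files are all assumptions made for illustration.

```python
import os
import tempfile
import time


def write_while_reading(source_path, pf_path, pg_path):
    """Write each line to its output file as soon as it is read."""
    with open(source_path) as source, \
            open(pf_path, "w") as file_pf, \
            open(pg_path, "w") as file_pg:
        for line in source:
            record_type = line.split("|")[3]
            if record_type == "P":
                file_pf.write(line)
            elif record_type == "G":
                file_pg.write(line)


def collect_then_write(source_path, pf_path, pg_path):
    """Collect the lines in two lists, then write each list at the end."""
    list_p, list_g = [], []
    with open(source_path) as source:
        for line in source:
            record_type = line.split("|")[3]
            if record_type == "P":
                list_p.append(line)
            elif record_type == "G":
                list_g.append(line)
    with open(pf_path, "w") as file_pf:
        file_pf.writelines(list_p)
    with open(pg_path, "w") as file_pg:
        file_pg.writelines(list_g)


def timed(func, *args):
    """Return the wall-clock seconds taken by func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def compare(n_lines=100000):
    """Time both approaches on synthetic data; return {name: seconds}."""
    with tempfile.TemporaryDirectory() as tmp:
        source_path = os.path.join(tmp, "source.txt")
        with open(source_path, "w") as source:
            for i in range(n_lines):
                record_type = "P" if i % 2 == 0 else "G"
                source.write("L219| |{}|{}|NAME|\n".format(i, record_type))
        results = {}
        for func in (write_while_reading, collect_then_write):
            pf_path = os.path.join(tmp, func.__name__ + "_pf.txt")
            pg_path = os.path.join(tmp, func.__name__ + "_pg.txt")
            results[func.__name__] = timed(func, source_path, pf_path, pg_path)
        return results


if __name__ == "__main__":
    for name, seconds in compare().items():
        print("{}: {:.3f} s".format(name, seconds))
```

A single run like this is noisy; repeating it several times and on a file of realistic size would give a more trustworthy comparison.
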
0

I haven't tested this, but something like the following should be faster:

file = open('D:\\mydirectory\\soggetti.txt','r')
file_pf = open("D:\\mydirectory\\file_pf.txt","a")
file_pg = open("D:\\mydirectory\\file_pg.txt","a")

for line in file:
    bits = line.split("|")
    if bits[3] == "P":
        file_pf.write(line)
    if bits[3] == "G":
        file_pg.write(line)


file.close()
file_pf.close()
file_pg.close()
0

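The code below should be faster than what you are doing, because:

  1. You are not looping over every character.
  2. You don't have to open the file for every write.
  3. There are fewer if conditions to evaluate.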

file = open('D:\\mydirectory\\soggetti.txt','r')
file_pf = open("D:\\mydirectory\\file_pf.txt","w")
file_pg = open("D:\\mydirectory\\file_pg.txt","w")
file_pf.close()
file_pg.close()


file_pf = open("D:\\mydirectory\\file_pf.txt","a")
file_pg = open("D:\\mydirectory\\file_pg.txt","a")
with file:
    for line in file:
        switch = line.split('|')[3]
        # Any line whose fourth field does not contain 'P' goes to file_pg.
        write = file_pf.write if 'P' in switch else file_pg.write
        write(line)

file_pg.close()
file_pf.close()
file.close()

I believe you need to leave out the parentheses in your 'write = ...' line, otherwise 'write' won't refer to the function object you want. – Kevin
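
The difference the comment points at can be seen in a small sketch. It uses an in-memory io.StringIO buffer in place of a real file; the names buffer, write and result are made up for this illustration.

```python
import io

buffer = io.StringIO()

# No parentheses: `write` is bound to the method itself and can be called later.
write = buffer.write
write("first line\n")

# With parentheses the method is called immediately, and the name holds its
# return value (the number of characters written), not the method.
result = buffer.write("second line\n")

print(callable(write))   # True
print(callable(result))  # False
```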