2013-01-04 148 views
0

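Reading lines from a file using Python

I have a file with nearly 100,000 lines. I want to run a cleaning process on it (lowercasing, removing stop words, etc.), but it takes a long time.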

On a sample of 10,000 lines the script takes 15 minutes. For the whole file I expected about 150 minutes, yet it takes 5 hours.

At the start, the file is read with:

fileinput = open('tweets.txt', 'r') 

lines = fileinput.read().lower()  # lowercases the text, but loads the whole file into memory

for line in fileinput: 
    lines = line.lower() 

Question: is there a way to read the first 10,000 lines, clean them, then read the next block of lines, and so on?

+1

This might help: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

Answers

0

Change your script as follows:

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        # do what you need to do with each line
        line = line.lower()

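So, basically, don't read the whole file into memory with read(); just iterate over the lines of the open file. When you read a huge file into memory, your process may grow to the point where the system has to swap part of its memory out, which makes it very slow.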
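As a minimal sketch of that advice applied to the cleaning step described in the question: the snippet below streams the input line by line, lowercases each line, drops stop words, and writes the result straight to an output file. The `STOP_WORDS` set, the `clean_line` and `clean_file` helpers, and the whitespace tokenization are assumptions made for illustration; none of them come from the original post.

```python
# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to"}


def clean_line(line, stop_words=STOP_WORDS):
    """Lowercase a line and drop stop words (simple whitespace tokenization)."""
    words = line.lower().split()
    return " ".join(w for w in words if w not in stop_words)


def clean_file(src_path, dst_path):
    """Stream src_path line by line and write the cleaned lines to dst_path.

    The file is never loaded whole; returns the number of lines processed.
    """
    count = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(clean_line(line) + "\n")
            count += 1
    return count
```

Calling `clean_file('tweets.txt', 'tweets_clean.txt')` (the output name is made up here) would process the file from the question. Keeping the stop words in a set rather than a list makes each membership test cheap, which may matter when the check runs for every word of every line.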

+0

There is no reason to use '.readlines()'; you can iterate over the file object itself. – Amber

+0

@Amber Right, corrected. – piokuc

2

I would strongly recommend operating line by line rather than reading the whole file at once (in other words, don't use .read()).

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)

This automatically takes advantage of Python's built-in buffering capabilities.

+0

With this, I run the process once for every line. Could that take more time?

+0

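Depends on the process. In the average case, the time saved by file buffering will more than compensate for any extra time spent on additional function calls. – Amber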

1

Try working on the file one line at a time:

lowered = []

with open('tweets.txt', 'r') as handle:
    for line in handle:
        # keep accumulating the results ...
        lowered.append(line.lower())
        # ... or just dump them to stdout right away
        print(line)

for line in lowered:
    # print, write to a file, or do whatever you require
    pass

This way you reduce the memory overhead, which for large files could otherwise lead to swapping and kill performance.
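If you do want to work on blocks of lines, as the question asks, one possible sketch uses `itertools.islice` to pull a fixed number of lines from the open file at a time. The helper names `read_in_batches` and `lowercase_file_in_batches` are hypothetical, and the default batch size of 10,000 is only taken from the sample size mentioned in the question.

```python
from itertools import islice


def read_in_batches(path, batch_size=10000):
    """Yield lists of at most batch_size lines from the file at path."""
    with open(path, "r") as handle:
        while True:
            # islice consumes the next batch_size lines from the open file
            batch = list(islice(handle, batch_size))
            if not batch:
                break
            yield batch


def lowercase_file_in_batches(src_path, dst_path, batch_size=10000):
    """Lowercase src_path into dst_path one batch at a time.

    Returns the number of batches processed.
    """
    batches = 0
    with open(dst_path, "w") as sink:
        for batch in read_in_batches(src_path, batch_size):
            sink.writelines(line.lower() for line in batch)
            batches += 1
    return batches
```

Only one batch is held in memory at any moment. For plain per-line cleaning this is unlikely to beat simple line-by-line iteration, so batching is mainly useful when the cleaning step itself benefits from seeing several lines together.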

Here are some benchmarks on a file with about 1M lines:

# (1) real 0.223 user 0.195 sys 0.026 pcpu 98.71
with open('medium.txt') as handle:
    for line in handle:
        pass

# (2) real 0.295 user 0.262 sys 0.025 pcpu 97.21
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        pass
    print(i) # 1031124

# (3) real 21.561 user 5.072 sys 3.530 pcpu 39.89
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        print(line.lower())

# (4) real 1.702 user 1.605 sys 0.089 pcpu 99.50
lowered = []
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

# (5) real 2.307 user 1.983 sys 0.159 pcpu 92.89
lowered = []
with open('medium.txt', 'r') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

with open('lowered.txt', 'w') as handle:
    for line in lowered:
        handle.write(line)

You can also iterate over two files at once:

# (6) real 1.944 user 1.666 sys 0.115 pcpu 91.59
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for i, line in enumerate(src):
        sink.write(line.lower())

Results as a table (real time, fastest first):

# (1) noop                      0.223
# (2) w/ enumerate              0.295
# (4) list buffer               1.702
# (6) on-the-fly                1.944
# (5) r -> list buffer -> w     2.307
# (3) stdout print             21.561
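
The figures above are the answer author's own measurements. A rough sketch for running a similar comparison on your own machine could look like the following. It times each variant in-process with `time.perf_counter`, which is not necessarily how the numbers above were taken, and the helper names, the generated sample file, and its size are all assumptions for illustration.

```python
import os
import tempfile
import time


def make_sample(path, n_lines):
    """Write a synthetic test file with n_lines mixed-case lines."""
    with open(path, "w") as handle:
        for i in range(n_lines):
            handle.write("Sample Line NUMBER %d\n" % i)


def count_lines(path):
    """Roughly variant (2): iterate with enumerate and do nothing else."""
    count = 0
    with open(path) as handle:
        for count, _ in enumerate(handle, 1):
            pass
    return count


def list_buffer(path):
    """Roughly variant (4): collect the lowercased lines in a list."""
    lowered = []
    with open(path) as handle:
        for line in handle:
            lowered.append(line.lower())
    return len(lowered)


def on_the_fly(src_path, dst_path):
    """Roughly variant (6): write each lowercased line out immediately."""
    count = 0
    with open(src_path) as src, open(dst_path, "w") as sink:
        for line in src:
            sink.write(line.lower())
            count += 1
    return count


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    sample = os.path.join(workdir, "medium.txt")
    output = os.path.join(workdir, "lowered.txt")
    make_sample(sample, 100000)  # arbitrary size, smaller than the 1M above
    for name, func, args in [
        ("w/ enumerate", count_lines, (sample,)),
        ("list buffer", list_buffer, (sample,)),
        ("on-the-fly", on_the_fly, (sample, output)),
    ]:
        _, elapsed = timed(func, *args)
        print("%-14s %.3f s" % (name, elapsed))
```

A single timed call is noisy, so repeating each variant a few times and comparing the best runs would give a fairer picture.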
+0

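Even better would be to write out or print the lines as they are processed, so you don't have to buffer the whole list of processed lines in memory. – Amber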

+0

@Amber, yes, I added a note. – miku