Reading lines from a file with Python

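I have a file with nearly 100,000 lines. I want to run a cleaning process on it (lowercasing, removing stop words, etc.), but it takes a long time.

For example, for 10,000 lines the script needs 15 minutes. For the whole file I expected it to need 150 minutes. However, it takes 5 hours.

At the start, the file is read using: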

fileinput = open('tweets.txt', 'r')

lines = fileinput.read().lower()  # lowercases, but loads the whole file into memory

for line in fileinput:
    lines = line.lower()

Question: Is there a way to read the first 10,000 lines, run the cleaning process on them, then read the next block of lines, and so on?

Source

2013-01-04 Joe Kalvos

This may help: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

Change your script as follows:

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        # do what you need to do with each line
        line = line.lower()

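So, basically, don't read the whole file into memory with read(); just iterate over the lines of the open file. When you read a huge file into memory, your process may grow to the point where the system has to swap parts of memory out to disk, which makes it very slow.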
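As a rough sketch of what that per-line loop could look like for the cleaning step in the question (lowercasing and stop-word removal): the `STOP_WORDS` set and the `clean_line` and `clean_file` helpers are assumptions made for illustration, not part of the original script.

```python
# Illustrative sketch: stream a file line by line, clean each line,
# and write the result out immediately instead of holding it in memory.

# Hypothetical stop-word list; a set keeps membership tests fast.
STOP_WORDS = {"a", "an", "the", "is", "to", "of"}


def clean_line(line):
    """Lowercase a line and drop stop words (simple whitespace tokenization)."""
    words = line.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)


def clean_file(src_path, dst_path):
    """Clean src_path line by line into dst_path; return the number of lines."""
    count = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # the file object yields one line at a time
            dst.write(clean_line(line) + "\n")
            count += 1
    return count
```

Calling `clean_file('tweets.txt', 'tweets_clean.txt')` (the output name is hypothetical) would process the file without ever holding more than one line of text in the program's own variables.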

Source

2013-01-04 09:59:13 piokuc

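There's no reason to use '.readlines()'; you can iterate over the file object itself. – Amber

@Amber right, corrected – piokuc

I would strongly suggest operating line by line rather than reading the entire file at once (in other words, don't use .read()).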

with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)

This will automatically take advantage of Python's built-in buffering capabilities.
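The question asks about reading in blocks of 10,000 lines. Plain line-by-line iteration is usually enough, but if block-wise processing is still wanted, one possible sketch uses `itertools.islice` on the open file. The helper name `read_in_blocks` and its default block size are assumptions for illustration.

```python
from itertools import islice


def read_in_blocks(path, block_size=10000):
    """Yield lists of up to block_size lines from the file at path."""
    with open(path, "r") as handle:
        while True:
            # islice pulls at most block_size lines from the current position
            block = list(islice(handle, block_size))
            if not block:
                break  # end of file reached
            yield block


# Possible usage (file name taken from the question):
# for block in read_in_blocks('tweets.txt'):
#     cleaned = [line.lower() for line in block]
```

Only one block is held in memory at a time, so memory use depends on `block_size` rather than on the size of the file.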

Source

2013-01-04 09:59:21 Amber

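With this, I run the process for each line. Could this take more time?

Depends on the process. In the average case, any additional time for the extra function calls will be more than compensated for by the time saved using file buffering. – Amber

Try working on the file one line at a time: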

lowered = []

with open('tweets.txt', 'r') as handle:
    for line in handle:
        # keep accumulating the results ...
        lowered.append(line.lower())
        # or just dump them to stdout right away
        print(line)

for line in lowered:
    # print or write to file or whatever you require
    pass

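This way, you reduce the memory overhead, which for large files could otherwise lead to swapping and kill performance.

Here are some benchmarks on a file with about 1M lines: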

# (1) real 0.223 user 0.195 sys 0.026 pcpu 98.71
with open('medium.txt') as handle:
    for line in handle:
        pass

# (2) real 0.295 user 0.262 sys 0.025 pcpu 97.21
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        pass
    print(i) # 1031124

# (3) real 21.561 user 5.072 sys 3.530 pcpu 39.89
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        print(line.lower())

# (4) real 1.702 user 1.605 sys 0.089 pcpu 99.50
lowered = []
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

# (5) real 2.307 user 1.983 sys 0.159 pcpu 92.89
lowered = []
with open('medium.txt', 'r') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

with open('lowered.txt', 'w') as handle:
    for line in lowered:
        handle.write(line)

You can also iterate over two files at once:

# (6) real 1.944 user 1.666 sys 0.115 pcpu 91.59
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for i, line in enumerate(src):
        sink.write(line.lower())

The results as a table (real time, in seconds):

# (1) noop                     0.223
# (2) w/ enumerate             0.295
# (4) list buffer              1.702
# (6) on-the-fly               1.944
# (5) r -> list buffer -> w    2.307
# (3) stdout print            21.561
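The real/user/sys figures above look like output from an external timing tool, though the thread does not say which one. As a rough in-process alternative, a sketch along the following lines could be used to compare variants on your own file. It measures wall-clock time only, and the helper names `time_call` and `lower_on_the_fly` are assumptions for illustration.

```python
import time


def time_call(func, *args):
    """Return (elapsed_seconds, result) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


def lower_on_the_fly(src_path, dst_path):
    """Same idea as variant (6): lowercase each line while copying."""
    count = 0
    with open(src_path, "r") as src, open(dst_path, "w") as sink:
        for line in src:
            sink.write(line.lower())
            count += 1
    return count


# Possible usage (file names taken from the benchmark above):
# elapsed, n_lines = time_call(lower_on_the_fly, 'medium.txt', 'lowered.txt')
# print(n_lines, 'lines in', round(elapsed, 3), 'seconds')
```

A single run is noisy, so repeating each measurement a few times gives a fairer comparison.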

Source

2013-01-04 09:59:39 miku

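A better approach is to write out or print the lines as they are processed, so you don't have to buffer the entire list of processed lines in memory. – Amber

@Amber, yes, I added a note. – miku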
