Python - How to read specific lines in a text file?

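I have a huge text file (12 GB). The lines are tab-delimited and the first column contains an ID. I want to do something for each ID. My plan is therefore to start at the first line and read line by line until the next ID is reached.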

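import linecache

b = 2  # assumed starting value; the original snippet does not show how b is set
start_line = b
num_lines = 377763316

while b < num_lines:
    # Fetch the previous and the current line and split them on tabs
    plasmid1 = linecache.getline("Result.txt", b - 1).strip("\n").split("\t")
    plasmid2 = linecache.getline("Result.txt", b).strip("\n").split("\t")

    # The ID changes between the two lines, so a group ends here
    if plasmid1[0] != plasmid2[0]:
        end_line = b
        # do something

    b += 1  # assumed; the increment is not shown in the original snippet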

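The code works, but the problem is that linecache seems to reload the txt file on every call. If I don't improve the performance, the code will run for years.

I would appreciate your help if you have a good idea for solving this or know an alternative approach!

Thanks, Philipp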


2017-02-25 Philipp

The lines are tab-separated? That sounds like columns to me. – RuDevel

Please show all the code. What is 'linecache'? – eguaio

@eguaio: https://docs.python.org/3/library/linecache.html – cdarke

You should open the file only once and go through it line by line.

with open('Result.txt', 'r') as f:
    aline = f.next()  # Python 2 file method
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid

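You get the idea: this is just plain Python. Only one line is read per iteration. The extra 1 argument to split makes it split only at the first tab, which improves performance. No specialized library will give you better performance. Only a plain C implementation could beat this approach.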
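To illustrate the effect of the extra argument to split, here is a small sketch. The sample line and its column values are made up for the demonstration and do not come from the original file.

```python
# Hypothetical tab-delimited line; the real file's columns are not shown in the question.
line = "plasmid_A\t12\t0.98\tsome annotation\n"

# Without a limit, split cuts at every tab and builds one string per column.
all_fields = line.split('\t')

# With a limit of 1, split cuts only at the first tab; the rest stays in one piece.
first_id, rest = line.split('\t', 1)

print(len(all_fields))
print(first_id)
print(repr(rest))
```

When only the ID is needed, the second form avoids creating strings for columns that are never used.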

If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.X (see question io-textiowrapper-object). Try this version instead:

with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':  # readline() returns '' at end of file
        aline = f.readline()
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            # do stuff
            currentid = nextid
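As an alternative way to write the same single pass, the standard library's itertools.groupby can collect consecutive lines that share an ID. This is a different technique from the loops above, shown only as a sketch. It assumes that all lines for one ID are adjacent in the file and that a single group fits in memory. The names first_column and iter_id_groups, and the sample data, are made up for the illustration.

```python
import io
from itertools import groupby


def first_column(line):
    """Return the text before the first tab (the ID column)."""
    return line.split('\t', 1)[0]


def iter_id_groups(lines):
    """Yield (id, list_of_lines) for each run of consecutive lines with the same ID."""
    for current_id, group in groupby(lines, key=first_column):
        # list(group) keeps only the current group in memory, not the whole file
        yield current_id, list(group)


# Small in-memory stand-in for the real file; an open file object works the same way.
sample = io.StringIO("p1\ta\np1\tb\np2\tc\np3\td\np3\te\n")
for current_id, rows in iter_id_groups(sample):
    print(current_id, len(rows))
```

With a real file, the call would be iter_id_groups(f) inside a with open(...) block, so the file is still read only once.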


2017-02-25 18:21:28 eguaio

Thanks for your comment! I get the following error: AttributeError: '_io.TextIOWrapper' object has no attribute 'next'. Any ideas? – Philipp

That is a Python 2 vs 3 incompatibility. – eguaio

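I think numpy.loadtxt() is the way to go. It is also a good idea to pass the usecols argument to specify which columns of the file you actually need. NumPy is a solid, high-performance library.

After calling loadtxt() you get an ndarray.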


2017-02-25 18:21:42 Laszlowaty

You can use itertools:

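from itertools import takewhile

class EqualityChecker(object):
    """Predicate that is true while a line's first column equals the given id."""

    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        current_id = current_line.split('\t', 1)[0]
        return self.id == current_id


# ids and do_stuff are placeholders that the caller has to provide
with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        # Iterate over f directly; f.xreadlines() exists only in Python 2.
        # Note: takewhile also consumes the first line that fails the check.
        for line in takewhile(checker, f):
            do_stuff(line)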

In the outer loop, the id can actually be taken from the first line whose id does not match the previous value.
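One possible way to do that is sketched below. Because takewhile drops the first line that fails the predicate, the predicate here stores that line so it can start the next group. The helper name iter_groups and the sample data are hypothetical, and the sketch assumes that lines with the same id are adjacent.

```python
import io
from itertools import takewhile


def iter_groups(lines):
    """Yield (id, list_of_lines) per run of equal ids, without losing any line."""
    it = iter(lines)
    pending = next(it, None)  # first line of the upcoming group, or None at EOF
    while pending is not None:
        current_id = pending.split('\t', 1)[0]
        rejected = []

        def same_id(line):
            if line.split('\t', 1)[0] == current_id:
                return True
            rejected.append(line)  # keep the line that takewhile would discard
            return False

        group = [pending] + list(takewhile(same_id, it))
        yield current_id, group
        # The rejected line, if any, opens the next group
        pending = rejected[0] if rejected else None


# Small in-memory stand-in for the real file
sample = io.StringIO("a\t1\na\t2\nb\t3\nc\t4\n")
for current_id, rows in iter_groups(sample):
    print(current_id, len(rows))
```

This way no precomputed list of ids is needed, since each id is read from the line that ended the previous group.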


2017-02-25 18:41:19 mshrbkv
