2017-05-24 105 views
0

How to read a specific chunk of a huge file

I have a huge file of about 2 GB with data like this:

>TRINITY_DN19211_c0_g1_i1 len=332 path=[619:0-331] [-1, 619, -2] 
GTCCAAGTATTACACACCGTATGATGAAGCTAACGGTGAATTTTCAAAATGTGTGAAGTT 
TGAGAATGGGTTGCGCCCTGAGATCAAACAGGCGATTGGATACCAGAGGATTCGAAGGTT 
TTCGGAGTTGGTAGACTGCTGCAGGATCTTTGAAGAGGATTCCAGAGCAAGGTCAACTCA 
>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2] 
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG 
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT 
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG 
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA 
>TRINITY_DN35855_c0_g1_i1 len=782 path=[760:0-781] [-1, 760, -2] 
CAGGTTTAACTTTAACACCTCCGACCCTGCCTCTAAATTCCTGCACAGAAATTTGGCTTC 
ACAATTAGGACATGTTTGGATAAACAGTTTAATGAAGCACTTTTTTTCATAAATTCTGGT 
ATCTGGCTATAAGACCTAATAATCTGGGGATCTGTTTCATCATCCACGAAGGGAGCCCAA 
>TRINITY_DN67801_c0_g1_i1 len=420 path=[398:0-419] [-1, 398, -2] 
GTACAGAAGGAGATGAACCAGAACTTTGCCTATCTCTACAATCATCTCCTTATCCCTCCT 
TATGACCCAGAGAATCCGGCTGCTCCTATTCCTCCCGTTGTGTCACTACAAATTATGCCT 
>TRINITY_DN52435_c0_g1_i1 len=209 path=[187:0-208] [-1, 187, -2] 
TGGTCAAACTTGTATGAGTTCTAAACTCCTTGGGTTTTCTGCTAAGCGAAAGCCGCTTGT 
ACTTTAGCTTCTGTTTAGTTAGATAGCACCACCTCATAAGCGCAGTTCTGTTTTGAGGTT 

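I want to write code that returns a chunk starting at, say, line 5 and ending when a line containing the character ">" is encountered. The output should look like this. I want to extract many such chunks: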

>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2] 
    ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG 
    TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT 
    GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG 
    TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA 

What is the best way to do this? Thanks in advance.

Answers

1

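It's unclear when you want the chunk to end: when ">" appears anywhere in a line, or only at the beginning of a line. Here I'm assuming the chunk ends at the next line that starts with ">":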

chunk = []
with open("your_large_file.ext", "r") as f:
    for _ in range(4):  # skip the first 4 lines (use xrange() on Python 2.x)
        next(f)
    for line in f:
        # stop at the next ">" header once we're already collecting a chunk
        if chunk and line.startswith(">"):
            break
        chunk.append(line)
print("".join(chunk))  # or whatever you want to do with it

>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2] 
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG 
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT 
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG 
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA 
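
Since the question mentions extracting many such chunks, the loop above can be generalized to collect several chunks in a single pass. This is only a sketch, assuming the 0-based start line numbers are known in advance; `extract_chunks` and its parameters are hypothetical names, not part of the answer above.

```python
def extract_chunks(lines, start_lines):
    """Collect one chunk per 0-based line number in start_lines.

    lines is any iterable of text lines, e.g. an open file object.
    Each chunk runs from its start line up to (not including) the next
    line that begins with ">" or the next requested start line.
    """
    wanted = set(start_lines)
    chunks = []
    current = None  # lines of the chunk being collected, or None
    for i, line in enumerate(lines):
        starts_new = i in wanted
        # close the chunk in progress at a new header or a new start line
        if current is not None and (starts_new or line.startswith(">")):
            chunks.append("".join(current))
            current = None
        if starts_new:
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:  # chunk that runs to the end of the input
        chunks.append("".join(current))
    return chunks
```

Passing an open file, for example `extract_chunks(f, [4, 14])` inside a `with open(...) as f:` block, reads the file line by line, so the 2 GB file is never held in memory as a whole.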
0

Here is another possible solution to the same problem:

def get_chunk():
    # file1.txt in my case, where I have mocked your data
    # note: this reads the whole file into memory at once
    with open("file1.txt") as f:
        full_str = f.read()

    # split on ">" and put the marker back at the start of each chunk
    chunks = [">" + x for x in full_str.split(">")[1:]]
    print(chunks[0])
    # use chunks for your need

get_chunk()

Output:

>TRINITY_DN19211_c0_g1_i1 len=332 path=[619:0-331] [-1, 619, -2] 
    GTCCAAGTATTACACACCGTATGATGAAGCTAACGGTGAATTTTCAAAATGTGTGAAGTT 
    TGAGAATGGGTTGCGCCCTGAGATCAAACAGGCGATTGGATACCAGAGGATTCGAAGGTT 
    TTCGGAGTTGGTAGACTGCTGCAGGATCTTTGAAGAGGATTCCAGAGCAAGGTCAACTCA 
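
For a file of around 2 GB, the same split on ">" could also be done lazily, one record at a time, rather than on the full file contents. The generator below is a minimal sketch of that idea, not the answerer's code; `iter_records` is a hypothetical name, and it assumes every record begins with a header line that starts with ">".

```python
from itertools import islice


def iter_records(lines):
    """Yield records one at a time from an iterable of text lines.

    A record is a header line starting with ">" plus the lines that
    follow it, up to the next header. Any lines before the first header
    are yielded together as a first, header-less record.
    """
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:  # last record in the input
        yield "".join(record)


def nth_record(lines, n):
    """Return the n-th record (0-based), or None if there are fewer."""
    return next(islice(iter_records(lines), n, None), None)
```

Because a file object is itself an iterable of lines, `nth_record(f, 1)` inside a `with open("file1.txt") as f:` block would return the second record while reading only as far into the file as needed.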
1

If you know the line your data starts at, you can use this function:

def extract_chunk(start_line):
    """
    start_line is the line number where your data starts, counting from 0
    """
    lines = []
    with open("data.txt") as f:
        for i, line in enumerate(f):
            if i < start_line:
                continue  # skip everything before the chunk, including earlier ">" lines
            if i > start_line and line.startswith(">"):
                break  # the next header ends the chunk
            lines.append(line)
    return "".join(lines)
+0

Using `"\n"` in the `join()` call will result in double newlines, because each `line` from the file already ends with `"\n"`. – zwer

+0

Yes, my mistake. Do you need your lines prefixed with the spaces shown in your example? – genericname

+0

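In my example I join the whole chunk with an empty string, so it comes out exactly as it is in the file, because I have a feeling that's what the OP is aiming for. – zwer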
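
The point made in these comments can be checked with a tiny sketch. The two sample lines are made up for illustration and stand in for lines read from a file, which keep their trailing newline.

```python
# Lines read from a file keep their trailing "\n".
lines = ["ATAG\n", "TGTG\n"]

# Joining with an empty string reproduces the text as it is in the file.
joined_empty = "".join(lines)

# Joining with "\n" adds a second newline between the lines.
joined_newline = "\n".join(lines)

# Stripping the line endings first makes "\n".join() safe again,
# at the cost of dropping the final newline.
joined_stripped = "\n".join(line.rstrip("\n") for line in lines)
```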

0
start_ln = 4
chunk = []
with open("data.txt", buffering=2**12) as f:  # a larger buffer may help read speed
    for i, ln in enumerate(f):
        if i < start_ln:
            continue  # not at the chunk yet
        if i > start_ln and ln.startswith(">"):
            break  # the next header ends the chunk
        chunk.append(ln)