Size-limited text chunks split on newlines

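I have a string in Python that holds the contents of a large text file (over 1 MiB). I need to split it into chunks.

Constraints:

chunks may only be split at newline characters, and
len(chunk) must be as large as possible but less than LIMIT (i.e. 100 KiB)

Lines longer than LIMIT can be ignored.

Any ideas how to implement this nicely in Python?

Thank you in advance.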

Source

2017-03-31 Michał Šrajer

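Do you want to split it into new files? – RomanPerekhrest

No time to write it out, but the best solution is probably to jump ahead by LIMIT, work backwards until you find a newline, add a chunk, then jump ahead by LIMIT again from there, and repeat. – Linuxios

Here is my not-so-Pythonic solution: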

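def line_chunks(lines, chunk_limit):
    """Group lines into lists whose newline-joined length is below chunk_limit."""
    chunks = []
    chunk = []
    chunk_len = 0
    for line in lines:
        # Count the newline that joins this line to the rest of the chunk.
        added = len(line) + (1 if chunk else 0)
        if chunk_len + added < chunk_limit:
            chunk.append(line)
            chunk_len += added
        else:
            if chunk:  # do not emit an empty chunk before an over-long first line
                chunks.append(chunk)
            # A line that is itself too long still ends up as its own chunk.
            chunk = [line]
            chunk_len = len(line)
    if chunk:
        chunks.append(chunk)
    return chunks

# `data` is the input string; 150 is a small limit chosen for demonstration.
chunks = line_chunks(data.split('\n'), 150)
print('\n---new-chunk---\n'.join('\n'.join(chunk) for chunk in chunks))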
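
The same greedy idea can also be sketched as a generator that keeps each line's trailing newline, so every chunk is already a plain string. This is only an illustration under a few assumptions: the name `iter_line_chunks` is made up here, the limit is treated as strict (a chunk must be shorter than `limit`), and lines that can never fit are silently dropped, as the question allows.

```python
def iter_line_chunks(text, limit):
    """Yield chunks shorter than `limit` characters, split only after '\n'."""
    parts = text.split("\n")
    # Re-attach the newline to every line except a possibly unterminated last one.
    lines = [part + "\n" for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])

    chunk = []
    chunk_len = 0
    for line in lines:
        if len(line) >= limit:
            continue  # assumption: a line that can never fit is dropped
        if chunk_len + len(line) >= limit:
            # Adding this line would reach the limit, so emit what we have.
            yield "".join(chunk)
            chunk = []
            chunk_len = 0
        chunk.append(line)
        chunk_len += len(line)
    if chunk:
        yield "".join(chunk)


sample = "aa\nbbb\nc\ndddddddddd\nee"
sample_chunks = list(iter_line_chunks(sample, 8))
```

Because nothing is dropped when every line fits, joining the chunks of such an input gives back the original text.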

Source

2017-03-31 21:46:29

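Following Linuxios's suggestion, you can use rfind to find the last newline within the limit and split at that point. If no newline is found, the chunk is too big and can be discarded.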

chunks = []

not_chunked_text = input_text

while not_chunked_text:
    if len(not_chunked_text) <= LIMIT:
        chunks.append(not_chunked_text)
        break
    split_index = not_chunked_text.rfind("\n", 0, LIMIT)
    if split_index == -1:
        # The chunk is too big, so everything until the next newline is deleted
        try:
            not_chunked_text = not_chunked_text.split("\n", 1)[1]
        except IndexError:
            # No "\n" in not_chunked_text, i.e. the end of the input text was reached
            break
    else:
        chunks.append(not_chunked_text[:split_index+1])
        not_chunked_text = not_chunked_text[split_index+1:]

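rfind("\n", 0, LIMIT) returns the highest index at which a newline is found within the bounds of your limit.
not_chunked_text[:split_index+1] is needed so that the newline is included in the chunk.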
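
To make the two points above concrete, here is a tiny illustration on a made-up eight-character string (the variable names are only for this example):

```python
text = "ab\ncd\nef"  # newlines sit at indices 2 and 5

# The search window text[0:6] still contains the newline at index 5.
assert text.rfind("\n", 0, 6) == 5
# Shrinking the window to text[0:5] excludes it, so the earlier newline wins.
assert text.rfind("\n", 0, 5) == 2
# No newline inside text[0:2]: rfind reports that with -1.
assert text.rfind("\n", 0, 2) == -1

split_index = text.rfind("\n", 0, 6)
first_chunk = text[:split_index + 1]  # the +1 keeps the newline in the chunk
rest = text[split_index + 1:]         # what is left for the next iteration
```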

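I interpreted LIMIT as the maximum allowed chunk length. If a chunk of length exactly LIMIT should not be allowed, you have to add -1 after LIMIT in this code.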
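
One way to make that boundary choice explicit is sketched below. It is a variant, not the code above: it tracks a start offset instead of re-slicing the remaining text on every pass, and the function name `split_on_newlines` and its `inclusive` flag are hypothetical names introduced for this example.

```python
def split_on_newlines(text, limit, inclusive=True):
    """Split text after newlines into chunks of bounded length.

    With inclusive=True a chunk may be exactly `limit` characters long;
    with inclusive=False it must be strictly shorter.
    """
    max_len = limit if inclusive else limit - 1
    if max_len < 1:
        raise ValueError("limit leaves no room for any chunk")
    chunks = []
    start = 0
    while start < len(text):
        if len(text) - start <= max_len:
            chunks.append(text[start:])  # the remainder fits in one chunk
            break
        # Last newline that keeps the chunk within max_len characters.
        split_index = text.rfind("\n", start, start + max_len)
        if split_index == -1:
            # Line too long for any chunk: skip past its terminating newline.
            next_newline = text.find("\n", start)
            if next_newline == -1:
                break  # over-long final line, nothing left to keep
            start = next_newline + 1
        else:
            chunks.append(text[start:split_index + 1])
            start = split_index + 1
    return chunks


inclusive_chunks = split_on_newlines("ab\ncd\nef", 6)
exclusive_chunks = split_on_newlines("ab\ncd\nef", 6, inclusive=False)
```

With the sample string, the inclusive call may emit a chunk of exactly 6 characters, while the exclusive call has to cut one line earlier.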

Source

2017-03-31 22:04:22 BurningKarl
