2016-10-22 76 views
-1

好的,所以我試圖從文本文件中找到最長的鏈,其中一行的最後一個單詞是下一個詞的第一個單詞(適用於詩歌)。我所需要的Python腳本運行良好,但仍需要很長時間。我不是編碼專家,完全不知道優化。我是否需要更多選擇? 如何縮短運行較長文本所需的時間?行最後一個詞的最長鏈/下一個第一個詞

#!/usr/bin/python 
# -*- coding: utf-8 -*- 
import re 
import sys 

# Opening the source text 
with open("/text.txt") as g: 
    all_lines = g.readlines() 

def last_word(particular_line): 
    if particular_line != "\n": 
     particular_line = re.sub(ur'^\W*|\W*$', "",particular_line) 
     if len(particular_line) > 1: 
      return particular_line.rsplit(None, 1)[-1].lower() 

def first_word(particular_line): 
    if particular_line != "\n": 
     particular_line = re.sub(ur'^\W*|\W*$', "",particular_line) 
     if len(particular_line) > 1: 
      return particular_line.split(None, 1)[0].lower() 

def chain(start, lines, depth): 
    remaining = list(lines) 
    del remaining[remaining.index(start)] 
    possibles = [x for x in remaining if (len(x.split()) > 2) and (first_word(x) == last_word(start))] 
    maxchain = [] 
    for c in possibles: 
     l = chain(c, remaining, depth) 
     sys.stdout.flush() 
     sys.stdout.write(str(depth) + " of " + str(len(all_lines)) + " \r") 
     sys.stdout.flush() 
     if len(l) > len(maxchain): 
      maxchain = l 
      depth = str(depth) + "." + str(len(maxchain)) 
    return [start] + maxchain 

#Start 
final_output = [] 

#Finding the longest chain 

for i in range (0, len(all_lines)): 
    x = chain(all_lines[i], all_lines, i) 
    if len(x) > 2: 
     final_output.append(x) 
final_output.sort(key = len) 

#Output on screen 
print "\n\n--------------------------------------------" 

if len(final_output) > 1: 
    print final_output[-1] 
else: 
    print "Nothing found" 
+1

你能提供這樣的行的例子嗎? –

回答

1
import itertools 
def matching_lines(line_pair): 
    return line_pair[0].split()[-1].lower() == line_pair[1].split()[0].lower() 

line_pairs = ((line,next_line) for line,next_line in itertools.izip(all_lines,all_lines[1:])) 
grouped_pairs = itertools.groupby(line_pairs,matching_lines) 
print max([len(list(y))+1 for x,y in grouped_pairs if x]) 

雖然我不確定它會更快(但我認爲這將是因爲它僅遍歷一次,並大都採用建宏)

+1

不錯的解決方案。在Python 3.5中,因爲y是一個生成器,所以它需要是'len(list(y))'。總數也是1短。 –

+0

哈哈感謝:P ... –

0

是的,這個代碼具有的複雜性$ O(N^2)$。這意味着如果你的文件有n行,那麼你的代碼執行的迭代次數爲1 *(n-1)第一行,然後1 *(n-2)第二行等n個這樣的元素。對於大n,這相當於$ n^2 $。其實,有一個代碼在這一行

del remaining[remaining.index(start)] 

,你可能是指運行這個錯誤:

del remaining[:remaining.index(start)] 

(注意:在方括號「」),它擴展了運行時(現(n-1)+(n-1)+(n-1)= n *(n-1) -3)..)。
您可以優化代碼,如下所示:從maxchainlen = 0開始,curchainlen = 0。現在,遍歷行,每次比較當前行的第一個單詞和上一行的最後一個單詞。如果它們匹配,則將curchainlen加1。如果它們不匹配,則檢查是否存在curchainlen,如果是,則分配maxchainlen = curchainlen,並將curchainlen初始化爲0.在完成對行的迭代後,再次對maxchainlen進行檢查。例如:

lw = last_word(lines[0]) 
curchainlen = 0 
maxchainlen = 0 
for l in lines[2:]: 
    if lw = first_word(l): 
     curchainlen = curchainlen + 1 
    else: 
     maxchainlen = max(maxchainlen, curchainlen) 
     curchainlen = 0 
maxchainlen = max(maxchainlen, curchainlen) 
print(maxchainlen) 
+0

新代碼的複雜性爲O(n)。所以對於長文件,它將大大提高性能。 – galra

0

我想嘗試拆分這個工作分爲兩個階段:第一找到鏈,然後對它們進行比較。這將簡化代碼很多。由於鏈是文件中所有行的一小部分,因此首先找到它們然後對它們進行排序會比試圖在一個大的過程中處理整個事件更快。

如果您使用python yield關鍵字,該問題的第一部分更容易,該關鍵字與return類似,但不會結束函數。這使您可以一次一行地循環播放內容,並且無需在整個內存中始終保存整個內容。

下面是一次抓取一行文件的基本方法。它採用yield拔出鏈,因爲它發現他們

def get_chains(*lines): 
    # these hold the last token and the 
    # members of this chain 
    previous = None 
    accum = [] 

    # walk through the lines, 
    # seeing if they can be added to the existing chain in `accum` 
    for each_line in lines: 
     # split the line into words, ignoring case & whitespace at the ends 
     pieces = each_line.lower().strip().split(" ") 
     if pieces[0] == previous: 
      # match? add to accum 
      accum.append(each_line) 
     else: 
      # no match? yield our chain 
      # if it is not empty 
      if accum: 
       yield accum 
       accum = [] 
     # update our idea of the last, and try the next line 
     previous = pieces[-1] 

    # at the end of the file we need to kick out anything 
    # still in the accumulator 
    if accum: 
     yield accum 

當你給這個函數線的字符串,它會yield出鏈,如果發現它們,然後繼續。 呼叫這個功能可以捕獲成功的鏈,並與他們做事。

一旦你有鏈子,很容易按長度排序並選擇最長的。由於Python具有內置列表排序功能,只需收集一行行長 - >行對並對其進行排序即可。最長的一行將是最後一項:

def longest_chain(filename): 

    with open (filename, 'rt') as file_handle: 
     # if you loop over an open file, you'll get 
     # back the lines in the file one at a time 
     incoming_chains = get_chains(*file_handle) 

     # collect the results into a list, keyed by lengths 
     all_chains = [(len(chain), chain) for chain in incoming_chains] 
     if all_chains: 
      all_chains.sort() 
      length, lines = all_chains[-1] 
      # found the longest chain 
      return "\n".join(lines) 
     else: 
      # for some reason there are no chains of connected lines 
      return [] 
相關問題