2014-07-25
6

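Print all possible phrases (consecutive combinations of words) in a given string

I'm trying to print the phrases in a given text. I want to be able to print every phrase in the text, from 2 words up to the maximum number of words the text's length allows. I've written the program below, which prints all phrases up to 5 words long, but I can't find a more elegant way to print all possible phrases.

My definition of a phrase = consecutive words in a string, regardless of meaning.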

def phrase_builder(i):
    phrase_length = 4
    phrase_list = []
    for x in range(0, len(i)-phrase_length):
        phrase_list.append(str(i[x]) + " " + str(i[x+1]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]) + " " + str(i[x+4]))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))

The output is:

['the big', 'the big fat', 'the big fat cat', 'the big fat cat sits', 
'big fat', 'big fat cat', 'big fat cat sits', 'big fat cat sits on', 
'fat cat', 'fat cat sits', 'fat cat sits on', 'fat cat sits on the', 
'cat sits', 'cat sits on', 'cat sits on the', 'cat sits on the mat', 
'sits on', 'sits on the', 'sits on the mat', 'sits on the mat eating', 
'on the', 'on the mat', 'on the mat eating', 'on the mat eating a', 
'the mat', 'the mat eating', 'the mat eating a', 'the mat eating a rat'] 

I'd like to be able to print phrases such as "the big fat cat sits on the mat eating" and "fat cat sits on the mat eating a rat".

Can anyone offer some suggestions?

+2

Don't you also want phrases like 'eating a rat'? – TheSoundDefense

+0

@TheSoundDefense Good point. Yes, I do. – MLadbrook

Answers

11

Just use itertools.combinations:

from itertools import combinations 
text = "the big fat cat sits on the mat eating a rat" 
lst = text.split() 
for start, end in combinations(range(len(lst)), 2): 
    print(lst[start:end+1])

Output:

['the', 'big'] 
['the', 'big', 'fat'] 
['the', 'big', 'fat', 'cat'] 
['the', 'big', 'fat', 'cat', 'sits'] 
['the', 'big', 'fat', 'cat', 'sits', 'on'] 
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the'] 
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat'] 
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating'] 
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a'] 
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat'] 
['big', 'fat'] 
['big', 'fat', 'cat'] 
['big', 'fat', 'cat', 'sits'] 
['big', 'fat', 'cat', 'sits', 'on'] 
['big', 'fat', 'cat', 'sits', 'on', 'the'] 
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat'] 
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating'] 
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a'] 
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat'] 
['fat', 'cat'] 
['fat', 'cat', 'sits'] 
['fat', 'cat', 'sits', 'on'] 
['fat', 'cat', 'sits', 'on', 'the'] 
['fat', 'cat', 'sits', 'on', 'the', 'mat'] 
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating'] 
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a'] 
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat'] 
['cat', 'sits'] 
['cat', 'sits', 'on'] 
['cat', 'sits', 'on', 'the'] 
['cat', 'sits', 'on', 'the', 'mat'] 
['cat', 'sits', 'on', 'the', 'mat', 'eating'] 
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a'] 
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat'] 
['sits', 'on'] 
['sits', 'on', 'the'] 
['sits', 'on', 'the', 'mat'] 
['sits', 'on', 'the', 'mat', 'eating'] 
['sits', 'on', 'the', 'mat', 'eating', 'a'] 
['sits', 'on', 'the', 'mat', 'eating', 'a', 'rat'] 
['on', 'the'] 
['on', 'the', 'mat'] 
['on', 'the', 'mat', 'eating'] 
['on', 'the', 'mat', 'eating', 'a'] 
['on', 'the', 'mat', 'eating', 'a', 'rat'] 
['the', 'mat'] 
['the', 'mat', 'eating'] 
['the', 'mat', 'eating', 'a'] 
['the', 'mat', 'eating', 'a', 'rat'] 
['mat', 'eating'] 
['mat', 'eating', 'a'] 
['mat', 'eating', 'a', 'rat'] 
['eating', 'a'] 
['eating', 'a', 'rat'] 
['a', 'rat'] 
+0

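@dvm I don't think there is a maximum length; it's just that the OP only wrote out phrases of up to 5 consecutive words. –

+0

@KirkStrauser My mistake! Really nice approach, by the way! – denisvm

+0

@KirkStrauser: The fact that he created a variable for the maximum phrase length (even if he never ended up using it) makes me think there may be a maximum. – abarnert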
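
If a cap on phrase length is wanted, as the comments above discuss, the index pairs produced by combinations can be filtered by how many words they span. A minimal sketch in Python 3; the function name phrases_up_to and the max_len parameter are illustrative and do not come from the answer:

```python
from itertools import combinations

def phrases_up_to(text, max_len=None):
    """Return contiguous phrases of 2..max_len words (no cap if max_len is None)."""
    words = text.split()
    result = []
    # Each pair (start, end) with start < end marks the first and last word of a phrase.
    for start, end in combinations(range(len(words)), 2):
        length = end - start + 1
        if max_len is None or length <= max_len:
            result.append(" ".join(words[start:end + 1]))
    return result

print(phrases_up_to("the big fat cat", max_len=3))
```

This still walks every index pair and discards the long ones, so it trades a little wasted work for staying close to the original one-liner.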

2

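First, you need to figure out how to write all four of those lines in the same way. Instead of manually concatenating words and spaces, use the join method:

phrase_list.append(" ".join(str(i[x+y]) for y in range(2)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(3)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(4)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(5)))

If the comprehension inside the join call is unclear, here is how to write it out by hand: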

phrase = [] 
for y in range(2): 
    phrase.append(str(i[x+y])) 
phrase_list.append(" ".join(phrase)) 

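Once you've done that, replacing those four lines with a loop is simple:

# Lengths 2 through 5 when phrase_length is 4, matching the four lines above.
for length in range(2, phrase_length + 2):
    phrase_list.append(" ".join(str(i[x+y]) for y in range(length)))

You can simplify it in a few other, independent ways.

First, i[x+y] for y in range(length) can be done more easily with a slice: i[x:x+length]

And I'm guessing i is already a list of strings, so you can get rid of the str calls.

Also, range starts from 0 by default, so you can leave that argument out.

While we're at it, the code is much easier to think about if you use meaningful variable names (such as words instead of i).

So: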

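def phrase_builder(words):
    phrase_length = 4
    phrase_list = []
    for i in range(len(words) - phrase_length):
        # Keep the loop over lengths (2 through 5) from the previous step.
        for length in range(2, phrase_length + 2):
            phrase_list.append(" ".join(words[i:i+length]))
    return phrase_list

Now that your loops are this simple, you can turn them into a comprehension and make the whole thing a one-liner:

def phrase_builder(words):
    phrase_length = 4
    return [" ".join(words[i:i+length])
            for i in range(len(words) - phrase_length)
            for length in range(2, phrase_length + 2)]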

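One last thing: as @TheSoundDefense asked, are you sure you don't want "eating a rat"? It starts fewer than 5 words from the end, but it is a 3-word phrase in the text.

If you do need that, just remove the - phrase_length part.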
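
Dropping - phrase_length on its own has a side effect worth knowing about: a slice that runs past the end of the list is silently truncated, so the last few start positions would yield one-word and repeated phrases. A hedged sketch of one way around that, capping the requested length at the number of words remaining (Python 3; max_words is a hypothetical parameter, set to 5 here to match the longest phrase in the question):

```python
def phrase_builder(words, max_words=5):
    """Phrases of 2..max_words words, including those near the end of the text."""
    phrases = []
    for i in range(len(words)):
        # Never ask for more words than remain from position i onward.
        longest = min(max_words, len(words) - i)
        for length in range(2, longest + 1):
            phrases.append(" ".join(words[i:i + length]))
    return phrases

phrases = phrase_builder("the big fat cat sits on the mat eating a rat".split())
print(phrases[-3:])
```

With this variant the final entries are the short phrases at the tail of the text, including "eating a rat".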

1

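You need a systematic way to enumerate every possible phrase.

One way is to start from each word and generate all possible phrases that begin with that word.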

def phrase_builder(my_words):
    phrases = []
    for i, word in enumerate(my_words):
        phrases.append(word)
        for nextword in my_words[i+1:]:
            phrases.append(phrases[-1] + " " + nextword)
        # Remove the one-word phrase.
        phrases.remove(word)
    return phrases

text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))
+0

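This works, except that it includes phrases of every length, not just lengths from '2' through 'phrase_length'. Hopefully he can see how to adapt it himself, so +1. – abarnert

+0

@abarnert, good point. I'll update it. – merlin2011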
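
The adaptation discussed in these comments might look like the following sketch (Python 3). It keeps the idea of starting from each word but stops extending at a maximum length; the max_words parameter is an assumption made for illustration and is not part of the answer:

```python
def phrase_builder(my_words, max_words=5):
    """Build phrases word by word from each start position, up to max_words."""
    phrases = []
    for i in range(len(my_words)):
        current = my_words[i]
        # Extend the running phrase one word at a time, at most max_words - 1 times.
        for nextword in my_words[i + 1:i + max_words]:
            current = current + " " + nextword
            phrases.append(current)
    return phrases

print(phrase_builder("the big fat cat".split(), max_words=3))
```

Tracking the running phrase in a local variable means the one-word phrase is never added to the list, so there is nothing to remove afterwards.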

1

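I think the simplest approach is to iterate over all possible start and end positions in the words list and generate a phrase from each sublist of words: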

def phrase_builder(words):
    for start in range(0, len(words)-1):
        for end in range(start+2, len(words)+1):
            yield ' '.join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
for phrase in phrase_builder(text.split()):
    print(phrase)

Output:

the big 
the big fat 
... 
the big fat cat sits on the mat eating a rat 
... 
sits on the mat eating a 
... 
eating a rat 
a rat 
+0

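You beat me to it. Apart from the function name, this is almost exactly the answer I was writing. I like the itertools version because itertools is great. Still, I really like the explicit readability of this one. –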
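
As a quick sanity check on the approaches in this thread: a list of n words has n(n-1)/2 contiguous phrases of two or more words, so the 11-word example should yield 55. A small Python 3 sketch comparing the nested-loop generator with the itertools.combinations version (the helper name phrases_via_combinations is made up for this comparison):

```python
from itertools import combinations

def phrase_builder(words):
    # Nested loops over start and (exclusive) end positions.
    for start in range(len(words) - 1):
        for end in range(start + 2, len(words) + 1):
            yield " ".join(words[start:end])

def phrases_via_combinations(words):
    # Index pairs (s, e) with s < e; e is the inclusive last-word index here.
    return [" ".join(words[s:e + 1]) for s, e in combinations(range(len(words)), 2)]

words = "the big fat cat sits on the mat eating a rat".split()
nested = list(phrase_builder(words))
n = len(words)
print(len(nested), n * (n - 1) // 2)
print(nested == phrases_via_combinations(words))
```

Both versions enumerate start positions in order and, for each start, grow the phrase one word at a time, so they agree on ordering as well as content.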
