2017-02-08
Find all word sequences with Python

I'm working in Python 3.6 with NLTK 3.2.

I'm trying to write a program that takes raw text as input and outputs any (maximal) consecutive sequence of words that begin with the same letter (i.e. an alliterative sequence). I want certain words and punctuation (such as 'it', 'that', 'into', "'s", ',' and '.') to be ignored when judging alliteration, but still included in the output.

For example, the input

"The door was ajar. So it seems that Sam snuck into Sally's subaru." 

should produce

["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"] 

I'm new to programming, and the best I could come up with is:

import nltk
from nltk import word_tokenize

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."

tokened_text = word_tokenize(raw)                 # word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text]  # make it lowercase

for w in tokened_text:                            # for each word of the text
    letter = w[0]                                 # consider its first letter
    allit_str = []
    allit_str.append(w)                           # add that word to a list
    pos = tokened_text.index(w)                   # let "pos" be the position of the word being considered
    for i in range(1, len(tokened_text) - pos):   # consider the next word
        if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}:  # if it's one of these
            allit_str.append(tokened_text[pos+i]) # add it to the list
            i=+1                                  # and move on to the next word
        elif tokened_text[pos+i][0] == letter:    # or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i]) # add the word to the list
            i=+1                                  # and move on to the next word
        else:                                     # or else, if the letter is different
            break                                 # break the for loop
    if len(allit_str) >= 2:                       # if the list has two or more members
        print(allit_str)                          # print it

which outputs

['ajar', '.'] 
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['sally', "'s", 'subaru', '.'] 
['subaru', '.'] 

This is close to what I want, except that I don't know how to restrict the program to printing only the maximal sequences.

So my questions are:

  1. How can I modify this code so that it prints only the maximal sequence ['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
  2. Is there an easier way to do this in Python, perhaps using regular expressions or more elegant code?

Similar questions have been asked elsewhere, but they have not helped me fix my code:

(I also think it would be nice to have this question answered on this site.)


To avoid duplicates, scan the string only once. Get rid of the for loops and use an index to scan the string. Keep track of the index of the last non-ignored word and its first letter. When you find a word with a different first letter, decide whether the sequence is long enough to print. – alexis
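A minimal sketch of that single-pass idea, assuming the tokens have already been lowercased as in the question. The ignore set is the question's own; `maximal_runs` and `min_len` are illustrative names, and requiring at least `min_len` non-ignored words per run is one possible design choice:

```python
IGNORE = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}

def maximal_runs(tokens, min_len=2):
    """Collect maximal runs of words sharing a first letter.
    Ignored words ride along inside a run but don't start or extend one."""
    runs, current, letter = [], [], None
    for w in tokens:
        if w in IGNORE:
            if current:                      # attach ignored words to the open run
                current.append(w)
            continue
        if w[0] == letter:                   # same initial: extend the current run
            current.append(w)
        else:                                # new initial: flush and start over
            if sum(t not in IGNORE for t in current) >= min_len:
                runs.append(current)
            current, letter = [w], w[0]
    if sum(t not in IGNORE for t in current) >= min_len:
        runs.append(current)                 # flush the final run
    return runs

tokens = ['the', 'door', 'was', 'ajar', '.', 'so', 'it', 'seems', 'that',
          'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
print(maximal_runs(tokens))
# [['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']]
```

Because each run starts only at a non-ignored word and is flushed exactly once, no nested sub-sequences are printed.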


Also, your current code is buggy: if a word appears twice in the sentence, 'tokened_text.index()' will always find its first position. – alexis
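A quick illustration of that pitfall (the token list here is made up for the demo):

```python
# list.index() always returns the position of the FIRST matching element,
# so a repeated word gets mapped to the wrong position.
tokens = ["sam", "snuck", "into", "sally", "snuck"]

print([tokens.index(w) for w in tokens])  # [0, 1, 2, 3, 1] -- second "snuck" mislocated
print([i for i, w in enumerate(tokens)])  # [0, 1, 2, 3, 4] -- enumerate() tracks the true index
```

Using enumerate() (or an explicit index counter) avoids the problem entirely.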

Answer


Fun task. Personally, I would loop without using indices, keeping track of the previous word to compare it with the current one.

Also, comparing initial letters alone is not enough; you have to treat sounds like 's' and 'sh' as distinct so they don't count as alliterating. Here is my attempt:

import nltk 
from nltk import word_tokenize 
from nltk import sent_tokenize 
from nltk.corpus import stopwords 
import string 
from collections import defaultdict, OrderedDict 
import operator 

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon." 

# Get the English alphabet as a list of letters 
letters = [letter for letter in string.ascii_lowercase] 

# Here we add some extra phonemes that are distinguishable in text. 
# ('sailboat' and 'shark' don't alliterate, for instance) 
# Digraphs go first as we need to try matching these before the individual letters, 
# and break out if found. 
sounds = ["ch", "ph", "sh", "th"] + letters 

# Use NLTK's built in stopwords and add "'s" to them 
stopwords = stopwords.words('english') + ["'s"] # add extra stopwords here 
stopwords = set(stopwords) # sets are MUCH faster to process 

sents = sent_tokenize(raw) 

alliterating_sents = defaultdict(list) 
for sent in sents:
    tokenized_sent = word_tokenize(sent)

    # Create list of alliterating word sequences
    alliterating_words = []
    previous_initial_sound = ""
    for word in tokenized_sent:
        for sound in sounds:
            if word.lower().startswith(sound): # only lowercasing when comparing retains original case
                initial_sound = sound
                if initial_sound == previous_initial_sound:
                    if len(alliterating_words) > 0:
                        if previous_word == alliterating_words[-1]: # prevents duplication in chains of more than 2 alliterations (but assumes repetition is not alliteration)
                            alliterating_words.append(word)
                        else:
                            alliterating_words.append(previous_word)
                            alliterating_words.append(word)
                    else:
                        alliterating_words.append(previous_word)
                        alliterating_words.append(word)
                break # Allows us to treat sh/s distinctly

        # This needs to be at the end of the loop
        # It sets us up for the next iteration
        if word not in stopwords: # ignores stopwords for the purpose of determining alliteration
            previous_initial_sound = initial_sound
            previous_word = word

    alliterating_sents[len(alliterating_words)].append(sent)

sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True)) 

# OUTPUT 
print ("A sorted ordered dict of sentences by number of alliterations:") 
print (sorted_alliterating_sents) 
print ("-" * 15) 
max_key = max([k for k in sorted_alliterating_sents]) # to get sent with max alliteration 
print ("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key]) 

This produces a sorted ordered dict of sentences with their alliteration counts as keys. The variable max_key contains the count for the most alliterative sentence(s), and can be used to access the sentences themselves.
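For question 1 specifically, the maximal run can also be extracted quite compactly with itertools.groupby. This is a sketch, assuming lowercased tokens and the ignore set from the question; run_keys is an illustrative helper that lets ignored words inherit the letter of the run they sit in:

```python
from itertools import groupby

IGNORE = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}

def run_keys(tokens):
    """Assign each token the first letter of the last non-ignored word,
    so ignored words stay attached to the surrounding run."""
    keys, last = [], None
    for w in tokens:
        if w not in IGNORE:
            last = w[0]
        keys.append(last)
    return keys

tokens = ['the', 'door', 'was', 'ajar', '.', 'so', 'it', 'seems', 'that',
          'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']

# Group consecutive tokens by key, then pick the group with the most
# non-ignored (i.e. actually alliterating) words.
runs = [[w for w, _ in grp]
        for _, grp in groupby(zip(tokens, run_keys(tokens)), key=lambda p: p[1])]
best = max(runs, key=lambda r: sum(w not in IGNORE for w in r))
print(best)
```

Because groupby only merges consecutive items with equal keys, each group is maximal by construction, so no nested sub-sequences appear.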