2017-02-08
Find all word sequences with Python

I'm working in Python 3.6 with NLTK 3.2.

I'm trying to write a program that takes raw text as input and outputs any (maximal) consecutive sequence of words that begin with the same letter (i.e. an alliterative sequence). I want certain words and punctuation (such as 'it', 'that', 'into', "'s", ',' and '.') to be ignored when judging alliteration, but still included in the output.

For example, the input

"The door was ajar. So it seems that Sam snuck into Sally's subaru." 

should produce

["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"] 

I'm new to programming, and the best I could come up with is:

import nltk
from nltk import word_tokenize

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."

tokened_text = word_tokenize(raw)                 # word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text]  # make it lowercase

for w in tokened_text:                            # for each word of the text
    letter = w[0]                                 # consider its first letter
    allit_str = []
    allit_str.append(w)                           # add that word to a list
    pos = tokened_text.index(w)                   # let "pos" be the position of the word being considered
    for i in range(1, len(tokened_text) - pos):   # consider the next word
        if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}:  # if it's one of these
            allit_str.append(tokened_text[pos+i]) # add it to the list
            i=+1                                  # and move on to the next word
        elif tokened_text[pos+i][0] == letter:    # or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i]) # add the word to the list
            i=+1                                  # and move on to the next word
        else:                                     # or else, if the letter is different
            break                                 # break the for loop
    if len(allit_str) >= 2:                       # if the list has two or more members
        print(allit_str)                          # print it

which outputs

['ajar', '.'] 
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['snuck', 'into', 'sally', "'s", 'subaru', '.'] 
['sally', "'s", 'subaru', '.'] 
['subaru', '.'] 

This is close to what I want, except that I don't know how to restrict the program to printing only the maximal sequences.

So my questions are:

  1. How can I modify this code so that it prints only the maximal sequence ['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
  2. Is there an easier way to do this in Python, perhaps using regular expressions or more elegant code?

Similar questions have been asked elsewhere, but they have not helped me fix my code:

(I also think it would be nice to have this question answered on this site.)


To avoid duplicates, scan the string only once. Get rid of the for loops and use an index to scan the string. Keep track of the index of the last non-ignored word and its first letter. When you find a word with a different first letter, decide whether the sequence is long enough to print. – alexis
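A minimal sketch of that single-pass idea, assuming the tokens have already been lowercased as in the question. The ignore set is the question's own; `maximal_runs` and `min_len` are illustrative names, and requiring at least `min_len` non-ignored words per run is one possible design choice:

```python
IGNORE = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}

def maximal_runs(tokens, min_len=2):
    """Collect maximal runs of words sharing a first letter.
    Ignored words ride along inside a run but don't start or extend one."""
    runs, current, letter = [], [], None
    for w in tokens:
        if w in IGNORE:
            if current:                      # attach ignored words to the open run
                current.append(w)
            continue
        if w[0] == letter:                   # same initial: extend the current run
            current.append(w)
        else:                                # new initial: flush and start over
            if sum(t not in IGNORE for t in current) >= min_len:
                runs.append(current)
            current, letter = [w], w[0]
    if sum(t not in IGNORE for t in current) >= min_len:
        runs.append(current)                 # flush the final run
    return runs

tokens = ['the', 'door', 'was', 'ajar', '.', 'so', 'it', 'seems', 'that',
          'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
print(maximal_runs(tokens))
# [['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']]
```

Because each run starts only at a non-ignored word and is flushed exactly once, no nested sub-sequences are printed.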


Also, your current code is buggy: if a word appears twice in the sentence, 'tokened_text.index()' will always find its first position. – alexis
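A quick illustration of that pitfall (the token list here is made up for the demo):

```python
# list.index() always returns the position of the FIRST matching element,
# so a repeated word gets mapped to the wrong position.
tokens = ["sam", "snuck", "into", "sally", "snuck"]

print([tokens.index(w) for w in tokens])  # [0, 1, 2, 3, 1] -- second "snuck" mislocated
print([i for i, w in enumerate(tokens)])  # [0, 1, 2, 3, 4] -- enumerate() tracks the true index
```

Using enumerate() (or an explicit index counter) avoids the problem entirely.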

Answer


Fun task. Personally, I would loop without using indices, keeping track of the previous word to compare it with the current one.

Also, comparing initial letters alone is not enough; you have to treat sounds like 's' and 'sh' as distinct so they don't count as alliterating. Here is my attempt:

import nltk 
from nltk import word_tokenize 
from nltk import sent_tokenize 
from nltk.corpus import stopwords 
import string 
from collections import defaultdict, OrderedDict 
import operator 

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon." 

# Get the English alphabet as a list of letters 
letters = [letter for letter in string.ascii_lowercase] 

# Here we add some extra phonemes that are distinguishable in text. 
# ('sailboat' and 'shark' don't alliterate, for instance) 
# Digraphs go first as we need to try matching these before the individual letters, 
# and break out if found. 
sounds = ["ch", "ph", "sh", "th"] + letters 

# Use NLTK's built in stopwords and add "'s" to them 
stopwords = stopwords.words('english') + ["'s"] # add extra stopwords here 
stopwords = set(stopwords) # sets are MUCH faster to process 

sents = sent_tokenize(raw) 

alliterating_sents = defaultdict(list) 
for sent in sents:
    tokenized_sent = word_tokenize(sent)

    # Create list of alliterating word sequences
    alliterating_words = []
    previous_initial_sound = ""
    for word in tokenized_sent:
        for sound in sounds:
            if word.lower().startswith(sound): # only lowercasing when comparing retains original case
                initial_sound = sound
                if initial_sound == previous_initial_sound:
                    if len(alliterating_words) > 0:
                        if previous_word == alliterating_words[-1]: # prevents duplication in chains of more than 2 alliterations (but assumes repetition is not alliteration)
                            alliterating_words.append(word)
                        else:
                            alliterating_words.append(previous_word)
                            alliterating_words.append(word)
                    else:
                        alliterating_words.append(previous_word)
                        alliterating_words.append(word)
                break # Allows us to treat sh/s distinctly

        # This needs to be at the end of the loop
        # It sets us up for the next iteration
        if word not in stopwords: # ignores stopwords for the purpose of determining alliteration
            previous_initial_sound = initial_sound
            previous_word = word

    alliterating_sents[len(alliterating_words)].append(sent)

sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True)) 

# OUTPUT 
print ("A sorted ordered dict of sentences by number of alliterations:") 
print (sorted_alliterating_sents) 
print ("-" * 15) 
max_key = max([k for k in sorted_alliterating_sents]) # to get sent with max alliteration 
print ("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key]) 

This produces a sorted ordered dict of sentences with their alliteration counts as keys. The variable max_key contains the count for the most alliterative sentence(s), and can be used to access the sentences themselves.
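For question 1 specifically, the maximal run can also be extracted quite compactly with itertools.groupby. This is a sketch, assuming lowercased tokens and the ignore set from the question; run_keys is an illustrative helper that lets ignored words inherit the letter of the run they sit in:

```python
from itertools import groupby

IGNORE = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}

def run_keys(tokens):
    """Assign each token the first letter of the last non-ignored word,
    so ignored words stay attached to the surrounding run."""
    keys, last = [], None
    for w in tokens:
        if w not in IGNORE:
            last = w[0]
        keys.append(last)
    return keys

tokens = ['the', 'door', 'was', 'ajar', '.', 'so', 'it', 'seems', 'that',
          'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']

# Group consecutive tokens by key, then pick the group with the most
# non-ignored (i.e. actually alliterating) words.
runs = [[w for w, _ in grp]
        for _, grp in groupby(zip(tokens, run_keys(tokens)), key=lambda p: p[1])]
best = max(runs, key=lambda r: sum(w not in IGNORE for w in r))
print(best)
```

Because groupby only merges consecutive items with equal keys, each group is maximal by construction, so no nested sub-sequences appear.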