Finding all word sequences with Python
I am working in Python 3.6 with NLTK 3.2.
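I am trying to write a program that takes raw text as input and outputs every (maximal) sequence of consecutive words that begin with the same letter (i.e., alliterative sequences). When looking for sequences I want to ignore certain words and punctuation marks (e.g. 'it', 'that', 'into', "'s", ',', '.'), but still include them in the output.
For example, the input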
"The door was ajar. So it seems that Sam snuck into Sally's subaru."
should produce
["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]
I am new to programming, and the best I could come up with is:
import nltk
from nltk import word_tokenize

raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
tokened_text = word_tokenize(raw)  # word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text]  # make it lowercase
for w in tokened_text:  # for each word of the text
    letter = w[0]  # consider its first letter
    allit_str = []
    allit_str.append(w)  # add that word to a list
    pos = tokened_text.index(w)  # let "pos" be the position of the word being considered
    for i in range(1, len(tokened_text)-pos):  # consider the next word
        if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}:  # if it's one of these
            allit_str.append(tokened_text[pos+i])  # add it to the list
            i=+1  # and move on to the next word
        elif tokened_text[pos+i][0] == letter:  # or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i])  # add the word to the list
            i=+1  # and move on to the next word
        else:  # or else, if the letter is different
            break  # break the for loop
    if len(allit_str)>=2:  # if the list has two or more members
        print(allit_str)  # print it
which outputs
['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']
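This is close to what I want, except that I don't know how to restrict the program to printing only the maximal sequences.
So my questions are: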
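- How can I modify this code so that it prints only the maximal sequence ['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
- Is there an easier way to do this in Python, perhaps with regular expressions or more elegant code?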
Similar questions have been asked elsewhere, but none of them helped me modify my code:
- How do you effectively use regular expressions to find alliterative expressions?
- A reddit challenge asking for a similar program
- 4chan question regarding counting instances of alliteration
- Blog about finding most common alliterative strings in a corpus
(I also think it would be good to have this question answered on this site.)
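To avoid the repeated output, scan the string only once. Get rid of the for loops and use an index to scan through the string. Keep track of the index of the last non-ignored word and of its first letter. When you find a word whose first letter is different, decide whether the sequence you have is long enough to print. – alexis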
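The single-pass scan described in the comment above could look roughly like the sketch below. It is only one possible reading of that suggestion, not alexis's own code. The function name `alliterative_runs`, the `min_words` threshold, and the choice to drop ignored tokens at the end of a run (which matches the first expected output in the question, without the trailing '.') are all assumptions. The token list is written out by hand to match the tokenization shown in the question's output, so the sketch runs without NLTK.

```python
# Ignore set copied from the question's code.
IGNORED = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}


def alliterative_runs(tokens, ignored=IGNORED, min_words=2):
    """Return maximal runs whose non-ignored tokens share a first letter.

    Hypothetical helper: ignored tokens inside a run are kept,
    ignored tokens before or after a run are dropped.
    """
    tokens = list(tokens)
    runs = []
    start = last = None   # first and last non-ignored index of the current run
    letter = None         # first letter shared by the current run
    count = 0             # number of non-ignored words in the current run

    # The trailing None is a sentinel that flushes the final run.
    for i, tok in enumerate(tokens + [None]):
        if tok is not None and tok in ignored:
            continue                      # ignored tokens never start or end a run
        if tok is not None and tok[0] == letter:
            last, count = i, count + 1    # the run continues
            continue
        if count >= min_words:            # the run just ended: keep it if long enough
            runs.append(tokens[start:last + 1])
        if tok is not None:
            start, last, letter, count = i, i, tok[0], 1
    return runs


tokens = ["the", "door", "was", "ajar", ".", "so", "it", "seems", "that",
          "sam", "snuck", "into", "sally", "'s", "subaru", "."]
result = alliterative_runs(tokens)
print(result)
```

Because each token is visited exactly once and a run is only emitted when it ends, the sub-sequences starting at 'seems', 'sam', and so on are never produced. To keep trailing ignored tokens such as the final '.', the slice end would have to be extended up to the token that breaks the run.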
Also, your current code is buggy: if a word occurs twice in the sentence, 'tokened_text.index()' will always find the first position. – alexis
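The effect of list.index() on repeated tokens can be seen in a small, self-contained sketch. The token list here is made up for illustration. In the question's own sentence the token '.' occurs twice, so the same thing happens there. Using enumerate() to carry the position along with the token is one common way to avoid the lookup.

```python
tokens = ["big", "bad", "cat", "big", "bear"]  # made-up example with a repeated word

# list.index() returns the position of the FIRST match,
# so the second "big" is reported at position 0 instead of 3.
positions_via_index = [tokens.index(w) for w in tokens]

# enumerate() yields the actual position of each token as the loop advances.
positions_via_enumerate = [pos for pos, w in enumerate(tokens)]

print(positions_via_index)      # [0, 1, 2, 0, 4]
print(positions_via_enumerate)  # [0, 1, 2, 3, 4]
```

In the question's loop this would mean replacing `for w in tokened_text:` plus the `index()` call with `for pos, w in enumerate(tokened_text):`.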