Using sent_tokenize on a specific region of a file in Python with NLTK?

I have a file containing thousands of sentences, and I want to find the sentences that contain a specific character or word.

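Initially, I tokenized the whole file with sent_tokenize and then iterated over the sentences to look for the word. However, this is too slow. Since I can find the indices of the words very quickly, can I use that to my advantage? Is there a way to tokenize only the region around a word (that is, to find out which sentence contains a given word)?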

Thanks.

Edit: I am working in Python and using the NLTK library.

2012-12-06 user1881006

Which platform are you using? On Unix/Linux/macOS/Cygwin, you can do the following:

sed 's/[.?!]/\n/g' < myfile | grep 'myword'

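This will show only the lines that contain your word (sed does a very rough job of splitting the text into sentences). If you want a solution in a specific language, you should say which one you are using!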

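Edit for Python:

The following works: it calls the tokenizer only when a regular expression matches your word (which is a very fast operation). This means you tokenize only the lines that contain the word you want: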

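import re
import os.path

from nltk.tokenize import sent_tokenize

myword = 'using'
fname = os.path.abspath('path/to/my/file')

# Compile once; re.escape guards against regex metacharacters in the word.
pattern = re.compile(r'\b' + re.escape(myword) + r'\b')

try:
    # The with block closes the file even if reading fails.
    with open(fname) as f:
        # Keep only the lines that contain the word.
        matching_lines = [line for line in f if pattern.search(line)]
    for match in matching_lines:
        # do something with matching lines
        sents = sent_tokenize(match)
except IOError:
    print("Can't open file " + fname)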


2012-12-06 10:09:48

Oops, I am using Python and the NLTK library. – user1881006

I'll add a Python version, then. –

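Thanks for the update. But I don't have line breaks between sentences. My problem is that I have one huge block of text and I don't know where the boundaries are (and therefore how much text around the word I need to include). – user1881006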

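Here is an idea that might speed up the search. You can create an additional list that stores a running total of the word counts of the sentences in the big text. Using a generator function that I learned from Alex Martelli, try something like this: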

from nltk.tokenize import sent_tokenize

def running_sum(a):
    # Generator that yields the cumulative total after each item.
    tot = 0
    for item in a:
        tot += item
        yield tot

sen_list = sent_tokenize(bigtext)  # bigtext: the full text as one string
wc = [len(s.split()) for s in sen_list]
runningwc = list(running_sum(wc))  # running total of words up to and including each sentence

word_index = 0  # placeholder: the 0-based word index of your target word

for index, w in enumerate(runningwc):
    if w > word_index:
        sentnumber = index  # first sentence whose running total exceeds the word index
        break

print(sen_list[sentnumber])

Hope this idea helps.
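
If the text is queried many times, the lookup over the running totals could also be done with a binary search instead of a linear scan. The following is only a sketch using the standard-library bisect and itertools modules; the helper name sentence_of_word and the sample sentences are made up for illustration, and the word index is assumed to be 0-based and counted with str.split(), as in the word counts above.

```python
from bisect import bisect_right
from itertools import accumulate


def sentence_of_word(running_totals, word_index):
    """Return the index of the sentence that contains the 0-based word_index.

    running_totals[i] is the number of words up to and including sentence i.
    """
    total = running_totals[-1] if running_totals else 0
    if not 0 <= word_index < total:
        raise IndexError("word_index is outside the text")
    # Position of the first running total that is strictly greater than word_index.
    return bisect_right(running_totals, word_index)


# Hypothetical pre-split sentences; in practice these would come from sent_tokenize.
sentences = ["It rained all day.", "We kept using the old map!", "Nobody complained."]
running_totals = list(accumulate(len(s.split()) for s in sentences))

print(sentences[sentence_of_word(running_totals, 6)])
```

This still requires tokenizing the whole text once, so it only helps when the tokenization cost is paid a single time and many words are looked up afterwards.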

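UPDATE: If sent_tokenize is the slow part, you can try avoiding it altogether. Use the known index to locate the word in the big text.

Now move backward and forward from that position, character by character, to detect the start and the end of the sentence. Something like "[.!?] " (a period, exclamation mark, or question mark, followed by a space) would mark where a sentence ends and the next one begins. You will only be searching in the vicinity of the target word, so it should be much faster than sent_tokenize.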
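
The outward scan described above might look roughly like this. It is a minimal sketch, not a replacement for sent_tokenize: the function name sentence_around and the sample text are hypothetical, the index is assumed to be a character offset into the text, and a boundary is assumed to be any of ".!?" followed by whitespace or the end of the text. Abbreviations such as "Dr. Smith" would be split incorrectly by this rule.

```python
def sentence_around(text, index, terminators=".!?"):
    """Return the rough sentence of `text` that contains character offset `index`."""

    def is_boundary(i):
        # A terminator only counts when followed by whitespace or the end of the text.
        return text[i] in terminators and (i + 1 == len(text) or text[i + 1].isspace())

    # Walk backward until the previous character is a sentence boundary.
    start = index
    while start > 0 and not is_boundary(start - 1):
        start -= 1

    # Walk forward until the current character is a sentence boundary.
    end = index
    while end < len(text) - 1 and not is_boundary(end):
        end += 1

    return text[start:end + 1].strip()


# Made-up sample text without line breaks between sentences.
sample = "It rained all day. We kept using the old map! Nobody complained."
position = sample.find("using")  # character offset of the target word
print(sentence_around(sample, position))
```

Only the characters between the two neighbouring boundaries are visited, so the cost depends on the length of the sentence rather than on the length of the whole text.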


2012-12-07 07:57:02

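Thanks for the idea! I'll have to take a closer look tomorrow, but I think the slowest part for me is actually 'sen_list = sent_tokenize(bigtext)' (the tokenizer, that is). Surprisingly, iterating over the sentences isn't too bad, though I do like your idea. – user1881006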

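Yes, I wish sent_tokenize could search in the vicinity of a word (working outward from there). I really do need sent_tokenize, because it is smart enough to handle abbreviations and the like with NLP. – user1881006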
