Determining context from text using pandas

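I have built a web scraper that collects my data. The data is usually structured, but there are some anomalies. To run an analysis on top of the data, I am searching for a few words, i.e. searched_words=['word1','word2','word3'......], and I want the sentences in which these words appear. So I wrote the following: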

searched_words=['word1','word2','word3'......] 

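import re
from nltk.tokenize import sent_tokenize, word_tokenize

fsa = re.compile('|'.join(re.escape(w.lower()) for w in searched_words))
# Keep every sentence that contains at least one of the searched words.
str_df['context'] = str_df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
    if any(w.lower() in searched_words for w in word_tokenize(sent))])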

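This works, but the problem I am facing is that when the space after a full stop is missing in the text, the sentences on both sides of that full stop come back together as a single sentence.

Example: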

searched_words = ['snakes','venomous'] 
text = "I am afraid of snakes.I hate them." 
output : ['I am afraid of snakes.I hate them.'] 
Desired output : ['I am afraid of snakes.']

2016-11-30 user7140275

Can you show or share a sample of the data you are working with?

@RohanAmrute It is similar to the example I have already given in the question. – user7140275

What is happening inside tokenize()? Could you replace '.' with '. ' (a dot followed by a space)? – themistoklik
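The suggestion in the comment above could be sketched roughly as follows. This is only an illustration, not the asker's actual pipeline: the helper names add_space_after_full_stop and sentences_with_words are made up here, and a naive regex split stands in for NLTK's sent_tokenize so that the snippet runs with the standard library alone. It assumes that a full stop directly followed by a letter always ends a sentence, which will also split things like "e.g" or "example.com".

```python
import re

def add_space_after_full_stop(text):
    # Insert a space after a full stop that is directly followed by a letter.
    # Assumption: such a full stop ends a sentence. Digits are left alone,
    # so numbers like 3.14 are not touched.
    return re.sub(r'\.(?=[A-Za-z])', '. ', text)

def sentences_with_words(text, searched_words):
    # Normalise the spacing first, then split into sentences.
    normalized = add_space_after_full_stop(text)
    # Naive stand-in for a sentence tokenizer: split on whitespace
    # that follows '.', '!' or '?'.
    sentences = re.split(r'(?<=[.!?])\s+', normalized)
    wanted = {w.lower() for w in searched_words}
    # Keep the sentences that contain at least one searched word.
    return [s for s in sentences
            if wanted & {t.lower() for t in re.findall(r'\w+', s)}]

text = "I am afraid of snakes.I hate them."
print(sentences_with_words(text, ['snakes', 'venomous']))
# ['I am afraid of snakes.']
```

In the question's setup, the same normalisation could be applied to each text before it is passed to sent_tokenize.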

If all the tokenizers (including NLTK) fail, you can take matters into your own hands and try

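import re

s = 'I am afraid of snakes.I hate venomous them. Theyre venomous.'

def findall(s, p):
    # Start index of every match of pattern p in s.
    return [m.start() for m in re.finditer(p, s)]

def find(sent, word):
    res = []
    indexes = findall(sent, word)

    for index in indexes:
        # Walk backwards to the previous full stop (or the start of the text).
        i = index
        while i > 0:
            if sent[i] != '.':
                i -= 1
            else:
                break
        end = index + len(word)

        # Position of the next full stop after the word; if there is none,
        # run to the end of the text.
        offset = sent[end:].find('.')
        nextFullStop = end + offset if offset != -1 else len(sent)

        # The slice starts on the previous full stop and ends before the next one.
        res.append(sent[i:nextFullStop])
    return res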

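Play with it here. Some dots are still left in the output, because I don't know what you want to do with them.

It finds every occurrence of the word and returns the text from the previous full stop up to the next one. This is only meant for the edge cases, but you can easily adjust it as needed.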
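One possible adjustment, sketched under the assumption that each returned sentence should drop the preceding full stop and keep its own closing one. The name find_sentences is hypothetical and not part of the answer above; it uses str.rfind and str.find instead of the manual backward loop.

```python
import re

def find_sentences(text, word):
    # Hypothetical variant of find(): returns the sentence around every
    # occurrence of word, without the leading full stop and with the
    # closing one. A sentence containing the word twice is returned twice.
    results = []
    for match in re.finditer(re.escape(word), text):
        # Start just after the previous full stop; rfind returns -1 when
        # there is none, which makes the start index 0.
        start = text.rfind('.', 0, match.start()) + 1
        # End just after the next full stop, or at the end of the text.
        stop = text.find('.', match.end())
        end = len(text) if stop == -1 else stop + 1
        results.append(text[start:end].strip())
    return results

s = 'I am afraid of snakes.I hate venomous them. Theyre venomous.'
print(find_sentences(s, 'venomous'))
# ['I hate venomous them.', 'Theyre venomous.']
print(find_sentences(s, 'snakes'))
# ['I am afraid of snakes.']
```

Like the original, this treats every full stop as a sentence boundary, so abbreviations and decimal numbers would still need separate handling.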

2016-11-30 10:17:24 themistoklik
