2016-02-29 144 views
2

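Fuzzy search in Python

I have a large sample of text, for example:

"The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is."

I am trying to detect whether "engage the prognosis for survival" occurs in the text, but in a fuzzy way. For example, "has engage the progronosis of survival" must also return a positive answer.

I looked at fuzzywuzzy, NLTK and the fuzzy features of the new regex module, but I did not find a way to do: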

if [anything similar (>90%) to "that sentence"] in mybigtext: 
    print True 
+0

I'm new here, but I think this should solve your problem: http://stackoverflow.com/questions/30449452/python-fuzzy-text-search?rq=1

+0

Take a look at [gensim](https://radimrehurek.com/gensim/index.html), in particular the [similarity section](https://radimrehurek.com/gensim/tut3.html). – Jan

Answers

0

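Below is a function that reports a match whenever a word of the phrase is contained in the text. You can adapt it so that it checks for complete phrases in the text.

This is the function I came up with: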

def FuzzySearch(text, phrase):
    """Check if word in phrase is contained in text"""
    phrases = phrase.split(" ")

    # Report every word of the phrase that occurs somewhere in the text.
    for word in phrases:
        if word in text:
            print("Match! Found " + word + " in text")
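
One way to extend this word-by-word check to whole phrases is sketched below. It is not part of the original answer: it assumes the standard-library `difflib.SequenceMatcher` as the similarity measure, and the names `best_fuzzy_window`, `fuzzy_contains` and the default threshold of 0.8 are hypothetical choices for illustration. It slides a window with as many words as the phrase across the text and keeps the most similar window.

```python
from difflib import SequenceMatcher


def best_fuzzy_window(text, phrase):
    """Return (ratio, window) for the run of words in text most similar to phrase."""
    words = text.split()
    size = len(phrase.split())
    best_ratio, best_window = 0.0, ""
    # Slide a window with as many words as the phrase across the text.
    for start in range(max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size])
        ratio = SequenceMatcher(None, phrase.lower(), window.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_window = ratio, window
    return best_ratio, best_window


def fuzzy_contains(text, phrase, threshold=0.8):
    """True if some window of text is at least `threshold` similar to phrase.

    The 0.8 default is an assumption; tune it for your data.
    """
    return best_fuzzy_window(text, phrase)[0] >= threshold


text = ("The arterial high blood pressure may engage the prognosis for "
        "survival of the patient as a result of complications.")
phrase = "has engage the progronosis of survival"
print(fuzzy_contains(text, phrase))
print(best_fuzzy_window(text, phrase))
```

The window has a fixed word count, so a query that is much longer or shorter than the matching passage will score lower. Trying a few window sizes around the phrase length would be a natural refinement.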
+0

Yeah, that was my first guess, but it gives no way to make the match fuzzy at the sentence level...

1

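The following is not ideal, but it should get you started. It first uses nltk to split the text into words, then builds a set containing the stems of all the words, filtering out any stop words. It does this for both your example text and your example query.

If the intersection of the two sets contains all the words in the query, it is considered a match.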

import nltk 

from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords 

stop_words = stopwords.words('english') 
ps = PorterStemmer() 

def get_word_set(text): 
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words) 

text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous." 
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous." 

query = "engage the prognosis for survival" 

set_query = get_word_set(query) 
for text in [text1, text2]: 
    set_text = get_word_set(text) 
    intersection = set_query & set_text 

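    # Single-argument print calls behave the same on Python 2 and Python 3.
    print("Query: {}".format(set_query))
    print("Test: {}".format(set_text))
    print("Intersection: {}".format(intersection))
    print("Match: {}".format(len(intersection) == len(set_query)))
    print("")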

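The script is given two texts, one that passes and one that does not, and it produces the following output (from Python 2) to show you what it is doing: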

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'prognosi', u'engag', u'surviv']) 
Match: True 

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'engag', u'surviv']) 
Match: False 
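
The all-or-nothing test on the intersection can be relaxed into a graded score. The sketch below is an illustration rather than the answer's method: to stay dependency-free it swaps nltk's tokenizer, stemmer and stop-word corpus for a plain regular expression and a tiny hypothetical stop-word set, so it does no stemming. The names `word_set`, `query_coverage` and `STOP_WORDS` are made up for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "for", "of", "a", "as", "may"}


def word_set(text):
    """Lower-cased alphabetic tokens of text, minus the stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def query_coverage(query, text):
    """Fraction of the query's words that also occur in text (0.0 to 1.0)."""
    query_words = word_set(query)
    if not query_words:
        return 0.0
    return float(len(query_words & word_set(text))) / len(query_words)


query = "engage the prognosis for survival"
with_prognosis = "The arterial high blood pressure may engage the prognosis for survival of the patient"
without_prognosis = "The arterial high blood pressure may engage the for survival of the patient"

print(query_coverage(query, with_prognosis))
print(query_coverage(query, without_prognosis))
```

A coverage of 1.0 corresponds to the `Match: True` case above. A lower threshold would let a text match when only some of the query words are present, at the cost of more false positives.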
+0

Yes, I had thought of that possibility! If I really can't find any other solution, I'll use that one. Thanks!

1

Using the regex module, first split the text into sentences, then test whether the fuzzy pattern is found in each sentence:

import regex

tgt="The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

# Split into sentences: end punctuation, whitespace, then an uppercase letter.
for sentence in regex.split(r'(?<=[.?!;])\s+(?=\p{Lu})', tgt):
    # The allowed error count is one fifth of the length of the pattern string.
    pat = r'(?e)((?:has engage the progronosis of survival){e<%i})'
    pat = pat % int(len(pat) / 5)
    m = regex.search(pat, sentence)
    if m:
        print("'{}'\n\tfuzzy matches\n'{}'\n\twith \n{} substitutions, {} insertions, {} deletions".format(pat, m.group(1), *m.fuzzy_counts))

Prints:

'(?e)((?:has engage the progronosis of survival){e<10})' 
    fuzzy matches 
'may engage the prognosis for survival' 
    with 
3 substitutions, 1 insertions, 2 deletions 
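
The error budget can also be passed in explicitly instead of being derived from the length of the pattern string. The following is a minimal sketch, assuming the third-party `regex` module is installed. The helper name `fuzzy_find` and its `max_errors` parameter are hypothetical, and the `(?b)` (best match) flag is swapped in for the `(?e)` flag used above.

```python
import regex


def fuzzy_find(phrase, sentence, max_errors):
    """Search sentence for phrase, allowing at most max_errors edits in total."""
    # regex.escape keeps the phrase literal; {e<=N} caps the combined number
    # of substitutions, insertions and deletions.
    pat = r'(?b)(?:%s){e<=%d}' % (regex.escape(phrase), max_errors)
    return regex.search(pat, sentence)


sentence = ("The arterial high blood pressure may engage the prognosis for "
            "survival of the patient as a result of complications.")
phrase = "has engage the progronosis of survival"

for budget in (1, 10):
    m = fuzzy_find(phrase, sentence, budget)
    if m:
        print(budget, repr(m.group(0)), m.fuzzy_counts)
    else:
        print(budget, "no match")
```

A small budget rejects the sentence while a generous one accepts it, so the budget is the knob for how loose the match may be. The module also accepts separate limits per error type, such as `{s<=2,i<=1,d<=1}`.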
+0

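So by playing with the fuzziness numbers, for example by limiting them, I can make something that tells the difference between 'has engage the prognosis' and 'do not engage the prognosis'. This seems perfect, thanks! I'll try to solve my problem with this and see whether that is the case.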