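"The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is."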



if [anything similar (>90%) to "that sentence"] in mybigtext: 
    print True 
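
One possible reading of the pseudocode above, as a rough sketch only: slide a window of as many words as the phrase over the text and compare each window with the standard library's `difflib.SequenceMatcher`. The name `contains_similar` and the 0.9 threshold are assumptions made for illustration; they are not from the question.

```python
from difflib import SequenceMatcher

def contains_similar(text, phrase, threshold=0.9):
    """Return True if some run of words in text is at least `threshold` similar to phrase."""
    words = text.split()
    size = len(phrase.split())
    target = phrase.lower()
    # Compare every window of `size` consecutive words against the phrase.
    for start in range(max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size]).lower()
        if SequenceMatcher(None, window, target).ratio() >= threshold:
            return True
    return False

mybigtext = "The arterial high blood pressure may engage the prognosis for survival of the patient."
print(contains_similar(mybigtext, "engage the prognosis for survival"))
print(contains_similar(mybigtext, "engage the progronosis for survival"))  # misspelled query
print(contains_similar(mybigtext, "completely unrelated words here now"))
```

This compares characters, not meaning, so a window that differs by a short but important word can still score highly.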

I'm new here, but I think this should solve your problem: http://stackoverflow.com/questions/30449452/python-fuzzy-text-search?rq=1


Have a look at [gensim](https://radimrehurek.com/gensim/index.html), especially the [similarity section](https://radimrehurek.com/gensim/tut3.html). – Jan





def FuzzySearch(text, phrase):
    """Report every word of phrase that occurs in text (plain substring test)."""
    for word in phrase.split():
        if word in text:
            print("Match! Found " + word + " in text")

Yeah, that was my first guess, but there is no way to make it fuzzy at the sentence level...




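import nltk  # needs NLTK's tokenizer and stop-word data, e.g. nltk.download('punkt') and nltk.download('stopwords')

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
ps = PorterStemmer()

def get_word_set(text):
    """Return the set of stemmed tokens in text, with English stop words removed."""
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words)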

text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous." 
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous." 

query = "engage the prognosis for survival" 

set_query = get_word_set(query) 
for text in [text1, text2]: 
    set_text = get_word_set(text) 
    intersection = set_query & set_text 

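    # str.format keeps the output identical whether print is a statement or a function
    print("Query: {}".format(set_query))
    print("Test: {}".format(set_text))
    print("Intersection: {}".format(intersection))
    print("Match: {}".format(len(intersection) == len(set_query)))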


Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'prognosi', u'engag', u'surviv']) 
Match: True 

Query: set([u'prognosi', u'engag', u'surviv']) 
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first']) 
Intersection: set([u'engag', u'surviv']) 
Match: False 
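
The test above is all-or-nothing: one missing stem turns the match into False. If a partial match should still count, the intersection can be turned into a ratio and compared with a threshold. The sketch below is only an illustration of that idea: `word_set` and `overlap_ratio` are hypothetical names, and a plain lowercase split stands in for the stemming and stop-word removal so that it runs without NLTK.

```python
import re

def word_set(text):
    """Lowercased alphabetic tokens; a crude stand-in for the stemmed word sets."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_ratio(query, text):
    """Fraction of the query's distinct words that also occur in text."""
    query_words = word_set(query)
    if not query_words:
        return 0.0
    return len(query_words & word_set(text)) / float(len(query_words))

query = "engage the prognosis for survival"
with_word = "may engage the prognosis for survival of the patient"
without_word = "may engage the for survival of the patient"

print(overlap_ratio(query, with_word))     # every query word is present
print(overlap_ratio(query, without_word))  # "prognosis" is missing
```

A caller could then accept anything above, say, 0.75; that cut-off is an assumption to tune, not a recommendation from the answer.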

Yes, I had thought of that possibility! If I really can't find any other solution, I'll use that one. Thanks!



tgt="The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency/effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous." 

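import regex  # third-party module (pip install regex), not the standard-library re

# Split the text into sentences, then fuzzy-search each one for the phrase.
for sentence in regex.split(r'(?<=[.?!;])\s+(?=\p{Lu})', tgt):
    pat = r'(?e)((?:has engage the progronosis of survival){e<%i})'
    pat = pat % (len(pat) // 5)  # error budget: one fifth of the pattern string's length
    m = regex.search(pat, sentence)
    if m:
        print("'{}'\n\tfuzzy matches\n'{}'\n\twith \n{} substitutions, {} insertions, {} deletions".format(pat, m.group(1), *m.fuzzy_counts))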


'(?e)((?:has engage the progronosis of survival){e<10})' 
    fuzzy matches 
'may engage the prognosis for survival' 
3 substitutions, 1 insertions, 2 deletions 

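So, by playing with the fuzzy counts, for example by limiting them, I could tell the difference between something like 'has engage the prognosis' and 'does not engage the prognosis'. That seems perfect, thanks! If that's the case, I'll try to solve my problem this way.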
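
The idea of capping the number of errors can be tried out without the `regex` module. The sketch below uses a hand-written Levenshtein distance as a stand-in; it is not how `regex` computes its fuzzy counts, and `within_budget` and the budget of 2 are hypothetical choices made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def within_budget(candidate, phrase, max_errors):
    """True if candidate is at most max_errors edits away from phrase."""
    return edit_distance(candidate, phrase) <= max_errors

phrase = "has engage the prognosis"
for candidate in ["has engage the prognosis",
                  "may engage the prognosis",
                  "does not engage the prognosis"]:
    print(candidate, within_budget(candidate, phrase, 2))
```

With a budget of 2, the negated sentence is rejected simply because it is five characters longer than the phrase, while the 'may' variant still passes. A tight budget separates the two cases here, but it would not catch a short negation such as a single inserted 'no'.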