Python - Check if part of a string is in a list

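The Dormouse's story. Once upon a time there were three little sisters; their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.... BADWORD ...

And I have a list of about 400 bad words: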

bad_words = ["badword", "badword1", ....]

What is the most efficient way to check whether the text contains a bad word from the badwords list?

I can loop over the text and the list, like:

# compare every word of the text against every bad word
for word in huge_string.split():
    for bw in bad_words_list:
        if bw in word:
            print("bad word is inside text")

But this looks to me like something from the 90s..

Update: the bad words are single words.

2014-12-23 doniyor

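So can it be a substring, or only an actual word? If whole words, use a set.

@PadraicCunningham actual words for now – doniyor

Have you tried 'set intersection'?

Turning your text into a set of words and computing its intersection with the set of bad words gives you amortized speed, since each set lookup takes constant time on average: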

text = "The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword..." 

badwords = set(["badword", "badword1", ....]) 

textwords = set(text.split())
for badword in badwords.intersection(textwords): 
    print("The bad word '{}' was found in the text".format(badword))

2014-12-23 12:46:28 inspectorG4dget

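I like this solution; it should be more efficient than nested for loops with 'word in text'. P.S.: you forgot an 'in' in the for loop. – LeartS

Perfect! Amortized speed is exactly what I need. Thanks – doniyor

@LeartS: Thanks for the bug report. Fixed now! – inspectorG4dget

There is no need to extract all the words of the text; you can check directly whether one string is inside another, like: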

In [1]: 'bad word' in 'do not say bad words!' 
Out[1]: True

So you can do this:

for bad_word in bad_words_list:
    if bad_word in huge_string:
        print("BAD!!")
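
One caveat with the plain 'in' test is that it matches substrings, so a harmless word that merely contains a bad word is flagged as well. A minimal sketch of that behaviour; the helper name contains_substring and the sample sentences are made up for illustration:

```python
def contains_substring(text, bad_words):
    """Return True if any bad word occurs anywhere in text, even inside a longer word."""
    return any(bad in text for bad in bad_words)


bad_words = ["badword", "bad"]

# "badword" is found inside "badwords!"
print(contains_substring("do not say badwords!", bad_words))
# "bad" is found inside the unrelated word "badge"
print(contains_substring("a badge of honour", bad_words))
# no bad word occurs anywhere in this text
print(contains_substring("nothing to see here", bad_words))
```

If only whole words should count, the text has to be split into words first.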

2014-12-23 12:44:39 LeartS

Like this:

st = set(s.split()) 

bad_words = ["badword", "badword1"] 
any(bad in st for bad in bad_words)

Or, if you want the words themselves:

st = set(s.split()) 

bad_words = {"badword", "badword1"} 
print(st.intersection(bad_words))

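If a bad word ends a sentence, as in badword. or badword!, the set approach will fail; you would then actually have to go through every word in the string and check whether any bad word equals the word or is a substring of it.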

st = s.split() 
any(bad in word for word in st for bad in bad_words)
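
An alternative to the substring test is to strip the punctuation before building the set, so whole-word matching still works. A rough sketch, assuming a regex tokenizer is acceptable; the function name find_bad_words and the lower-casing are additions for illustration, not part of the answer above:

```python
import re


def find_bad_words(text, bad_words):
    """Return the set of bad words that appear in text as whole words."""
    # \w+ keeps runs of letters, digits and underscores, so "badword." and
    # "badword!" both produce the token "badword".
    tokens = set(re.findall(r"\w+", text.lower()))
    return tokens & {word.lower() for word in bad_words}


text = "They lived at the bottom of a well....BADWORD... the end!"
print(find_bad_words(text, ["badword", "badword1"]))
```

Note that this tokenizer also splits on apostrophes, so "Dormouse's" becomes "dormouse" and "s".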

2014-12-23 12:46:31

You can use any:

To test whether an item of bad_words occurs as a prefix/suffix (that is, as a substring):

>>> bad_words = ["badword", "badword1"] 
>>> text ="some text with badwords or not" 
>>> any(i in text for i in bad_words) 
True 
>>> text ="some text with words or not" 
>>> any(i in text for i in bad_words) 
False

This checks whether any item of bad_words is in text, as a substring.

To test for an exact match:

>>> text ="some text with badwords or not" 
>>> any(i in text.split() for i in bad_words) 
False 
>>> text ="some text with badword or not" 
>>> any(i in text.split() for i in bad_words) 
True

This checks whether any item of bad_words is in text.split(), that is, whether it is an exact item.

2014-12-23 12:46:41 fredtantini

s is the long string. Use the & operator or the set.intersection method.

In [123]: set(s.split()) & set(bad_words) 
Out[123]: {'badword'} 

In [124]: bool(set(s.split()) & set(bad_words)) 
Out[124]: True

Even better, use set.isdisjoint. It short-circuits as soon as a match is found.

In [127]: bad_words = set(bad_words) 

In [128]: not bad_words.isdisjoint(s.split()) 
Out[128]: True 

In [129]: not bad_words.isdisjoint('for bar spam'.split()) 
Out[129]: False
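
Because isdisjoint accepts any iterable, the words can also be fed in lazily. The sketch below assumes CPython, where the iterable is consumed one item at a time and the scan stops at the first hit; the function name has_bad_word and the regex tokenizer are illustrative choices:

```python
import re


def has_bad_word(text, bad_words):
    """Return True if any whole word of text is in bad_words."""
    bad = set(bad_words)
    # The generator yields one word at a time, so the text does not have to
    # be split into a full list before the check starts.
    words = (match.group(0) for match in re.finditer(r"\w+", text))
    return not bad.isdisjoint(words)


print(has_bad_word("for bar badword spam", ["badword", "badword1"]))
print(has_bad_word("for bar spam", ["badword", "badword1"]))
```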

2014-12-23 12:47:45

s = " a string with bad word" 
text = s.split() 

if any(bad_word in text for bad_word in ('bad', 'bad2')):
    print("bad word found")

2014-12-23 12:51:33

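Would that only print the last bad_word? any just returns True or False, depending on whether any element of the iterable is true.

On top of all the excellent answers above, the 'for now, whole words' clause in your comment points in the direction of regular expressions.

You may want to build a combined expression like bad|otherbad|yetanother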

import re

r = re.compile("|".join(badwords))
r.search(text)
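
As written, the combined expression matches substrings and treats any regex metacharacters in the words as syntax. A hedged sketch of a whole-word variant; the function name compile_bad_word_pattern and the case-insensitive flag are assumptions added here:

```python
import re


def compile_bad_word_pattern(bad_words):
    """Build one case-insensitive pattern that matches any bad word as a whole word."""
    # re.escape keeps characters such as "." or "+" in a word from being
    # interpreted as regex syntax; \b anchors the match at word boundaries.
    alternatives = "|".join(re.escape(word) for word in bad_words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


pattern = compile_bad_word_pattern(["badword", "badword1"])

match = pattern.search("lived at the bottom of a well....BADWORD...")
print(match.group(0) if match else None)

# "badwords" is a longer word, so the whole-word pattern does not match it
print(pattern.search("some text with badwords or not"))
```

The \b anchors only behave as expected when each bad word starts and ends with a word character.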

2014-12-23 12:56:56 xtofl

I would use the filter function:

filter(lambda s : s in bad_words_list, huge_string.split())
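
In Python 3, filter returns a lazy iterator rather than a list, so the result has to be consumed before the hits are visible. A small sketch with made-up sample data; storing the bad words in a set is an added assumption that keeps each lookup cheap:

```python
bad_words_list = {"badword", "badword1"}  # a set, so each membership test is a hash lookup
huge_string = "some text with badword or not"

# Wrap filter() in list() to materialise the matching words.
hits = list(filter(lambda word: word in bad_words_list, huge_string.split()))
print(hits)

# The same selection written as a list comprehension.
hits_lc = [word for word in huge_string.split() if word in bad_words_list]
print(hits_lc)
```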

2014-12-23 13:10:07 Riccardo
