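Here is an approach that should be quite efficient. We group the email addresses by length, so that each address only needs to be checked against the next length level down, using string slicing and a set membership check.
The code:
First, read in the data: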
import pandas as pd
import numpy as np
string = '''
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
'''
x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]
We strip off the @foo.bar part, then filter to keep only the addresses that end in a digit, and add a 'lengths' column:
#split on @, expand means into two columns
emails = x.x.str.split('@', expand = True)
#filter by last in string is a digit
emails = emails.loc[emails.loc[:,0].str[-1].str.isdigit(), :].copy()
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()
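For readers who want to see this preprocessing step in isolation, here is a minimal sketch in plain Python (no pandas). The function name `digit_ending_local_parts` and the sample addresses are hypothetical and are not taken from the question's data; the sketch assumes each address contains a single `@`.

```python
def digit_ending_local_parts(addresses):
    """Map each local part (the text before '@') that ends in a digit to its length."""
    lengths = {}
    for address in addresses:
        # Keep only the part before the '@'.
        local = address.partition("@")[0]
        # Keep it only if its last character is a digit.
        if local and local[-1].isdigit():
            lengths[local] = len(local)
    return lengths


# Hypothetical addresses, for illustration only.
sample_addresses = ["abc1@foo.bar", "abc@foo.bar", "abc12@foo.bar"]
print(digit_ending_local_parts(sample_addresses))
```

Using a dict also drops duplicate local parts, which roughly mirrors the de-duplication done on the data frame.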
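Now all we have to do is take each length together with length - 1, and see whether an address of the longer length, with its last character dropped, appears in the set of addresses of length n-1 (we also have to check the reverse direction, in case an address is the shortest duplicate in its group):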
#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)
#for each length
for j in lengths:
    #we subset those of that length
    totest = emails['lengths'] == j
    #and those who might be the shorter version
    against = emails['lengths'] == j -1
    #we make a set of unique values, for a hashed lookup
    againstset = set(emails.loc[against,0])
    #we cut off the last char of each in to test
    tests = emails.loc[totest,0].str[:-1]
    #we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
    #viceversa, otherwise we miss the smallest one in the group
    againstset = set(emails.loc[totest,0].str[:-1])
    tests = emails.loc[against,0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
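As an aside, the same length-bucket idea can be sketched in plain Python, which may make the two-directional check easier to follow. This is only an illustration under stated assumptions: the function name `flag_near_duplicates` and the sample local parts are hypothetical, and the input is assumed to be local parts that have already been filtered as described above.

```python
from collections import defaultdict


def flag_near_duplicates(local_parts):
    """Return the local parts that match another one after dropping a single trailing character."""
    # Bucket the unique local parts by length, as sets, for hashed lookups.
    by_length = defaultdict(set)
    for name in set(local_parts):
        by_length[len(name)].add(name)

    flagged = set()
    for length, names in by_length.items():
        # .get avoids inserting new keys into the dict while we iterate over it.
        shorter = by_length.get(length - 1, set())
        for name in names:
            # Drop the last character and look it up one length level down.
            if name[:-1] in shorter:
                flagged.add(name)       # the longer member of the pair
                flagged.add(name[:-1])  # the shorter member (the reverse check)
    return flagged


# Hypothetical local parts, not taken from the question's data.
sample = ["abc1", "abc12", "abc123", "xyz9", "a.bc1"]
print(sorted(flag_near_duplicates(sample)))
```

Because each name is only compared against the bucket one length below it, the work stays close to linear in the number of addresses rather than comparing every pair.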
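The resulting mask can be converted to boolean and used to subset the original (de-duplicated) data frame; its index labels come from the original index, so we can subset like this: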
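#align the mask to x's index; rows missing from the mask count as non-matches
x.loc[~mask.reindex(x.index, fill_value=0).astype(bool),:]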
x
0 [email protected]
16 [email protected]
17 [email protected]
You can see that we did not remove your first value, because the punctuation in it means it does not match; you could strip punctuation first.
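One possible way to strip punctuation before comparing is sketched below in plain Python. The helper name `normalise_local_part` and the sample addresses are hypothetical, and the choice to remove every ASCII punctuation character is an assumption; you may want a narrower set depending on your data.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)


def normalise_local_part(address):
    """Return the part before '@' with ASCII punctuation removed."""
    # Split off the domain first, so the '@' itself is never part of the text we clean.
    local = address.partition("@")[0]
    return local.translate(_DROP_PUNCT)


# Hypothetical addresses, for illustration only.
print(normalise_local_part("a.bc1@foo.bar"))
print(normalise_local_part("abc_12@foo.bar"))
```

The normalised local parts could then be fed into the length-based matching in place of the raw ones.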
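To clarify: which of the email addresses in the example you provided would be flagged as fraudulent? – Nicarus
All of these examples are fraudulent – jeangelj
So [email protected] would be fine, but things like abc1, abc12 ... are all fraudulent? And they would only be fraudulent if [email protected] exists? – Nicarus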