2017-06-16 38 views
2

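Python: data cleaning - detecting patterns of fraudulent email addresses

I am cleaning a data set by removing fraudulent email addresses.

I have built several rules that catch duplicates and fraudulent domains. But there is one scenario where I don't know how to write a rule in Python to flag the addresses.

So I have rules like these, for example: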

import string
import numpy as np

# delete punctuation (apply returns a new Series; assign the result if you want to keep it)
df['email'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))

# flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

# flag duplicates
df['duplicate'] = df.email.duplicated(keep=False)

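Here is the data I can't come up with a rule to catch. Basically, I am looking for a way to flag addresses that start the same way but end in consecutive numbers.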

[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
+0

To clarify - which of the email addresses in the example you provided would be flagged as fraudulent? – Nicarus

+0

All of these examples are fraudulent – jeangelj

+0

So [email protected] would be fine, but things like abc1, abc12, ... would be fraudulent? Would they only be fraudulent if [email protected] exists? – Nicarus

Answers

1

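My solution is neither efficient nor pretty. But check it out and see whether it works for you, @jeangelj. It definitely works for the examples you provided. Good luck!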

import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...]  # for example the 16 email addresses you gave in your question
shuffle(emails)  # everyday i'm shuffling (the input order does not matter)
emails = sorted(emails)  # sort so that similar addresses end up next to each other
names = [email.split('@')[0] for email in emails]

T = 0.7  # <- set your string similarity threshold here!!

split_indices = []
for i in range(1, len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i)  # we want to remember where a dissimilar email address occurs

# slice the sorted list into groups, each starting where the previous one ended
grouped = []
start = 0
for i in split_indices:
    grouped.append(emails[start:i])
    start = i
grouped.append(emails[start:])

# now we have similar email addresses grouped, we want to find the common prefix
# of the local parts (the text before the '@') in each group
prefix_strings = []
for group in grouped:
    prefix_strings.append(os.path.commonprefix([e.split('@')[0] for e in group]))

# finally: the address whose local part equals its group's prefix is ham, the rest is spam
ham = []
spam = []
true_ids = [names.index(p) for p in prefix_strings if p in names]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])

In [30]: ham 
Out[30]: ['[email protected]', '[email protected]'] 

In [31]: spam 
Out[31]: 
['[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]'] 

# THE TRUTH YALL! 
1

First, take a look at the regex question here

Second, try filtering the email addresses like this:

import re

# Let's say the email is '[email protected]'
email = '[email protected]'
# keep only the part before the '@'
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'
m = re.search(r'\d+$', email_name)
# if the string ends in digits m will be a Match object, or None otherwise.
# a trailing run of digits is the suspicious pattern, so a match means BAD
if m is not None:
    print('%s is BAD' % email)
else:
    print('%s is good' % email)
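A trailing-digit check on its own cannot tell one legitimate numbered address from a whole family of them. One possible refinement, sketched below under assumptions that are not in the original answer, is to strip the trailing digits and only flag an address when several addresses share the same remaining stem and domain. The function name `flag_numbered_variants`, the `min_variants` parameter and the sample addresses are all made up for illustration.

```python
import re
from collections import defaultdict


def flag_numbered_variants(emails, min_variants=2):
    """Return addresses whose local part, minus its trailing digits, is shared
    by at least `min_variants` distinct addresses on the same domain."""
    groups = defaultdict(set)
    for email in emails:
        local, _, domain = email.partition('@')
        stem = re.sub(r'\d+$', '', local)  # drop the trailing digit run, if any
        if stem != local:  # only addresses that end in digits are candidates
            groups[(stem, domain)].add(email)
    flagged = set()
    for members in groups.values():
        if len(members) >= min_variants:
            flagged.update(members)
    return flagged


# made-up sample addresses, for illustration only
sample = ['attn1@example.com', 'attn12@example.com', 'attn123@example.com',
          'bob77@example.com', 'carol@example.com']
flagged = flag_numbered_variants(sample)
```

Here a single numbered address such as `bob77@example.com` is left alone because nothing else shares its stem. Raising `min_variants` makes the rule more conservative.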
+0

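Thank you! I had actually looked at that regex question before posting mine; I didn't find the answers flexible enough for all the iterations I am facing, but thank you very much for referencing it; I will test your solution now; how would this distinguish a legitimate email such as [email protected] from attn12, attn123, attn1234 and so on? – jeangelj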

+0

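How many unique addresses are you dealing with? If not too many, I think you could combine the two approaches: define a list() of the (unique) emails already in use, and once the regex has caught the ones ending in digits, check them the way @dman suggests. Although [email protected] is a problem, since [email protected] looks valid too. – pmus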

+0

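About 100,000 emails at a time, and different ones every month - I don't think it is feasible to define a list, if I understand correctly that one list would be applied every month – jeangelj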

1

You could pick a difference threshold and use edit distance (a.k.a. Levenshtein distance). In Python:

$ pip install editdistance
$ ipython2
>>> import editdistance
>>> threshold = 5 # This could be anything, really
>>> data = ["[email protected]", ...] # set up data to be the set you gave
>>> # compare each address with every *other* address (an address is always at distance 0 from itself)
>>> fraudulent_emails = {email for email in data for other in data if email != other and editdistance.eval(email, other) < threshold}

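If you want to be smarter about it, you could go through the resulting list and, instead of turning it into a set, keep track of how many other email addresses each one is close to - then use that count as a "weight" for deciding which ones are fake.

This catches not only the given case (where the fraudulent addresses all share a common start and differ only in a numeric suffix), but also, for example, extra digits or letters padded at the start or in the middle of the email address.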
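A minimal sketch of the "weight" idea, assuming a plain-Python Levenshtein function in place of the `editdistance` package so that it runs without extra installs. The names `levenshtein` and `neighbour_counts`, the threshold of 3 and the sample addresses are illustrative assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def neighbour_counts(emails, threshold=3):
    """Map each unique address to the number of other addresses within `threshold` edits."""
    unique = sorted(set(emails))
    counts = {e: 0 for e in unique}
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:  # each unordered pair is compared once
            if levenshtein(a, b) < threshold:
                counts[a] += 1
                counts[b] += 1
    return counts


# made-up sample addresses, for illustration only
sample = ['abc1@example.com', 'abc12@example.com', 'abc123@example.com',
          'zoe.miller@example.org']
weights = neighbour_counts(sample)
```

Addresses with a high count sit in a dense cluster of near-duplicates. Note that this compares every pair, so it grows quadratically with the number of addresses and would likely need some pre-grouping (by domain, for instance) at 100,000 addresses.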

+0

Thank you! Testing it now – jeangelj

2

You can do this with a regular expression; see the example below:

import re 

a = "[email protected]" 
b = "[email protected]" 
c = "[email protected]" 
d = "[email protected]" 

pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}[email protected]")

if pattern.search(a): 
    print("spam1") 

if pattern.search(b): 
    print("spam2") 

if pattern.search(c): 
    print("spam3") 

if pattern.search(d): 
    print("spam4") 

If you run the code, you will see:

$ python spam.py 
spam1 
spam2 
spam3 
spam4 

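The benefit of this approach is that it is standardized (regular expressions), and you can easily tune the strictness of the match by adjusting the values inside the {}; this means you can keep a global config file where you set and adjust those values. You can also easily tweak the regex without having to rewrite code.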
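A rough sketch of the config-driven idea. The `CONFIG` dict, its keys and `build_pattern` are hypothetical names, and the pattern assumes the digit run sits directly before the `@`, which is a guess at the intent of the regex above.

```python
import re

# hypothetical settings; in practice these could be loaded from a config file
CONFIG = {'min_digits': 3, 'max_digits': 500}


def build_pattern(config):
    """Compile a regex for a digit run, an optional dot, more optional digits, then '@'."""
    lo, hi = config['min_digits'], config['max_digits']
    return re.compile(r'[0-9]{%d,%d}\.?[0-9]{0,%d}@' % (lo, hi, hi))


pattern = build_pattern(CONFIG)
# made-up sample addresses, for illustration only
samples = ('abc7020.1@example.com', 'abc12@example.com', 'abc@example.com')
hits = [bool(pattern.search(e)) for e in samples]
```

With `min_digits` set to 3, an address ending in only two digits is not matched; lowering the value makes the rule stricter without touching the code.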

+0

Thank you - I have a user [email protected] who is a legitimate user, but attn12, attn123, attn1234, attn12345 are not, and I only want to catch those – jeangelj

+1

@bbb31 .. or use a CAPTCHA if possible. – pmus

1
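import numpy as np

# local part (the text before the '@') of every address in email_list
ids = [s.split('@')[0] for s in email_list]
# det[i, j] is True when ids[i] and ids[j] differ only by one trailing digit
det = np.zeros((len(ids), len(ids)), dtype=bool)  # np.bool no longer exists in recent NumPy
for i in range(len(ids)):
    for j in range(len(ids)):  # check every ordered pair so the list order does not matter
        mi = ids[i]
        mj = ids[j]
        # mj is mi plus exactly one extra character, and that character is a digit
        if len(mj) == len(mi) + 1 and mj.startswith(mi) and mj[-1].isdigit():
            det[j, i] = True
            det[i, j] = True

# indices of all addresses that take part in at least one such pair
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()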
+0

Thank you! I will test it now – jeangelj

1

My idea for how to tackle this:

fuzzywuzzy

Build a set of unique emails, loop over them, and compare them with fuzzywuzzy. Example:

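import re
from fuzzywuzzy import fuzz

for email in emailset:
    for row in data:
        # local parts (the text before the '@') of the two addresses
        emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0]
        if row['email'] == email:
            continue  # do not compare an address with itself
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80:
            'flagging operation'  # placeholder: put your flagging code here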

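I took some liberties with how the data is represented, but I think the variable names are mnemonic enough for you to see what I am getting at. This is a very rough piece of code, since I haven't worked out how to stop duplicate flagging.

Anyway, the elif part compares the two email addresses without the @gmail.com (or any other provider, e.g. @yahoo.com), and if the ratio is above 80 (play around with this number) it runs your flagging operation. For example: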

fuzz.partial_ratio("abc7020.1", "abc7020") 
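One possible way around the duplicate-flagging problem is to visit each unordered pair only once and collect the hits in a set. The sketch below assumes `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.partial_ratio`; the two scores are not identical, so the threshold of 80 would need re-tuning. The function name and sample addresses are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_local_parts(emails, threshold=80):
    """Return every address whose local part scores above `threshold`
    (on a 0-100 scale) against the local part of some other address."""
    unique = sorted(set(emails))
    flagged = set()
    for a, b in combinations(unique, 2):  # each unordered pair exactly once
        local_a = a.split('@')[0]
        local_b = b.split('@')[0]
        score = 100 * SequenceMatcher(None, local_a, local_b).ratio()
        if score > threshold:
            flagged.update((a, b))  # a set cannot hold the same address twice
    return flagged


# made-up sample addresses, for illustration only
sample = ['abc7020.1@example.com', 'abc7020@example.com', 'john.smith@example.com']
flagged = similar_local_parts(sample)
```

Because the result is a set, an address that is similar to several others still appears only once.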

+0

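Hi! Don't worry, I wouldn't fault someone for taking the time to help me. Thank you very much for this idea - I will definitely try it; it might actually be a very good solution, I just want to make sure it doesn't flag legitimate emails as spam – jeangelj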

1

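Here is an approach that should be quite efficient. We group the email addresses by length, so that we only have to check whether each address matches one at the next level down, which is done with slicing and a set-membership check.

The code:

First, read in the data: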

import pandas as pd 
import numpy as np 

string = ''' 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
''' 

x = pd.DataFrame({'x':string.split()}) 
#remove duplicates: 
x = x[~x.x.duplicated()] 

We strip the @foo.bar part, then keep only the names that end in a digit, and add a 'lengths' column:

#split on @, expand means into two columns
emails = x.x.str.split('@', expand = True)
#filter rows: keep those whose local part ends in a digit
emails = emails.loc[emails.loc[:,0].str[-1].str.isdigit(), :].copy()
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()

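Now all we have to do is take each length and length - 1, and check whether a name of the longer length, with its last character dropped, appears in the set of names of length n - 1 (and we also check the reverse, in case a name is the shortest of its group of duplicates):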

#unique lengths to check 
lengths = emails.lengths.unique() 
#mask to hold results 
mask = pd.Series([0]*len(emails), index = emails.index) 

#for each length 
for j in lengths: 
    #we subset those of that length 
    totest = emails['lengths'] == j 
    #and those who might be the shorter version 
    against = emails['lengths'] == j -1 
    #we make a set of unique values, for a hashed lookup 
    againstset = set([i for i in emails.loc[against,0]]) 
    #we cut off the last char of each in to test 
    tests = emails.loc[totest,0].str[:-1] 
    #we check matches, by checking the set 
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0) 
    #viceversa, otherwise we miss the smallest one in the group 
    againstset = set([i for i in emails.loc[totest,0].str[:-1]]) 
    tests = emails.loc[against,0] 
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0) 

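The resulting mask can be converted to boolean and used to subset the original (de-duplicated) data frame; its index should match the original index, so we subset like this: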

x.loc[~mask.astype(bool),:] 
    x 
0 [email protected] 
16 [email protected] 
17 [email protected] 

You can see that we did not remove your first value, since the "." means it doesn't match - you could strip the punctuation first.
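For readers who want the core idea without pandas, here is a simplified plain-Python sketch of the same slice-and-set-lookup check. It is an approximation rather than a port of the code above: with a single set of candidates the explicit grouping by length is not needed, since a name can only match a name that is exactly one character shorter. The function name and the sample addresses are assumptions.

```python
def flag_suffix_chains(emails):
    """Flag addresses whose local part ends in a digit and, with the last
    character dropped, equals another digit-ending local part (and flag
    that shorter partner too)."""
    local = {e: e.split('@')[0] for e in set(emails)}
    # only names ending in a digit are candidates, as in the filter step above
    candidates = {name for name in local.values() if name and name[-1].isdigit()}
    flagged_names = set()
    for name in candidates:
        shorter = name[:-1]
        if shorter in candidates:  # hashed lookup instead of a pairwise scan
            flagged_names.update((name, shorter))
    return sorted(e for e, name in local.items() if name in flagged_names)


# made-up sample addresses, for illustration only
sample = ['abc@example.com', 'abc1@example.com', 'abc12@example.com',
          'abc123@example.com', 'dave9@example.com']
flagged = flag_suffix_chains(sample)
```

As in the pandas version, a base name without a trailing digit (`abc` here) is left unflagged, and so is a lone numbered address such as `dave9`.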