Python: data cleaning - detecting patterns of fraudulent email addresses

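I am cleaning a data set by removing fraudulent email addresses.

I have built several rules that catch duplicates and fraudulent domains. But there is one scenario where I don't know how to write a rule in Python to flag the addresses.

So I have rules like these, for example: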

import string
import numpy as np

#delete punctuation (note: string.punctuation also contains '@' and '.')
df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))  

#flag yopmail 
pattern = "yopmail" 
match = df['email'].str.contains(pattern) 
df['yopmail'] = np.where(match, 'Y', '0') 

#flag duplicates 
df['duplicate']=df.email.duplicated(keep=False)

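Here is the data I cannot come up with a rule to catch. Basically, I am looking for a way to flag addresses that start the same way but end in consecutive numbers.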

[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected]

Source

2017-06-16 jeangelj

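To clarify - which of the email addresses in the example you provided would be flagged as fraudulent? – Nicarus

All of these examples are fraudulent – jeangelj

So [email protected] would be fine, but things like abc1, abc12 ... are all fraudulent? Would these only be fraudulent if [email protected] exists? – Nicarus

My solution is neither efficient nor pretty. But check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!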

import os 
from random import shuffle 
from difflib import SequenceMatcher 

emails = [... ...] # for example the 16 email addresses you gave in your question 
shuffle(emails) # everyday i'm shuffling (input order does not matter) 
emails = sorted(emails) # sort so that similar addresses end up next to each other 
names = [email.split('@')[0] for email in emails] 

T = 0.7 # <- set your string similarity threshold here!! 

split_indices = [] 
for i in range(1, len(emails)): 
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T: 
        split_indices.append(i) # we want to remember where a dissimilar email address occurs 

# cut the sorted list at every split index, one slice per group 
grouped = [] 
start = 0 
for i in split_indices: 
    grouped.append(emails[start:i]) 
    start = i 
grouped.append(emails[start:]) 
# now we have similar email addresses grouped, we want to find the common prefix for each group 
# (taken over the part before the '@', so a group of one keeps its own name) 
prefix_strings = [] 
for group in grouped: 
    prefix_strings.append(os.path.commonprefix([e.split('@')[0] for e in group])) 

# finally: an address whose name equals its group's common prefix is ham, the rest is spam 
ham = [] 
spam = [] 
true_ids = [names.index(p) for p in prefix_strings if p in names] 
for i in range(len(emails)): 
    if i in true_ids: 
        ham.append(emails[i]) 
    else: 
        spam.append(emails[i]) 

In [30]: ham 
Out[30]: ['[email protected]', '[email protected]'] 

In [31]: spam 
Out[31]: 
['[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]', 
'[email protected]'] 

# THE TRUTH YALL!

Source

2017-07-15 01:08:28 Blue482

First, take a look at the regular-expression question here

Second, try filtering the email addresses like this:

import re 

# Let's say the email is '[email protected]' 
email = '[email protected]' 
# take the part before the '@' 
email_name = email.split('@', maxsplit=1)[0] 
# Here you get email_name = 'attn1234' 
m = re.search(r'\d+$', email_name) 
# if the string ends in digits m will be a Match object, or None otherwise. 
if m is None: 
    print('%s is good' % email) 
else: 
    print('%s is BAD' % email)
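A trailing-digit check on its own treats every numbered address alike. As a minimal sketch (not part of the answer above), one could instead group addresses by what remains once the trailing digits are removed and flag only stems that occur several times. The function names, the `min_group` parameter and the example.com addresses are assumptions made for illustration.

```python
import re
from collections import defaultdict

# local part = stem + optional trailing digits, followed by '@' and the domain
TRAILING_DIGITS = re.compile(r'^(?P<stem>.*?)(?P<digits>\d*)@(?P<domain>.+)$')

def group_by_stem(emails):
    """Group addresses by (stem, domain), ignoring any trailing digits."""
    groups = defaultdict(list)
    for email in emails:
        m = TRAILING_DIGITS.match(email)
        if m:  # strings without an '@' are skipped
            groups[(m.group('stem'), m.group('domain'))].append(email)
    return groups

def flag_numbered_runs(emails, min_group=3):
    """Return the addresses whose stem is shared by at least min_group addresses."""
    flagged = set()
    for members in group_by_stem(emails).values():
        if len(members) >= min_group:
            flagged.update(members)
    return flagged

# hypothetical sample data
emails = ['attn1@example.com', 'attn12@example.com', 'attn123@example.com',
          'bob7@example.com', 'carol@example.com']
print(sorted(flag_numbered_runs(emails)))
```

With this sample, only the three attn addresses are flagged; bob7 is left alone because its stem occurs once. The sketch cannot tell which member of a flagged group, if any, is the legitimate original, so that decision would still need a separate rule.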

Source

2017-07-13 22:33:46 pmus

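Thanks! I actually looked at that regex question before posting mine; I did not feel the answers were flexible enough for all the variations I am facing, but thank you very much for pointing to it; I will test your solution now; how would this distinguish a legitimate email such as [email protected] from attn12, attn123, attn1234, etc.? – jeangelj

How many unique addresses are you dealing with? If not too many, I think you could combine the two approaches: define a list() of the (unique) emails already in use and, once the regex has caught the ones ending in digits, check them against it the way @dman suggests. Still, [email protected] is a problem, because [email protected] looks valid too. – pmus

About 100,000 emails at a time, and different ones every month - I don't think it is possible to define a list if, as I understand it, one would have to be applied every month – jeangelj

You could use edit distance (a.k.a. Levenshtein distance) with a difference threshold. In Python: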

$ pip install editdistance 
$ ipython2 
>>> import editdistance 
>>> threshold = 5 # This could be anything, really 
>>> data = ["[email protected]", ...] # set up data to be the set you gave 
>>> # compare each address with every *other* address (an address is always at distance 0 from itself) 
>>> fraudulent_emails = set(email for email in data for other in data if email != other and editdistance.eval(email, other) < threshold)

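If you want to be smarter about it, you could go through the resulting list instead of turning it into a set, and keep track of how many other email addresses each address is close to, then use that count as a "weight" for deciding which ones are fake.

This catches not only the given case (where the fraudulent addresses all share a common start and differ only in a numeric suffix) but also cases where extra digits or letters are padded in at, for example, the start or the middle of the email address.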
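The "weight" idea can be sketched as follows. This is only an illustration under stated assumptions: it uses a small pure-Python Levenshtein function in place of the editdistance package so that it runs without extra installs, and the function names, the threshold of 3 and the sample addresses are all made up for the example.

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def neighbour_counts(emails, threshold=3):
    """For each address, count how many other addresses are within threshold edits."""
    counts = {email: 0 for email in emails}
    for a, b in combinations(emails, 2):  # every unordered pair exactly once
        if levenshtein(a, b) < threshold:
            counts[a] += 1
            counts[b] += 1
    return counts

# hypothetical sample data (assumed to contain no exact duplicates)
emails = ['abc1@example.com', 'abc12@example.com', 'abc123@example.com',
          'zoe.smith@example.org']
weights = neighbour_counts(emails)
print(weights)
```

Each abc address ends up with a weight of 2 and the unrelated address with 0, so a cut-off on the weight separates them. The pairwise loop is quadratic in the number of addresses, which may be too slow for batches of around 100,000 without first bucketing the data (for instance by domain or by a short prefix).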

Source

2017-07-13 22:40:03

Thanks! Testing now – jeangelj

You can do this with a regular expression; see the example below:

import re 

a = "[email protected]" 
b = "[email protected]" 
c = "[email protected]" 
d = "[email protected]" 

# raw string, so that \. reaches the regex engine unchanged 
pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}[email protected]") 

if pattern.search(a): 
    print("spam1") 

if pattern.search(b): 
    print("spam2") 

if pattern.search(c): 
    print("spam3") 

if pattern.search(d): 
    print("spam4")

If you run the code, you will see:

$ python spam.py 
spam1 
spam2 
spam3 
spam4

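The benefit of this approach is that it is standardized (regular expressions), and you can easily tune how strict the match is by adjusting the values inside the {}; this means you could keep a global configuration file in which you set and adjust those values. You can also tweak the regex easily without rewriting the code.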
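Driving the quantifier from configuration could look roughly like this. It is a simplified sketch, not the pattern from the answer: it only looks for a run of digits directly before the '@', and the `CONFIG` dictionary, its keys, `build_pattern` and the example.com addresses are hypothetical names chosen for the illustration.

```python
import re

# Hypothetical settings; in practice these could be loaded from a config file.
CONFIG = {'min_digits': 3, 'max_digits': 500}

def build_pattern(config):
    """Compile a regex that matches a run of digits directly before the '@'."""
    return re.compile(r'[0-9]{%d,%d}@' % (config['min_digits'], config['max_digits']))

pattern = build_pattern(CONFIG)
for address in ['user2017@example.com', 'user1@example.com', 'user@example.com']:
    # True when the local part ends in at least min_digits digits
    print(address, bool(pattern.search(address)))
```

Only the first address matches with these settings; lowering `min_digits` to 1 would make the second one match as well, without touching the code that applies the pattern.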

Source

2017-07-13 23:06:28 user1529891

Thanks - I have a user [email protected] who is a legitimate user, but attn12, attn123, attn1234, attn12345 are not, and I only want to catch those – jeangelj

@bbb31 .. or use a CAPTCHA if possible. – pmus

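import numpy as np 

# local parts (everything before the '@') 
ids = [s.split('@')[0] for s in email_list] 
det = np.zeros((len(ids), len(ids)), dtype=bool)  # np.bool was removed from NumPy; use bool 
for i in range(len(ids)): 
    for j in range(i + 1, len(ids)): 
        # order the pair by length so the check does not depend on list order 
        mi, mj = sorted((ids[i], ids[j]), key=len) 
        # flag the pair if the longer id is the shorter id plus one trailing digit 
        if len(mj) == len(mi) + 1 and mj.startswith(mi) and mj[-1].isdecimal(): 
            det[j, i] = True 
            det[i, j] = True 

# indices of all addresses that take part in at least one flagged pair 
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()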

Source

2017-07-13 23:11:51

Thanks! I will test it now – jeangelj

My idea on how to solve this:

fuzzywuzzy

Create a set of unique emails, loop over them with a for loop, and compare them using fuzzywuzzy. Example:

import re 
from fuzzywuzzy import fuzz 

for email in emailset: 
    # part of the address before the '@' 
    emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0] 
    for row in data: 
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0] 
        if row['email'] == email: 
            continue  # do not compare an address with itself 
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80: 
            'flagging operation'  # placeholder: put your flagging code here 

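I took some liberties with how the data is represented, but I think the variable names are mnemonic enough for you to see what I am getting at. This is a very rough piece of code, because I have not thought about how to stop addresses from being flagged more than once.

Anyway, the elif part compares the two email addresses without the @gmail.com (or any other domain, e.g. @yahoo.com), and if the ratio is above 80 (play around with this number) it runs your flagging operation. For example: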

fuzz.partial_ratio("abc7020.1", "abc7020")
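One way to avoid flagging the same address repeatedly is to compare each unordered pair once and collect the results in a set. The sketch below is an assumption-laden illustration: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy (its `ratio()` is on a 0 to 1 scale and is not the same measure as `fuzz.partial_ratio`), and the function names, threshold and addresses are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def local_part(email):
    """Everything before the last '@'."""
    return email.rsplit('@', 1)[0]

def similar_addresses(emails, threshold=0.8):
    """Return the set of addresses whose local part resembles another one."""
    flagged = set()
    unique = sorted(set(emails))          # drop exact duplicates first
    for a, b in combinations(unique, 2):  # each pair is compared only once
        ratio = SequenceMatcher(None, local_part(a), local_part(b)).ratio()
        if ratio > threshold:
            flagged.update((a, b))        # a set cannot hold an address twice
    return flagged

# hypothetical sample data, including one exact duplicate
emails = ['abc7020@example.com', 'abc70201@example.com',
          'abc7020@example.com', 'xyz@example.com']
print(sorted(similar_addresses(emails)))
```

Here the two abc addresses are returned once each and xyz is not flagged. The threshold needs tuning on real data, since short legitimate names that happen to be similar would also exceed it.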

Source

2017-07-14 04:25:58 dman

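Hi! No worries, I would never fault someone for taking the time to help me. Thank you very much for this idea - I will definitely try it; it might actually be a good solution, I just want to make sure it does not flag legitimate emails as spam – jeangelj

Here is an approach that should be very efficient. We group the email addresses by length, so that we only need to check whether each address matches one from the next length down, which is done by slicing and a set membership check.

The code:

First, read in the data: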

import pandas as pd 
import numpy as np 

string = ''' 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
[email protected] 
''' 

x = pd.DataFrame({'x':string.split()}) 
#remove duplicates: 
x = x[~x.x.duplicated()]

We strip off the @foo.bar part, then filter down to only those that end in a digit, and add a 'lengths' column:

#split on @, expand means into two columns 
emails = x.x.str.split('@', expand = True) 
#filter rows where the last character of the name is a digit 
#(the mask selects rows, not columns; copy so a column can be added safely) 
emails = emails.loc[emails.loc[:,0].str[-1].str.isdigit(),:].copy() 
#add a length of email column for the next step 
emails['lengths'] = emails.loc[:,0].str.len()

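Now, all we have to do is take each length n together with length n-1, and see whether an address of length n, with its last character dropped, appears in the set of addresses of length n-1 (and we also check the reverse, in case it is the shortest of the repeats):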

#unique lengths to check 
lengths = emails.lengths.unique() 
#mask to hold results 
mask = pd.Series([0]*len(emails), index = emails.index) 

#for each length 
for j in lengths: 
    #we subset those of that length 
    totest = emails['lengths'] == j 
    #and those who might be the shorter version 
    against = emails['lengths'] == j -1 
    #we make a set of unique values, for a hashed lookup 
    againstset = set([i for i in emails.loc[against,0]]) 
    #we cut off the last char of each in to test 
    tests = emails.loc[totest,0].str[:-1] 
    #we check matches, by checking the set 
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0) 
    #viceversa, otherwise we miss the smallest one in the group 
    againstset = set([i for i in emails.loc[totest,0].str[:-1]]) 
    tests = emails.loc[against,0] 
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)

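The resulting mask can be converted to boolean and used to subset the original (deduplicated) data frame; its index should match the original index, so we subset like this: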

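#rows filtered out earlier are missing from the mask, so treat them as not flagged 
x.loc[~mask.reindex(x.index, fill_value=0).astype(bool),:] 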
    x 
0 [email protected] 
16 [email protected] 
17 [email protected]

You can see that we did not remove your first value, because the "." means it did not match - you could remove the punctuation first.
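The core of this approach, dropping the last character and looking the remainder up in a set, can also be sketched without pandas. This is a simplified illustration rather than the answer's code: with a plain hash set the grouping by length is not needed, domains are ignored just as above, and `flag_digit_chains` and the sample addresses are hypothetical.

```python
def flag_digit_chains(emails):
    """Flag local parts that equal another digit-ending local part plus one
    extra character, together with the shorter local part they extend."""
    names = {e.split('@')[0] for e in emails}
    # keep only local parts that end in a digit, as in the filtering step above
    names = {n for n in names if n[-1:].isdigit()}
    flagged = set()
    for name in names:
        shorter = name[:-1]       # drop the last character
        if shorter in names:      # hashed lookup, O(1) on average
            flagged.update((name, shorter))
    return flagged

# hypothetical sample data
emails = ['abc7020@example.com', 'abc70201@example.com', 'abc702012@example.com',
          'dana42@example.com', 'erin@example.com']
print(sorted(flag_digit_chains(emails)))
```

The three abc names are flagged as a chain, while dana42 is kept because no name in the data equals it with one character removed. The work is a single pass over the unique names, so it should scale to batches of the size mentioned in the comments.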

Source

2017-07-15 02:59:52 jeremycg
