2017-10-18 75 views

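Fuzzy matching: returning the best potential value for a test string

I am trying to use fuzzy matching to align a list of responses with a validation set.

I am using the code below: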

from fuzzywuzzy import process

for x in rawDatabase.Status:
    choice = process.extractOne(x, my_list)
    print('choice ', choice)

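Here, the Status column of the rawDatabase DataFrame is the column I am trying to validate, and my_list is the list of standardized values that the entries in the Status column should be mapped to.

Using the code above, I get the following sample output: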

choice ('TRANSFER IN FROM GOVERNMENT DEPARTMENT', 100, 39) 
choice ('TRANSFER OUT TO GOVERNMENT DEPARTMENT', 100, 40) 
choice ('CURRENT', 100, 1) 
choice ('LEAVER - RETIRED', 100, 12) 
choice ('CURRENT', 100, 1) 

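Is there a way to get back only the best-matching value for the string being tested, and to update the value in the rawDatabase Status column with it? So, for example, I would get back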

choice = 'TRANSFER IN FROM GOVERNMENT DEPARTMENT' 
choice = 'TRANSFER OUT TO GOVERNMENT DEPARTMENT' 
choice = 'CURRENT' 
choice = 'LEAVER - RETIRED' 
choice = 'CURRENT' 

Possible duplicate of "Fuzzy string comparison in Python, confused with which library to use" (https://stackoverflow.com/questions/6690739/fuzzy-string-comparison-in-python-with-library-to-use) – Jan


Use 'Levenshtein' distance or 'difflib'. – Jan
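For the difflib route mentioned in the comment above, a minimal standard-library sketch might look like the following. The helper name `closest`, the `cutoff` value, and the sample lists are assumptions made for illustration only. Note that difflib's similarity ratio is on a 0 to 1 scale, is case-sensitive, and is not the same score that `process.extractOne` reports.

```python
from difflib import get_close_matches


def closest(value, choices, cutoff=0.6):
    """Return the choice most similar to value, or value itself if nothing is close enough.

    `closest` and `cutoff` are hypothetical names chosen for this sketch.
    """
    hits = get_close_matches(value, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else value


# Sample data made up for illustration
choices = ["current", "loan", "transfer"]
responses = ["curent", "loaning", "current", "loan", "currret"]

cleaned = [closest(r, choices) for r in responses]
print(cleaned)  # ['current', 'loan', 'current', 'loan', 'current']
```

Values with no candidate above the cutoff are returned unchanged, so they can be reviewed by hand instead of being forced onto the nearest label.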

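Answer

Modify your code to keep only the matched string (element [0] of the result) and collect the values into a new column: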

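from fuzzywuzzy import process

l1 = []
for x in rawDatabase.Status:
    # extractOne returns a tuple; element [0] is the matched string
    choice = process.extractOne(x, my_list)[0]
    l1.append(choice)
# store the standardized values in a new column
rawDatabase['choice'] = l1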
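The `[0]` index works because each result is a tuple whose first element is the matched string and whose second is the score; with dict-like choices such as a pandas Series there is a third element, the key, which is the extra number in the question's output. As a rough sketch, a small helper can also guard against weak matches. The names `matched_value` and `min_score`, the threshold of 90, the raw inputs, and the low-scoring row are all assumptions for illustration; the result tuples are hard-coded here so the snippet runs without fuzzywuzzy installed.

```python
def matched_value(result, original, min_score=90):
    """Pick the matched string out of an extractOne-style result.

    `result` is expected to be None, (match, score) or (match, score, key).
    Falls back to `original` when there is no result or the score is too low.
    """
    if result is None:
        return original
    match, score = result[0], result[1]
    return match if score >= min_score else original


# First two tuples are copied from the question's output;
# the third (low score) and the None are made up for illustration.
results = [
    ('TRANSFER IN FROM GOVERNMENT DEPARTMENT', 100, 39),
    ('CURRENT', 100, 1),
    ('LEAVER - RETIRED', 62, 12),
    None,
]
# Hypothetical raw inputs that produced the results above
originals = ['transfer in from government department', 'current',
             'LEAVER RETD', 'UNKNOWN STATUS']

cleaned = [matched_value(r, o) for r, o in zip(results, originals)]
print(cleaned)
```

Leaving low-scoring rows untouched is one possible design choice; another is to write a sentinel value so that unmatched rows are easy to filter afterwards.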

A further example:

from fuzzywuzzy import process

a = []
for x in df.response:
    # limit=1 returns a one-element list of result tuples; [0][0] is the matched string
    a.append(process.extract(x, val.validate, limit=1)[0][0])
df['response2'] = a
df

Out[867]:
   id  colour response response2
0   1    blue   curent   current
1   2     red  loaning      loan
2   3  yellow  current   current
3   4   green     loan      loan
4   5     red  currret   current
5   6   green     loan      loan

Input data:

df:

id  colour  response
1   blue    curent
2   red     loaning
3   yellow  current
4   green   loan
5   red     currret
6   green   loan

val:

validate
current
loan
transfer

This work should not go to waste! ;) – MaxU


@MaxU Thanks, dude~ :) – Wen