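I have a multidimensional array that I am trying to feed into difflib.get_close_matches(). My array looks like this: array[(ORIGINAL, FILTERED)]. ORIGINAL is a string, and FILTERED is the ORIGINAL string with common words filtered out.

Currently I create a new array containing only the FILTERED words and feed that into difflib.get_close_matches(). I then try to match the difflib results back against array[(ORIGINAL, FILTERED)]. My problem is that I often have two or more FILTERED words that are identical, so they cannot be matched back with this method.

Is there a way to feed the whole array[(ORIGINAL, FILTERED)] into difflib, but have it look only at the FILTERED part (while still returning [(ORIGINAL, FILTERED)])?

Thanks in advance!
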
import time
import csv
import difflib
import sys
import os.path
import datetime

### Filters out common words in an attempt to get better results ###
def ignoredWords(word):
    filtered = word.lower()

    #Common Full Words
    ## Majority of filters were edited out

    #Common Abbreviations
    if "univ" in filtered:
        filtered = filtered.replace("univ", "")

    #Special Characters
    if "  " in filtered:  #Two White Spaces
        filtered = filtered.replace("  ", " ")
    if "-" in filtered:
        filtered = filtered.replace("-", " ")
    if "\'" in filtered:
        filtered = filtered.replace("\'", " ")
    if " & " in filtered:
        filtered = filtered.replace(" &", "")
    if "(\"" in filtered:
        filtered = filtered.replace("(\"", "")
    if "\")" in filtered:
        filtered = filtered.replace("\")", "")
    if "\t" in filtered:
        filtered = filtered.replace("\t", " ")
    return filtered

### Takes in a list, then outputs a 2D list. array[Original, Filtered] ###
### For XXX: array[Original, Filtered, Account Number, Code] ###
def create2DArray(rows):
    array = []
    for item in rows:
        clean = ignoredWords(item[2])
        entry = (item[2].lower(), clean, item[0], item[1])
        array.append(entry)
    return array

def main(argv):
    if len(argv) < 3:
        print "Not enough parameters. Please enter two file names"
        sys.exit(2)
    elif not os.path.isfile(argv[1]):
        print "%s is not found" % (argv[1])
        sys.exit(2)
    elif not os.path.isfile(argv[2]):
        print "%s is not found" % (argv[2])
        sys.exit(2)

    #Recode File ----- Not yet implemented
    # if(len(argv) == 4):
    #     if(not os.path.isfile(argv[3])):
    #         print "%s is not found" %(argv[3])
    #         sys.exit(2)
    #
    #     recode = open(argv[1], 'r')
    #     try:
    #         setRecode = c.readlines()
    #     finally:
    #         recode.close()
    #     setRecode.sort()
    #     print setRecode[0]

    #Measure execution time
    t0 = time.time()

    cReader = csv.reader(open(argv[1], 'rb'), delimiter='|')
    try:
        setC = []
        for row in cReader:
            setC.append(row)
    finally:
        setC.sort()

    aReader = csv.reader(open(argv[2], 'rb'), delimiter='|')
    try:
        setA = []
        for row in aReader:
            setA.append(row)
    finally:
        setA.sort()

    #Put Set A and Set C into their own 2-dimensional arrays: array[Original Word][Cleaned Up Word]
    arrayC = create2DArray(setC)
    arrayA = create2DArray(setA)

    #Create clean list versions for use with difflib
    cleanListC = []
    for item in arrayC:
        cleanListC.append(item[1])
    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])

    ############OUTPUT FILENAME############
    fMatch75 = open("Match75.csv", 'w')
    Match75 = csv.writer(fMatch75, dialect='excel')
    try:
        header = "Fuzzy Matching Report. Generated: "
        header += str(datetime.date.today())
        Match75.writerow([header])
        Match75.writerow(['C', 'A', 'C Cleaned', 'A Cleaned',
                          'C Account', 'C Group', 'A Account', 'A Group',
                          'Filtered Ratio %', 'Unfiltered Ratio %', 'Average Ratio %'])

        for item in cleanListC:
            match = difflib.get_close_matches(item, cleanListA, 1, 0.75)
            if len(match) > 0:
                filteredratio = difflib.SequenceMatcher(None, item, match[0]).ratio()
                strfilteredratio = '%.2f' % (filteredratio*100)
                found = 0
                for group in arrayA:
                    if match[0] == group[1]:
                        origA = group[0]
                        acode = group[3]
                        aaccount = group[2]
                        found = found + 1
                for group in arrayC:
                    if item == group[1]:
                        origC = group[0]
                        ccode = group[3]
                        caccount = group[2]
                        found = found + 2
                if found == 3:
                    unfilteredratio = difflib.SequenceMatcher(None, origC, origA).ratio()
                    strunfilteredratio = '%.2f' % (unfilteredratio*100)
                    averageratio = (filteredratio+unfilteredratio)/2
                    straverageratio = '%.2f' % (averageratio*100)
                    row = [origC.rstrip(), origA.rstrip(), item.rstrip(), match[0].rstrip(),
                           caccount, ccode, aaccount, acode,
                           strfilteredratio, strunfilteredratio, straverageratio]
                    Match75.writerow(row)
                #These Else Ifs are for debugging. If NULL is found anywhere in the CSV, then an error has occurred
                elif found == 2:
                    row = [origC.rstrip(), "NULL", item.rstrip(), match[0].rstrip(),
                           caccount, ccode, "NULL", "NULL",
                           strfilteredratio, "NULL", "NULL"]
                    Match75.writerow(row)
                elif found == 1:
                    row = ["NULL", origA.rstrip(), item.rstrip(), match[0].rstrip(),
                           "NULL", "NULL", aaccount, acode,
                           strfilteredratio, "NULL", "NULL"]
                    Match75.writerow(row)
                else:
                    row = ["NULL", "NULL", item.rstrip(), match[0].rstrip(),
                           "NULL", "NULL", "NULL", "NULL",
                           strfilteredratio, "NULL", "NULL"]
                    Match75.writerow(row)
    finally:
        Match75.writerow(["A Proprietary and Confidential. Do Not Distribute"])
        fMatch75.close()

    print (time.time()-t0,"seconds")

if __name__ == "__main__":
    main(argv=sys.argv)

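What I am trying to accomplish:
- Read the input files.
- Filter common words out of the names so that fuzzy matching ('difflib.get_close_matches()') returns more accurate results.
- Compare the names from FileA against the names in FileB to find the most likely matches.
- Print the original (unfiltered) names and the match percentage.

Why this is difficult
The naming conventions used in the two input files vary significantly. Some names are partially abbreviated (e.g. File A: Acme Company; File B: Acme Co). Because the naming conventions are inconsistent, I can't simply do 'FileA.intersect(FileB)', which would be the ideal approach.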
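
To make the abbreviation problem concrete, here is a minimal sketch (written for Python 3, unlike the Python 2 script above; the two lists are made-up sample data, not the real input files). It shows that an exact set intersection finds nothing for the example names, while difflib.SequenceMatcher still reports a fairly high similarity:

```python
import difflib

# Hypothetical sample names, lower-cased the same way the script does.
file_a = ["acme company"]
file_b = ["acme co"]

# Exact matching: no name appears in both lists, so the intersection is empty.
exact = set(file_a) & set(file_b)

# Fuzzy matching: ratio() is 2*M/T, where M is the number of matching
# characters and T is the combined length of both strings.
ratio = difflib.SequenceMatcher(None, file_a[0], file_b[0]).ratio()

print(exact)
print(round(ratio, 4))
```

For this pair the ratio works out to 14/19 (about 0.74), which is just under the 0.75 cutoff passed to get_close_matches() in the script, so a pair like this would need either filtering or a lower cutoff to be reported.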

Where the modification should occur
for item in cleanListC:
    match = difflib.get_close_matches(item, cleanListA, 1, 0.75)

cleanListA is created as follows, which loses the (ORIGINAL, FILTERED) pairing:
cleanListA = []
for item in arrayA:
    cleanListA.append(item[1])

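End goal
I want to feed arrayA into difflib.get_close_matches() instead of cleanListA, so that the (ORIGINAL, FILTERED) pairing is preserved. difflib.get_close_matches() would look only at the FILTERED part of each pair when determining close matches, but would return the whole pair.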
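
One possible way to get this behaviour, sketched under assumptions: as far as I know get_close_matches() has no key-style argument, so the sketch groups the pairs by their FILTERED value first and maps each matched string back to every pair that shares it. The helper name close_pairs and the sample data are hypothetical, and the code targets Python 3 rather than the Python 2 used in the script above.

```python
import difflib
from collections import defaultdict


def close_pairs(word, pairs, n=1, cutoff=0.75):
    """Match `word` against the FILTERED part (index 1) of each pair
    and return the whole pairs, including duplicates of a FILTERED value."""
    by_filtered = defaultdict(list)
    for pair in pairs:
        by_filtered[pair[1]].append(pair)

    # difflib only ever sees the distinct FILTERED strings.
    hits = difflib.get_close_matches(word, list(by_filtered), n, cutoff)

    # Map every matched FILTERED string back to all pairs that carry it.
    return [pair for hit in hits for pair in by_filtered[hit]]


# Hypothetical (ORIGINAL, FILTERED) pairs; two of them share a FILTERED value.
sample_a = [
    ("acme corp.", "acme"),
    ("acme inc", "acme"),
    ("globex corp.", "globex"),
]

print(close_pairs("acme", sample_a))
```

Because the lookup uses only index 1, the same helper should also work with the four-element tuples built by create2DArray. When several ORIGINAL names share one FILTERED value, all of them are returned, so the caller can decide between them, for example by comparing the unfiltered ratios.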
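
@MikeKusold What do you mean by the name "array"? In Python, I understand it as this: (http://docs.python.org/library/array.html#module-array) – eyquem 2011-05-04 15:21:22
I'm using array[ORIGINAL, FILTERED] as a way to describe the variable clearly. You could just as easily substitute [(Original, Filtered)]. – MikeKusold 2011-05-04 15:28:02
We need you to be more specific, please. Is it an 'array.array' object? Or is it actually a "list"? Or by "multidimensional array" do you actually mean a "dict"? These are all different things in Python. Also, please show us what you have tried (with code!). The more detail the better! – jathanism 2011-05-04 15:33:00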