2011-05-04 66 views

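Feeding a list into a function

I have a multidimensional array that I am trying to feed into difflib.get_close_matches(). My array looks like this: array[(ORIGINAL, FILTERED)]. ORIGINAL is a string, and FILTERED is the ORIGINAL string with common words filtered out.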

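Currently I create a new array containing only the FILTERED words and feed that into difflib.get_close_matches(). I then try to match the result from difflib back to array[(ORIGINAL, FILTERED)]. My problem is that I frequently have two or more FILTERED words that are identical, so they cannot be matched back with this method.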
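A minimal sketch of the problem, using made-up data (the names and the `entries` list are hypothetical and only stand in for the real files): once two entries share the same FILTERED text, a lookup by that text can no longer tell them apart.

```python
import difflib

# Hypothetical (ORIGINAL, FILTERED) pairs; the first two share one FILTERED text.
entries = [
    ("acme univ", "acme"),
    ("acme univ.", "acme"),
    ("zenith college", "zenith college"),
]

# Only the FILTERED strings are handed to difflib, as in the script below.
filtered_only = [filtered for _, filtered in entries]
match = difflib.get_close_matches("acme", filtered_only, 1, 0.75)

# Mapping the result back by FILTERED text is ambiguous:
# every entry with that text is a candidate.
candidates = [original for original, filtered in entries if filtered == match[0]]
print(candidates)
```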

Is there a way I can feed the entire array[(ORIGINAL, FILTERED)] into difflib, but have it look only at the FILTERED part (while still returning [(ORIGINAL, FILTERED)])?

Thanks in advance!

import time 
import csv 
import difflib 
import sys 
import os.path 
import datetime 

### Filters out common words in an attempt to get better results ###
def ignoredWords(word):
    filtered = word.lower()
    # Common full words
    ## Majority of filters were edited out
    # Common abbreviations
    if "univ" in filtered:
        filtered = filtered.replace("univ", "")
    # Special characters
    if "  " in filtered:  # Two white spaces
        filtered = filtered.replace("  ", " ")
    if "-" in filtered:
        filtered = filtered.replace("-", " ")
    if "'" in filtered:
        filtered = filtered.replace("'", " ")
    if " & " in filtered:
        filtered = filtered.replace(" &", "")
    if "(\"" in filtered:
        filtered = filtered.replace("(\"", "")
    if "\")" in filtered:
        filtered = filtered.replace("\")", "")
    if "\t" in filtered:
        filtered = filtered.replace("\t", " ")
    return filtered

### Takes in a list of rows, then outputs a 2D list: array[Original, Filtered] ###
### For XXX: array[Original, Filtered, Account Number, Code] ###
def create2DArray(rows):  # parameter renamed from "list" so it does not shadow the builtin
    array = []
    for item in rows:
        clean = ignoredWords(item[2])
        entry = (item[2].lower(), clean, item[0], item[1])
        array.append(entry)
    return array

def main(argv):
    if len(argv) < 3:
        print("Not enough parameters. Please enter two file names")
        sys.exit(2)
    elif not os.path.isfile(argv[1]):
        print("%s is not found" % argv[1])
        sys.exit(2)
    elif not os.path.isfile(argv[2]):
        print("%s is not found" % argv[2])
        sys.exit(2)
    #Recode File ----- Not yet implemented 
#  if(len(argv) == 4): 
#  if(not os.path.isfile(argv[3])): 
#   print "%s is not found" %(argv[3]) 
#   sys.exit(2) 
#   
#  recode = open(argv[1], 'r') 
#  try: 
#   setRecode = c.readlines() 
#  finally: 
#   recode.close() 
#   setRecode.sort() 
#   print setRecode[0] 
    # Measure execution time
    t0 = time.time()

    # Read both pipe-delimited files into sorted lists of rows.
    # 'rb' is the Python 2 convention for the csv module; "with" closes the files.
    with open(argv[1], 'rb') as fC:
        setC = sorted(csv.reader(fC, delimiter='|'))

    with open(argv[2], 'rb') as fA:
        setA = sorted(csv.reader(fA, delimiter='|'))

    # Put Set A and Set C into their own two-dimensional arrays: array[Original Word][Cleaned Up Word]
    arrayC = create2DArray(setC)
    arrayA = create2DArray(setA)

    # Create clean list versions for use with difflib
    cleanListC = []
    for item in arrayC:
        cleanListC.append(item[1])

    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])

    ############OUTPUT FILENAME############
    fMatch75 = open("Match75.csv", 'w')
    Match75 = csv.writer(fMatch75, dialect='excel')
    try:
        header = "Fuzzy Matching Report. Generated: "
        header += str(datetime.date.today())
        Match75.writerow([header])
        Match75.writerow(['C','A','C Cleaned','A Cleaned','C Account', 'C Group','A Account', 'A Group', 'Filtered Ratio %','Unfiltered Ratio %','Average Ratio %'])
        for item in cleanListC:
            match = difflib.get_close_matches(item, cleanListA, 1, 0.75)

            if len(match) > 0:
                filteredratio = difflib.SequenceMatcher(None, item, match[0]).ratio()
                strfilteredratio = '%.2f' % (filteredratio*100)
                found = 0
                # Map the cleaned strings back to their full entries. When several
                # entries share the same cleaned text, "found" no longer adds up
                # to 3 and only the last entry seen is kept.
                for group in arrayA:
                    if match[0] == group[1]:
                        origA = group[0]
                        acode = group[3]
                        aaccount = group[2]
                        found = found + 1
                for group in arrayC:
                    if item == group[1]:
                        origC = group[0]
                        ccode = group[3]
                        caccount = group[2]
                        found = found + 2
                if found == 3:
                    unfilteredratio = difflib.SequenceMatcher(None, origC, origA).ratio()
                    strunfilteredratio = '%.2f' % (unfilteredratio*100)
                    averageratio = (filteredratio+unfilteredratio)/2
                    straverageratio = '%.2f' % (averageratio*100)

                    row = [origC.rstrip(),origA.rstrip(),item.rstrip(),match[0].rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
                    Match75.writerow(row)
                # These else-ifs are for debugging. If NULL is found anywhere in the CSV, then an error has occurred
                elif found == 2:
                    row = [origC.rstrip(),"NULL",item.rstrip(),match[0].rstrip(),caccount,ccode,"NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                elif found == 1:
                    row = ["NULL",origA.rstrip(),item.rstrip(),match[0].rstrip(),"NULL","NULL",aaccount,acode,strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                else:
                    # Belongs to the "found" chain: match[0] only exists when a match was returned
                    row = ["NULL","NULL",item.rstrip(),match[0].rstrip(),"NULL","NULL","NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)

    finally:
        Match75.writerow(["A Proprietary and Confidential. Do Not Distribute"])
        fMatch75.close()

    print("%s seconds" % (time.time() - t0))

if __name__ == "__main__": 
    main(argv=sys.argv) 

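What I am trying to achieve:

  1. Read in the input files.
  2. Filter common words out of the names, so that the fuzzy match ('difflib.get_close_matches()') returns more accurate results.
  3. Compare the names from FileA against the names in FileB to find the most likely matches.
  4. Print the original (unfiltered) names and the match percentage.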

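Why this is difficult

The naming conventions used in the two input files vary significantly. Some names are partially abbreviated (e.g. File A: Acme Company; File B: Acme Co). Because the naming conventions are inconsistent, I cannot simply do 'FileA.intersect(FileB)', which would be the ideal way.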
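A small sketch of that point, using the two example names from the paragraph above (the variable names are illustrative only): an exact set intersection finds nothing in common, while a similarity ratio still shows that the two names are close.

```python
import difflib

file_a = ["Acme Company"]  # example name from File A
file_b = ["Acme Co"]       # example name from File B

# Exact matching: the strings differ, so the intersection is empty.
exact = set(file_a) & set(file_b)

# Fuzzy matching: SequenceMatcher.ratio() is a float between 0 and 1.
ratio = difflib.SequenceMatcher(None, file_a[0].lower(), file_b[0].lower()).ratio()

print(exact)
print(round(ratio, 2))
```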

Where the modification should happen

for item in cleanListC:
    match = difflib.get_close_matches(item, cleanListA, 1, 0.75)

cleanListA is created by:

cleanListA = []
for item in arrayA:
    cleanListA.append(item[1])

which loses the (ORIGINAL, FILTERED) pairing.

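End goal

I would like to feed arrayA into difflib.get_close_matches() instead of cleanListA, so that the (ORIGINAL, FILTERED) pairing is preserved. difflib.get_close_matches() would look only at the FILTERED part of each pair when determining close matches, but would return the whole pair.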


@MikeKusold What do you mean by the name "array"? In Python, I understand it as this: (http://docs.python.org/library/array.html#module-array) – eyquem 2011-05-04 15:21:22


I am using array[ORIGINAL, FILTERED] as a way to describe the variable clearly. You can just as easily substitute the word list[(Original, Filtered)]. – MikeKusold 2011-05-04 15:28:02


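We need you to be more specific, please. Is it an 'array.array' object? Or is it actually a 'list'? Or by "multidimensional array" do you actually mean a 'dict'? These are all different things in Python. Also, please show us what you have tried (with code!). The more detail, the better! – jathanism 2011-05-04 15:33:00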

Answers


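Since you are already using SequenceMatcher directly to get the match ratios, the most straightforward change is probably to perform the get_close_matches() operation yourself.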

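Compare the source of get_close_matches() [e.g. around line 737 of http://svn.python.org/view/python/tags/r271/Lib/difflib.py?revision=86833&view=markup]. It returns a list of the n sequences with the highest ratios. Since you only want the single best match, you can keep track of the (original, filtered, ratio) entry with the highest ratio seen so far, instead of using heapq to track the top n as the original function does.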
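As a rough sketch of that idea in a general form: the helper below follows the same filtering steps as get_close_matches() (the cheap real_quick_ratio() and quick_ratio() upper bounds first, then ratio()), but compares only key(item) and returns whole items. The name `closest_by_key`, its `key` parameter, and the sample data are hypothetical and not part of difflib.

```python
import difflib
import heapq

def closest_by_key(word, items, key, n=1, cutoff=0.75):
    """Hypothetical helper: like difflib.get_close_matches(), but it compares
    key(item) against word and returns (item, ratio) for the whole item."""
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)  # seq2 is cached, so set the fixed string there
    scored = []
    for index, item in enumerate(items):
        matcher.set_seq1(key(item))
        # Cheap upper bounds first; compute the exact ratio only if they pass.
        if (matcher.real_quick_ratio() >= cutoff and
                matcher.quick_ratio() >= cutoff):
            ratio = matcher.ratio()
            if ratio >= cutoff:
                # -index breaks ties, so the items themselves are never compared
                scored.append((ratio, -index, item))
    best = heapq.nlargest(n, scored)
    return [(item, ratio) for ratio, _, item in best]

# Made-up (ORIGINAL, FILTERED) pairs for illustration
pairs = [("acme univ", "acme"), ("acme university", "acme"), ("zenith", "zenith")]
result = closest_by_key("acme", pairs, key=lambda p: p[1], n=2)
for pair, ratio in result:
    print(pair, ratio)
```

Because whole items are returned, entries with identical FILTERED text stay distinct, and the extra fields (account, code) come along without a second lookup.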

For example, in place of the main loop, something like this:

seqm = difflib.SequenceMatcher()

for i in arrayC:
    origC, cleanC, caccount, ccode = i
    seqm.set_seq2(cleanC)

    bestRatio = 0

    for j in arrayA:
        origA, cleanA = j[:2]
        seqm.set_seq1(cleanA)

        if (seqm.real_quick_ratio() >= bestRatio and
                seqm.quick_ratio() >= bestRatio):
            r = seqm.ratio()
            if r >= bestRatio:
                bestRatio = r
                bestA = j

    if bestRatio >= 0.75:  # the cutoff from the original get_close_matches() call
        origA, cleanA, aaccount, acode = bestA

        filteredratio = bestRatio
        strfilteredratio = '%.2f' % (filteredratio*100)

        seqm.set_seqs(origC, origA)
        unfilteredratio = seqm.ratio()
        strunfilteredratio = '%.2f' % (unfilteredratio*100)

        averageratio = (filteredratio+unfilteredratio)/2
        straverageratio = '%.2f' % (averageratio*100)

        row = [origC.rstrip(),origA.rstrip(),cleanC.rstrip(),cleanA.rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
    else:
        row = ["NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","0.00","NULL","NULL"]

    Match75.writerow(row)
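A different approach, not the one shown above, is to keep calling get_close_matches() but group the entries by their FILTERED text first, so that every entry sharing a matched text can be recovered afterwards. This is only a sketch under that assumption; the function name `close_matches_with_originals` and the sample data are made up for illustration.

```python
import difflib
from collections import defaultdict

def close_matches_with_originals(word, pairs, n=1, cutoff=0.75):
    """Hypothetical helper: return every (ORIGINAL, FILTERED, ...) entry whose
    FILTERED text is among the n closest matches for word."""
    by_filtered = defaultdict(list)
    for pair in pairs:
        by_filtered[pair[1]].append(pair)  # duplicates land in the same bucket
    hits = difflib.get_close_matches(word, list(by_filtered), n, cutoff)
    return [pair for hit in hits for pair in by_filtered[hit]]

# Made-up entries; the first two share the same FILTERED text
sample_pairs = [("acme univ", "acme"), ("acme university", "acme"), ("zenith", "zenith")]
matches = close_matches_with_originals("acme", sample_pairs)
print(matches)
```

Note that n here counts distinct FILTERED strings, so a single hit can expand to several entries.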

I ended up going this route. I was hoping there would be a way to feed in a list of lists, like list[1][i] or something, but this works too. – MikeKusold 2011-05-13 01:23:55
