2010-08-25 36 views
3

我有一個列表蟒蛇:複雜的字符串算法

listcdtitles = 

[""" Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka) """, 
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro) """, 
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt) """, 
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams) """, 
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano) """, 
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """, 
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """, 
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour) """, 
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """, 
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """, 
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above) """, 
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine) """, 
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """, 
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """, 
..............................] 

我在這個列表

我想大約14000元聚成一團那些串在一起具有類似的話。

有關如何做到這一點的任何想法?我不認爲這是一個正確/錯誤的方式

非常感謝你提出的任何建議

+4

定義「聚成一團」稍微清楚請 – fredley 2010-08-25 23:19:11

+0

我希望他們能夠concatnoated之間較高的相似性,如果你需要的話,你也可以得到這個位置 – 2010-08-25 23:19:54

+0

類似於levenshtein/soundex?如果這樣比你可能需要在每個字符串之間創建一個距離矩陣。在列表中讀取它並使用'sorted()'方法。 – Wolph 2010-08-25 23:21:23

回答

2

我是python語言的新手,但我寫了一個示例代碼來計算該列表中條目之間的相似性分數。

代碼如下。

import re 
import array 

listcdtitles = [""" Liszt, Hungarian Rhapsody #6 {'Pesther Carneval'}; 2 Episodes from Lenau's 'Faust'; 'Hunnenschlacht' Symphonic Poem. (NW German Phil./ Kulka) """, 
""" Puccini, Verdi, Gounod, Bizet: Arias & Duets from Butterfly, Tosca, Boheme, Turandot, I Vespri, Faust, Carmen. (Fiamma Izzo d'Amico & Peter Dvorsky w.Berlin Radio Symph./Paternostro) """, 
""" Tchaikovsky, 'The Tempest' Fantasy. Liszt, Symphonic Poem #1. (London Symph./Butt) """, 
""" Duffy, John: 'Heritage: Civilization and the Jews'- Fanfare & Chorale, Symphonic Dances + Orchestral Suite. Bernstein, 'On the Town' Dance Episodes. (Royal Phil./R.Williams) """, 
""" Lilien, Ignace {1897-1963}: Songs, 1920-1935. (Anja van Wijk, mezzo & Frans van Ruth, piano) """, 
""" Hindemith, Trauermusik. Purcell, 'Fairy Queen' Suite. Rossini, String Sonata #6. Petrov, 'Creation of the World' Ballet Suite. Bartok, Romanian Folkdances Sz 56. Tartini, Flute Concerto in G {w.A.Maiorov} (Leningrad Orch.for Ancient & Modern Music/ Serov) """, 
""" Bizet, Verdi, Massenet, Puccini: Arias from Carmen, Rigoletto, Werther, Manon Lescaut, Tosca, Turandot + Songs by Lara, Di Capua et al. (Peter Dvorsky, tenor w.Bratislava Orch./Lenard {Also performing 'Carmen' Overt.& 'Thais' Meditation}. Rec.Live, 10/87) """, 
""" Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour) """, 
""" Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting) """, 
""" Gluck, Mozart, Beethoven, Weber, Verdi, Wagner, Ponchielli, Mascagni, Puccini: Arias from Alceste, Don Giovanni, Fidelio, Oberon, Ballo, Tristan, Walkure, Siegfried, Gotterdammerung, Gioconda, Cavalleria, Tosca. (Helene Wildbrunn. Rec.1919-24) """, 
""" Stanley, Wesley, Stubley, Boyce, Handel, Heron, Russell, Hook: '18th Century Organ Music on Period Instruments' (Same instruments and artist as above) """, 
""" Reimann, 'Unrevealed' for Baritone & String Quartet to Texts by Lord Byron {R.Salter w.Kreuzberger Quartet}; Variations for Piano (David Levine) """, 
""" Bruckner, Symphony #9. (Berlin Philharmonic/ Jochum. Rec. 'live', 11/28/77) """, 
""" Bruckner, Symphony #5. (Haas Edition. BBC Symph./ Horenstein. Rec.9/71) """] 

entryDictionary = {} 
i=0 
for entry in listcdtitles: 
    #remove unnecessary characters from the string 
    entry=re.sub(r'[^\w ]', '', entry.lower(), flags=re.IGNORECASE) 
    #split the entry into words and store it in the 
    entryDictionary[i]=entry.split(" ") 
    i=i+1 
# print the dictionary 
print("Entries") 
print(entryDictionary) 

# define a score matrix, compare the words in each entry and if 
# a word is same in both entries, that is one point 
scoreMatrix = [] 
for k in range(i): 
    scoreMatrix.append([]) 
    for j in range (i): 
     if j>k: 
      scoreMatrix[k].append(0) 
     else: 
      scoreMatrix[k].append("-") 
k=0 
j=0 

for k in range(i-1): 
    entry1 = entryDictionary[k] 
    for j in range(k+1,i): 
     entry2 = entryDictionary[j] 
     for kk in range(len(entry1)): 
      for jj in range(len(entry2)): 
       if entry1[kk] != "" and entry1[kk] == entry2[jj]: 
        scoreMatrix[k][j] = scoreMatrix[k][j] + 1 

print "Score Matrix (Higher numbers denote heigher similarity between two entries" 

print repr("").rjust(10), 
for k in range(i-1): 
    print repr("Entry " + str(k)).rjust(10), 
print repr("Entry " + str(i-1)).rjust(10) 

for k in range(i): 
    scoreMatrix.append([]) 
    print repr("Entry " + str(k)).rjust(10), 
    for j in range (i-1): 
     print repr(scoreMatrix[k][j]).rjust(10), 
    print repr(scoreMatrix[k][i-1]).rjust(10) 

結果如下: 得分矩陣(數字越大,表示兩個條目

 '' 'Entry 0' 'Entry 1' 'Entry 2' 'Entry 3' 'Entry 4' 'Entry 5' 'Entry 6' 'Entry 7' 'Entry 8' 'Entry 9' 'Entry 10' 'Entry 11' 'Entry 12' 'Entry 13' 
'Entry 0'  '-'   2   3   2   0   1   1   0   1   1   0   0   0   0 
'Entry 1'  '-'  '-'   0   0   0   0   11   0   2   5   0   0   0   0 
'Entry 2'  '-'  '-'  '-'   3   0   1   0   1   0   0   0   0   0   0 
'Entry 3'  '-'  '-'  '-'  '-'   0   4   0   2   0   0   2   0   0   0 
'Entry 4'  '-'  '-'  '-'  '-'  '-'   0   1   0   0   0   0   1   0   0 
'Entry 5'  '-'  '-'  '-'  '-'  '-'  '-'   0   3   1   0   1   1   0   0 
'Entry 6'  '-'  '-'  '-'  '-'  '-'  '-'  '-'   0   2   5   0   1   0   0 
'Entry 7'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'   0   0   0   0   0   0 
'Entry 8'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'   2   0   0   0   0 
'Entry 9'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'   0   0   0   0 
'Entry 10'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'   0   0   0 
'Entry 11'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'   0   0 
'Entry 12'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'   2 
'Entry 13'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-'  '-' 
+0

嘿zafer這是非常棒的你。但我真的不明白這是什麼?請你解釋一下 – 2010-08-26 23:07:51

+0

嗨, 我更新了代碼。它有一些錯誤。現在你可以看到矩陣。你會看到條目之間的相似度分數。 如何一個SIM卡。分數計算如下: 條目1:'a',B,D 條目2:A,c,X 得分爲1,因爲'a'和A. 這是一種簡單的方法。我希望它有幫助。 – Zafer 2010-08-27 00:00:46

+0

我已經刪除了以下行的條目: entry = re.sub(r'[^ \ w]','',entry.lower(),flags = re.IGNORECASE) 因此,該算法只是使用單詞計算。排除諸如「和」之類的詞可以被添加以提高質量。 – Zafer 2010-08-27 00:04:05

1

首先,解析這一切,每個令牌的頻率關聯。高頻標記必須列入黑名單。

然後,您必須比較字符串,迭代它們,並將元組關聯到距離分數。根據這個分數,你會連接它們 - 不然。

這將是一個簡單的方法來做到這一點。