2012-10-12 153 views

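Removing combinations of duplicate words in a text file with Python

With the help of eumiro's answer to "Delete duplicate rows in textfile - except it contains a "{" or "}"", I was able to remove the duplicate lines from a large text file. Going from a 60 MB to a 3 MB text file was a big step.

But now I want to remove duplicate words like these: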

@INBOOK{Miller1992, 
    author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland 
    S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and 
    Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    Miller, Rowland S. und Mark R. Leary}, 
    year = {1992}, 
    editor = {Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun 
    A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. 
    van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van 
    Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk 
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and 
    Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk and Teun 
    and Teun A. van Dijk and Teun A. van Dijk and Teun A. van Dijk}, 
    title = {Handbook of discourse analysis (Bd. 3/4)}, 

The result should look like this:

@INBOOK{Miller1992, 
    author = {Miller, Rowland S. und Mark R. Leary}, 
    year = {1992}, 
    editor = {Teun A. van Dijk}, 
    title = {Handbook of discourse analysis (Bd. 3/4)}, 

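The text file has 70,000 lines, and author names can appear in multiple entries. So only the duplicates inside the curly braces (which can span several lines) should be removed: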

author = {Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    R. Leary and Miller, Rowland S. und Mark R. Leary and Miller, Rowland 
    S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary and 
    Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark 
    Miller, Rowland S. und Mark R. Leary}, 

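I would like to modify my Python script, which removes duplicate lines, so that it removes the duplicate words inside curly braces, but I am stuck: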

words_seen = set()  # holds words already seen
outfile = open("literatur_clean.txt", "w")
for line in open("literatur_dupl.txt", "r"):
    if '{' in line or '}' in line:
        pass  # some code to check whether the words are duplicates
outfile.close()

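Answers

Judging from your current dataset, this does not look like a problem of duplicate words, but rather of an author or editor sometimes being repeated n times.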

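You could try splitting the string on " and ". Then you can check whether the resulting items are all the same (for example, by putting all the strings into a set, or by using them as dictionary keys). If the length of the set equals 1, you have removed all the duplicates. If not, " and " may also be part of an author's or editor's name, and you will have to merge those two pieces again.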
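The splitting idea can be sketched roughly as follows. This is only an illustration, not a complete solution: the function name `dedupe_names` is made up for this example, and it assumes the text between the braces has already been extracted into one string (possibly containing line breaks).

```python
def dedupe_names(field_value):
    """Split a brace field on " and " and drop repeated names, keeping order."""
    # Collapse line breaks and runs of whitespace so wrapped names compare equal.
    flat = " ".join(field_value.split())
    parts = [part.strip() for part in flat.split(" and ")]
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(parts))


value = """Teun A. van Dijk and Teun A. van Dijk and Teun
    A. van Dijk"""
names = dedupe_names(value)
print(names)  # ['Teun A. van Dijk']
```

If the returned list has exactly one item, every repetition was removed. If it has more, the field either really lists several people or is not cleanly separated by " and ". The author field in the question is such a case, because one of its line breaks lacks the "and".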

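If that does not work (for example, because the dataset is not as tidy as it appears), you could find the duplicates by looking for substring matches:

Miller, Rowland S. und Mark R. Leary and Miller, Rowland S. und Mark R. Leary
^                                        ^
1                                        2

Starting just after the beginning of the string, increment a pointer through the text. For each position, find the longest substring match with the beginning of the string. Save these matches.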
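A minimal sketch of this pointer scan, under a few assumptions: the field is already flattened to one line, the naive quadratic comparison is acceptable for short fields, and the names `prefix_match_lengths`, `name` and `unit` are invented for this illustration. How to use the saved matches is a choice made here (take the position with the longest match), not something the description above prescribes.

```python
def prefix_match_lengths(text):
    """For each position i > 0, length of the longest match between
    text[i:] and the beginning of text (only non-zero matches are kept)."""
    lengths = {}
    for i in range(1, len(text)):
        n = 0
        while i + n < len(text) and text[n] == text[i + n]:
            n += 1
        if n:
            lengths[i] = n
    return lengths


name = "Miller, Rowland S. und Mark R. Leary"
text = name + " and " + name
lengths = prefix_match_lengths(text)

# Assumption: the position with the longest match marks where the repeat starts.
best = max(lengths, key=lengths.get)
unit = text[:best].strip()
if unit.endswith(" and"):
    unit = unit[: -len(" and")]
print(unit)  # Miller, Rowland S. und Mark R. Leary
```

In this example the longest match starts at pointer 2 in the diagram, where the second "Miller" begins. Real entries with broken separators would probably need extra checks before the repeated unit can be trusted.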

Thanks for your answer. The first approach does not seem to fit very well, but I will try the second one. – StandardNerd