Removing duplicate rows

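I have a tab-delimited spreadsheet, and I'm trying to find a way to remove duplicate entries. Here is some made-up data in the same form as the data in the spreadsheet: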

name phone email website 
Diane Grant Albrecht M.S.   
"Lannister G. Cersei M.A.T., CEP" 111-222-3333 [email protected] www.got.com 
Argle D. Bargle Ed.M.   
Sam D. Man Ed.M. 000-000-1111 [email protected] www.daManWithThePlan.com 
Sam D. Man Ed.M.  
Sam D. Man Ed.M. 111-222-333  [email protected] www.daManWithThePlan.com 
D G Bamf M.S.   
Amy Tramy Lamy Ph.D.

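I would like the duplicate Sam D. Man rows merged into one row that keeps both phone numbers but doesn't store the two identical emails and the two identical websites.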

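The way I was thinking of doing this was to store the previous row and compare names. If the names match, compare the phone numbers. If the phone numbers don't match, append to the first row. Then compare emails. If the emails don't match, append to the first row. Then compare websites. If the websites don't match, append the second website to the first. Then delete the second row.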

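I don't know how to delete a row. Other posts seem to avoid actually deleting rows by writing the rows to a new file. But I think that would be a problem in my case, because I don't want to write rows with the same name twice.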
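1. Is there a more efficient way to loop? The nested for loop takes a while.
2. I can see myself running into problems with the index going past the end...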

Here is my code:

with(open('ieca_first_col_fake_text.txt', 'rU')) as f:
    sheet = csv.DictReader(f, delimiter = '\t')

# This function takes a tab-delim csv and merges the ones with the same name but different phone/email/websites.
def merge_duplicates(sheet):

    # Since duplicates immediately follow, store adjacent and compare. If the same name, append phone number
    for row in sheet:
        for other_row in sheet:
            if row['name'] == other_row['name']:
                if row['email'] != other_row['email']:
                    row['email'].append(other_row['email'])
                if row['website'] != other_row['website']:
                    row['website'].append(other_row['website'])

    # code to remove duplicate row
    # delete.() or something...

merge_duplicates(sheet)


2013-07-03 goldisfine

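I think you'll want to store the merged results somewhere (either in another file or temporarily in memory, depending on how much data you're working with); otherwise you'll be deleting rows from the file you're iterating over, which is generally considered a no-no. – erewok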
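A minimal sketch of the in-memory option mentioned in this comment, assuming the data fits comfortably in memory. The helper name `merge_by_name`, the `'; '` separator, the `sam@example.com` address and the `io.StringIO` stand-ins for real files are illustrative assumptions, not part of the original post. Each name is written exactly once, so no row ever has to be deleted, and duplicates do not need to be adjacent.

```python
import csv
import io

FIELDS = ['name', 'phone', 'email', 'website']

def merge_by_name(rows, sep='; '):
    """Collect rows into one merged row per name (hypothetical helper)."""
    merged = {}  # name -> {field: list of distinct non-empty values}
    for row in rows:
        entry = merged.setdefault(row['name'], {k: [] for k in FIELDS[1:]})
        for key in FIELDS[1:]:
            # short rows give None for missing fields, so fall back to ''
            value = (row.get(key) or '').strip()
            if value and value not in entry[key]:
                entry[key].append(value)
    result = []
    # dicts keep insertion order, so names come out in first-seen order
    for name, entry in merged.items():
        out_row = {'name': name}
        out_row.update({k: sep.join(v) for k, v in entry.items()})
        result.append(out_row)
    return result

# Made-up sample data standing in for the real tab-delimited file
sample = (
    "name\tphone\temail\twebsite\n"
    "Sam D. Man Ed.M.\t000-000-1111\tsam@example.com\twww.daManWithThePlan.com\n"
    "Sam D. Man Ed.M.\t\t\t\n"
    "Sam D. Man Ed.M.\t111-222-333\tsam@example.com\twww.daManWithThePlan.com\n"
    "D G Bamf M.S.\t\t\t\n"
)
rows = merge_by_name(csv.DictReader(io.StringIO(sample), delimiter='\t'))

# Write the merged rows to a new destination instead of editing the input
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter='\t')
writer.writeheader()
writer.writerows(rows)
```

Because each row is looked up by name in a dictionary, this is a single pass over the data rather than a nested loop.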

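OK, so that gets around the problem I mentioned. What happens when I iterate through the sample spreadsheet and reach the second instance of Sam D. Man? – goldisfine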

Have you considered using "pandas"? It offers many solutions to your problem. – LonelySoul

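In this case, depending on how large your 'sheet' is, it may be useful to convert the csv.DictReader object to a list, so that you can slice it and compare individual fields that way. I think your logic is sound when you say the following: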

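The way I was thinking of doing this was to store the previous row and compare names. 1) If the names match, 2) compare the phone numbers. If the phone numbers don't match, 3) append to the first row. 4) Then compare emails. 5) If the emails don't match, append to the first row. 6) Then compare websites. 7) If the websites don't match, append the second website to the first. Then delete the second row. (Not necessary, just skip it.)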
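The steps quoted above could be sketched as a single pass that compares each row with the last merged row. This assumes duplicates are always adjacent, as in the sample data; the function name `merge_adjacent`, the `', '` separator and the sample rows are hypothetical and chosen only for illustration.

```python
def merge_adjacent(rows, fields=('phone', 'email', 'website'), sep=', '):
    """Merge consecutive rows that share a name (assumes duplicates are adjacent)."""
    merged = []
    for row in rows:
        if merged and merged[-1]['name'] == row['name']:
            previous = merged[-1]
            for field in fields:
                new_value = row.get(field) or ''
                seen = previous[field].split(sep) if previous[field] else []
                # append only values that are non-empty and not already stored
                if new_value and new_value not in seen:
                    seen.append(new_value)
                    previous[field] = sep.join(seen)
            # the duplicate row is never appended, so nothing has to be deleted
        else:
            merged.append({'name': row['name'],
                           **{f: row.get(f) or '' for f in fields}})
    return merged

# Made-up rows in the same shape csv.DictReader would yield
rows = [
    {'name': 'Sam D. Man Ed.M.', 'phone': '000-000-1111',
     'email': 'sam@example.com', 'website': 'www.daManWithThePlan.com'},
    {'name': 'Sam D. Man Ed.M.', 'phone': '', 'email': '', 'website': ''},
    {'name': 'Sam D. Man Ed.M.', 'phone': '111-222-333',
     'email': 'sam@example.com', 'website': 'www.daManWithThePlan.com'},
    {'name': 'D G Bamf M.S.', 'phone': '', 'email': '', 'website': ''},
]
result = merge_adjacent(rows)
```

Skipping the duplicate instead of deleting it is what the "(Not necessary, just skip it.)" note refers to: the merged list only ever receives the first row for each name.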

Here is my (quickly written before work) suggestion:

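import csv

def merge_duplicates(sheet):
    mysheet = list(sheet)  # materialize the reader so it can be sliced and re-scanned

    for rowvalue, row in enumerate(mysheet):
        # slicing past the end just gives an empty list, so no IndexError guard is needed
        for other_row in mysheet[rowvalue + 1:]:
            if row['name'] == other_row['name']:  # check if it's a duplicate name
                other_row['delete'] = "duplicate row"  # add delete key for later sorting
                if row['email'] != other_row['email']:
                    row['alt_email'] = other_row['email']  # add new "alt_email" key to original row
                # test other fields here...
    return mysheet  # hand the flagged rows back to the caller

# csv.DictReader reads lazily, so the rows must be consumed while the file is still open
with open('ieca_first_col_fake_text.txt', newline='') as f:
    sheet = csv.DictReader(f, delimiter='\t')
    mysheet = merge_duplicates(sheet)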

Afterwards, you need to loop through the rows, ignore every one that has the "delete" key in it, and keep only those that don't.
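That final pass might look like the sketch below, which keeps only the rows without a "delete" key and writes them out. The sample rows, the `sam@example.com` address and the `io.StringIO` destination are stand-ins for illustration; `extrasaction='ignore'` tells `csv.DictWriter` to skip keys such as `alt_email` that are not listed in `fieldnames`.

```python
import csv
import io

# Made-up rows as they might look after merge_duplicates has flagged them
mysheet = [
    {'name': 'Sam D. Man Ed.M.', 'phone': '000-000-1111', 'email': 'sam@example.com',
     'website': 'www.daManWithThePlan.com', 'alt_email': ''},
    {'name': 'Sam D. Man Ed.M.', 'phone': '', 'email': '', 'website': '',
     'delete': 'duplicate row'},
    {'name': 'D G Bamf M.S.', 'phone': '', 'email': '', 'website': ''},
]

# Keep only the rows that were never marked as duplicates
kept = [row for row in mysheet if 'delete' not in row]

out = io.StringIO()  # stand-in for a real output file opened with newline=''
writer = csv.DictWriter(out, fieldnames=['name', 'phone', 'email', 'website'],
                        delimiter='\t', extrasaction='ignore')
writer.writeheader()
writer.writerows(kept)
```

To keep the alternate values in the output, the extra keys would have to be added to `fieldnames` instead of being ignored.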


2013-07-03 16:05:19 erewok
