2015-10-29 110 views
-1

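Python: removing elements from a list of strings

I want to remove elements from a list of strings (read from a file). The elements are themselves lists, in the form of comma-separated strings.

I want to remove from the list the strings that contain the same elements. For example: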

1:GGSIPU,RANK,BTECH,9

2:GGSIPU,BTECH,RANK,9

3:GGSIPU,BTECH,RANK,9

So lines 2 and 3 should be removed.

Here is my code:

# to remove duplicates 

with open('itemset3.txt', 'r') as f: 
    lines = f.readlines() 
    f.close() 

i = 0 

while (i<len(lines)): 
    j = i + 1 
    temp = [] 
    temp1 = lines[i].split(',') 
    print 'outer %d %s' % (i,temp1) 
    temp.append(temp1[0]) 
    temp.append(temp1[1]) 
    temp.append(temp1[2]) 
    while (j<len(lines)):
        if all(t in lines[j] for t in temp):
            print temp, ' found at ',j,': ',lines[j]
            # lines.remove(lines[j])
            del lines[j]
        j = j + 1
    i = i + 1 

f = open('itemset3.txt', 'w') 
i = 0 
while (i<len(lines)): 
    f.write(lines[i]) 
    i = i + 1 
f.close() 

And here is the text file:

GGSIPU,RANK,BTECH,9 
GGSIPU,BTECH,RANK,9 
GGSIPU,BTECH,RANK,9 
GGSIPU,SEMESTER,RANK,9 
GGSIPU,CALCULATOR,RANK,9 
GGSIPU,CHECK,RANK,7 
GGSIPU,Certified,RANK,7 
GGSIPU,Winner,RANK,7 
GGSIPU,Application,RANK,7 
GGSIPU,Techexpo2015,RANK,7 
GGSIPU,Students,RANK,6 
RANK,BTECH,GGSIPU,9 
RANK,BTECH,GGSIPU,9 
RANK,BTECH,GGSIPU,9 
RANK,SEMESTER,GGSIPU,9 
RANK,SEMESTER,GGSIPU,9 
RANK,CALCULATOR,GGSIPU,9 
RANK,CALCULATOR,GGSIPU,9 
RANK,CHECK,GGSIPU,7 
RANK,CHECK,GGSIPU,7 
RANK,Certified,GGSIPU,7 
RANK,Certified,GGSIPU,7 
RANK,Winner,GGSIPU,7 
RANK,Winner,GGSIPU,7 
RANK,Application,GGSIPU,7 
RANK,Application,GGSIPU,7 
RANK,Techexpo2015,GGSIPU,7 
RANK,Techexpo2015,GGSIPU,7 
RANK,Students,GGSIPU,6 
RANK,Students,GGSIPU,6 
BTECH,SEMESTER,GGSIPU,9 
BTECH,CALCULATOR,GGSIPU,9 
SEMESTER,CALCULATOR,GGSIPU,9 
CHECK,Certified,GGSIPU,7 
CHECK,Winner,GGSIPU,7 
CHECK,Application,GGSIPU,7 
CHECK,Techexpo2015,GGSIPU,7 
CHECK,Students,GGSIPU,6 
Certified,Winner,GGSIPU,7 
Certified,Application,GGSIPU,7 
Certified,Techexpo2015,GGSIPU,7 
Certified,Students,GGSIPU,6 
Winner,Application,GGSIPU,7 
Winner,Techexpo2015,GGSIPU,7 
Winner,Students,GGSIPU,6 
Application,Techexpo2015,GGSIPU,7 
Application,Students,GGSIPU,6 
Techexpo2015,Students,GGSIPU,6 

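The problem is that after running the code, some redundant (duplicate) lines are still in the output. How should I correct it?

Here is the output after converting the lines to tuples: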

('Certified', 'Winner', 'GGSIPU', '7') 
('RANK', 'Application', 'GGSIPU', '7') 
('Techexpo2015', 'Students', 'GGSIPU', '6') 
('CHECK', 'Certified', 'GGSIPU', '7') 
('RANK', 'SEMESTER', 'GGSIPU', '9') 
('Application', 'Techexpo2015', 'GGSIPU', '7') 
('GGSIPU', 'SEMESTER', 'RANK', '9') 
('CHECK', 'Techexpo2015', 'GGSIPU', '7') 
('RANK', 'Winner', 'GGSIPU', '7') 
('CHECK', 'Winner', 'GGSIPU', '7') 
('Winner', 'Students', 'GGSIPU', '6') 
('GGSIPU', 'Winner', 'RANK', '7') 
('GGSIPU', 'BTECH', 'RANK', '9') 
('RANK', 'Techexpo2015', 'GGSIPU', '7') 
('Certified', 'Students', 'GGSIPU', '6') 
('GGSIPU', 'CHECK', 'RANK', '7') 
('RANK', 'BTECH', 'GGSIPU', '9') 
('GGSIPU', 'Students', 'RANK', '6') 
('RANK', 'CALCULATOR', 'GGSIPU', '9') 
('Winner', 'Techexpo2015', 'GGSIPU', '7') 
('GGSIPU', 'Certified', 'RANK', '7') 
('RANK', 'CHECK', 'GGSIPU', '7') 
('CHECK', 'Application', 'GGSIPU', '7') 
('RANK', 'Certified', 'GGSIPU', '7') 
('GGSIPU', 'RANK', 'BTECH', '9') 
('GGSIPU', 'CALCULATOR', 'RANK', '9') 
('CHECK', 'Students', 'GGSIPU', '6') 
('GGSIPU', 'Application', 'RANK', '7') 
('GGSIPU', 'Techexpo2015', 'RANK', '7') 
('Winner', 'Application', 'GGSIPU', '7') 
('BTECH', 'SEMESTER', 'GGSIPU', '9') 
('Certified', 'Techexpo2015', 'GGSIPU', '7') 
('RANK', 'Students', 'GGSIPU', '6') 
('SEMESTER', 'CALCULATOR', 'GGSIPU', '9') 
('Certified', 'Application', 'GGSIPU', '7') 
('Application', 'Students', 'GGSIPU', '6') 
('BTECH', 'CALCULATOR', 'GGSIPU', '9') 

Lines such as the following still remain:

1: ('GGSIPU', 'Application', 'RANK', '7')

2: ('RANK', 'Application', 'GGSIPU', '7')
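
For reference, one likely reason the nested loop in the question leaves duplicates behind: after `del lines[j]` the next element slides into position `j`, but `j` is still incremented, so that element is never checked. The sketch below reproduces the effect on hypothetical sample data. `same_fields` is an illustrative helper that is not in the original code; it compares all fields regardless of order instead of using the question's substring test.

```python
# Hypothetical sample data: the first three lines should collapse to one.
lines = ['A,B,C,9', 'A,C,B,9', 'A,C,B,9', 'X,Y,Z,1']

def same_fields(a, b):
    # Illustrative helper: order-independent comparison of the fields.
    return sorted(a.strip().split(',')) == sorted(b.strip().split(','))

# Pattern from the question: delete, then increment j anyway.
buggy = list(lines)
j = 1
while j < len(buggy):
    if same_fields(buggy[0], buggy[j]):
        del buggy[j]
    j = j + 1  # skips the element that just slid into position j

# Adjusted pattern: advance j only when nothing was deleted.
fixed = list(lines)
j = 1
while j < len(fixed):
    if same_fields(fixed[0], fixed[j]):
        del fixed[j]
    else:
        j = j + 1

print(buggy)
print(fixed)
```

With this data the first loop keeps one of the two duplicate lines, while the second loop removes both.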

+1

I see a problem statement, a code sample, and sample input, but no question. –

+0

@Two-BitAlchemist 'I want to remove from the list the strings that contain the same elements' –

+0

The whole point of using 'with' when opening a file is that the context manager closes the file for you. – chepner
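
A small sketch of the point made in the comment above, assuming a throwaway temporary file (the path and file name are hypothetical and used only for the demonstration): the file object reports itself as closed once the `with` block exits, so an explicit `f.close()` is redundant.

```python
import os
import tempfile

# Hypothetical temporary file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'itemset_demo.txt')
with open(path, 'w') as f:
    f.write('GGSIPU,RANK,BTECH,9\n')

with open(path, 'r') as f:
    lines = f.readlines()
    inside = f.closed  # still open while inside the block

after = f.closed  # closed automatically when the block exits
print(inside, after)
```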

Answers

-1
Converting the lines into tuples and making a set.

from pprint import pprint as pp

allLines = set()

with open('data') as f:
    for line in f:
        # Strip the trailing newline, then split into a tuple of fields
        line = line.strip()
        line = tuple(line.split(','))
        allLines.add(line)

pp(allLines)



{('Application', 'Students', 'GGSIPU', '6'), 
('Application', 'Techexpo2015', 'GGSIPU', '7'), 
('BTECH', 'CALCULATOR', 'GGSIPU', '9'), 
('BTECH', 'SEMESTER', 'GGSIPU', '9'), 
('CHECK', 'Application', 'GGSIPU', '7'), 
('CHECK', 'Certified', 'GGSIPU', '7'), 
('CHECK', 'Students', 'GGSIPU', '6'), 
('CHECK', 'Techexpo2015', 'GGSIPU', '7'), 
('CHECK', 'Winner', 'GGSIPU', '7'), 
('Certified', 'Application', 'GGSIPU', '7'), 
('Certified', 'Students', 'GGSIPU', '6'), 
('Certified', 'Techexpo2015', 'GGSIPU', '7'), 
('Certified', 'Winner', 'GGSIPU', '7'), 
('GGSIPU', 'Application', 'RANK', '7'), 
('GGSIPU', 'BTECH', 'RANK', '9'), 
('GGSIPU', 'CALCULATOR', 'RANK', '9'), 
('GGSIPU', 'CHECK', 'RANK', '7'), 
('GGSIPU', 'Certified', 'RANK', '7'), 
('GGSIPU', 'RANK', 'BTECH', '9'), 
('GGSIPU', 'SEMESTER', 'RANK', '9'), 
('GGSIPU', 'Students', 'RANK', '6'), 
('GGSIPU', 'Techexpo2015', 'RANK', '7'), 
('GGSIPU', 'Winner', 'RANK', '7'), 
('RANK', 'Application', 'GGSIPU', '7'), 
('RANK', 'BTECH', 'GGSIPU', '9'), 
('RANK', 'CALCULATOR', 'GGSIPU', '9'), 
('RANK', 'CHECK', 'GGSIPU', '7'), 
('RANK', 'Certified', 'GGSIPU', '7'), 
('RANK', 'SEMESTER', 'GGSIPU', '9'), 
('RANK', 'Students', 'GGSIPU', '6'), 
('RANK', 'Techexpo2015', 'GGSIPU', '7'), 
('RANK', 'Winner', 'GGSIPU', '7'), 
('SEMESTER', 'CALCULATOR', 'GGSIPU', '9'), 
('Techexpo2015', 'Students', 'GGSIPU', '6'), 
('Winner', 'Application', 'GGSIPU', '7'), 
('Winner', 'Students', 'GGSIPU', '6'), 
('Winner', 'Techexpo2015', 'GGSIPU', '7')} 
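
Note that the set above still contains rows that differ only in field order. A minimal sketch of why, using two rows taken from that output: tuples compare position by position, whereas a `frozenset` ignores order. Using `frozenset` here is an illustration, not part of the answer's code, and it would also ignore repeated fields within a row.

```python
a = ('GGSIPU', 'Application', 'RANK', '7')
b = ('RANK', 'Application', 'GGSIPU', '7')

# Tuples compare position by position, so these are two distinct members.
as_tuples = {a, b}

# frozenset ignores order, so both rows collapse into a single member.
# Caveat: it also ignores repeated fields within a row.
as_frozensets = {frozenset(a), frozenset(b)}

print(a == b, len(as_tuples))
print(frozenset(a) == frozenset(b), len(as_frozensets))
```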
0
with open(r'C:\Users\DELL\Documents\itemset3.txt', 'r') as f:
    lines = f.readlines()  # the with block closes the file on exit

linesUp = []
for line in lines:
    # Drop the newline and split each line into a tuple of fields
    linesUp.append(tuple(line.replace("\n", "").split(',')))

setOfLines = set(linesUp)

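I built tuples from the comma-split strings and put them in a list. Then I created a set from that list, which eliminates the duplicates.

I used replace on the string line because, surprisingly, a few strings in your data have no newline.

I tested this on a small data set. I hope it works for you.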

+0

Will this solve my problem? – TheLinuxEvangelist

+0

Yes, it does solve your problem – saikumarm

+0

It did not solve the problem.. there are still duplicate tuples.. – TheLinuxEvangelist
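
The duplicates that remain differ only in the order of their fields, so one possible fix is to compare an order-independent key for each line. The sketch below is an assumption about what "same elements" means (all four fields equal, in any order). `dedupe_lines` and the sample list are hypothetical names and data, not taken from the thread's code. It keeps the first line seen for each combination and preserves the original line order.

```python
def dedupe_lines(lines):
    """Keep the first line for each order-independent combination of fields."""
    seen = set()
    kept = []
    for line in lines:
        fields = [field.strip() for field in line.split(',')]
        if fields == ['']:
            continue  # skip blank lines
        key = tuple(sorted(fields))  # same fields in any order give the same key
        if key not in seen:
            seen.add(key)
            kept.append(','.join(fields))
    return kept

# Hypothetical sample in the same shape as the question's file.
sample = [
    'GGSIPU,RANK,BTECH,9 \n',
    'GGSIPU,BTECH,RANK,9 \n',
    'RANK,BTECH,GGSIPU,9 \n',
    'GGSIPU,Application,RANK,7 \n',
    'RANK,Application,GGSIPU,7 \n',
    'BTECH,SEMESTER,GGSIPU,9',
]
result = dedupe_lines(sample)
for row in result:
    print(row)
```

To apply it to the file, the list returned by `readlines()` can be passed in and the result written back with something like `f.write('\n'.join(result) + '\n')` inside a `with open('itemset3.txt', 'w') as f:` block.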