2013-04-05 55 views
3

I want to remove the duplicate words from a text file.

I have some text files containing content like the following:

None_None 

ConfigHandler_56663624 
ConfigHandler_56663624 
ConfigHandler_56663624 
ConfigHandler_56663624 

None_None 

ColumnConverter_56963312 
ColumnConverter_56963312 

PredicatesFactory_56963424 
PredicatesFactory_56963424 

PredicateConverter_56963648 
PredicateConverter_56963648 

ConfigHandler_80134888 
ConfigHandler_80134888 
ConfigHandler_80134888 
ConfigHandler_80134888 

The required output is:

None_None 

ConfigHandler_56663624 

ColumnConverter_56963312 

PredicatesFactory_56963424 

PredicateConverter_56963648 

ConfigHandler_80134888 

I tried this command: `en = set(open('file.txt'))`, but it doesn't work.

Can anyone help me extract the unique set from the file?

Thanks

+0

Take a look at this similar question - http://stackoverflow.com/questions/10860190 – ton1c 2013-04-05 09:29:03

Answers

4

Here is an option that preserves order (unlike a set) but still has the same behaviour (note that the EOL characters are deliberately stripped and blank lines are ignored)...

from collections import OrderedDict 

with open('/home/jon/testdata.txt') as fin: 
    lines = (line.rstrip() for line in fin) 
    unique_lines = OrderedDict.fromkeys((line for line in lines if line)) 

print(list(unique_lines)) 
# ['None_None', 'ConfigHandler_56663624', 'ColumnConverter_56963312', 'PredicatesFactory_56963424', 'PredicateConverter_56963648', 'ConfigHandler_80134888'] 

Then you just need to write the above out to a file.
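That final write could be sketched as follows (a minimal sketch; the file names and sample data are hypothetical stand-ins for the poster's file):

```python
from collections import OrderedDict

# Hypothetical stand-in for the poster's file.
with open('testdata.txt', 'w') as f:
    f.write('None_None\n\nConfigHandler_56663624\nConfigHandler_56663624\n')

with open('testdata.txt') as fin:
    lines = (line.rstrip() for line in fin)
    unique_lines = OrderedDict.fromkeys(line for line in lines if line)

# Write one unique entry per line to the output file.
with open('deduped.txt', 'w') as fout:
    fout.write('\n'.join(unique_lines) + '\n')
```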

+0

(line for line in line if line)? WTF! :) – 2017-12-08 06:44:57

1

Here is how you can do it with sets (unordered result):

from pprint import pprint 

with open('input.txt', 'r') as f: 
    pprint(set(f.readlines())) 

Also, you probably want to get rid of the newline characters.
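For instance, stripping the newline as each line goes into the set keeps `'foo\n'` and `'foo'` from counting as different entries (a sketch; the sample data is hypothetical):

```python
# Set comprehension that strips trailing newlines and skips blank lines.
raw = ['None_None\n', '\n', 'None_None\n', 'ConfigHandler_56663624\n']
unique = {line.rstrip('\n') for line in raw if line.strip()}
```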

0

If you just want to get non-duplicated output, you can use `sort` and `uniq`:

[email protected]: /tmp() $ sort -nr dup | uniq 
PredicatesFactory_56963424 
PredicateConverter_56963648 
None_None 
ConfigHandler_80134888 
ConfigHandler_56663624 
ColumnConverter_56963312 
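As a side note, `sort -u` folds the `uniq` step into `sort` itself, and the `-n` (numeric) flag isn't needed for these non-numeric lines. A sketch on inline sample data (not the poster's actual file):

```shell
# sort -u sorts and deduplicates in a single pass,
# equivalent to `sort file | uniq`.
printf 'b\na\nb\n' | sort -u
```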

For Python:

In [2]: with open("dup", 'rt') as f: 
    lines = f.readlines() 
    ...:  

In [3]: lines 
Out[3]: 
['None_None\n', 
'\n', 
'ConfigHandler_56663624\n', 
'ConfigHandler_56663624\n', 
'ConfigHandler_56663624\n', 
'ConfigHandler_56663624\n', 
'\n', 
'None_None\n', 
'\n', 
'ColumnConverter_56963312\n', 
'ColumnConverter_56963312\n', 
'\n', 
'PredicatesFactory_56963424\n', 
'PredicatesFactory_56963424\n', 
'\n', 
'PredicateConverter_56963648\n', 
'PredicateConverter_56963648\n', 
'\n', 
'ConfigHandler_80134888\n', 
'ConfigHandler_80134888\n', 
'ConfigHandler_80134888\n', 
'ConfigHandler_80134888\n'] 

In [4]: set(lines) 
Out[4]: 
set(['ColumnConverter_56963312\n', 
    '\n', 
    'PredicatesFactory_56963424\n', 
    'ConfigHandler_56663624\n', 
    'PredicateConverter_56963648\n', 
    'ConfigHandler_80134888\n', 
    'None_None\n']) 
6

Here is a simple solution that uses sets to remove duplicates from a text file.

lines = open('workfile.txt', 'r').readlines() 

lines_set = set(lines) 

out = open('workfile.txt', 'w') 

for line in lines_set: 
    out.write(line) 

out.close() 
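One caveat with this answer: iterating over a set loses the original line order. On Python 3.7+, plain dicts preserve insertion order, so `dict.fromkeys` gives the same deduplication while keeping first-seen order (a sketch on hypothetical in-memory lines):

```python
# dict.fromkeys drops repeats but keeps first-seen order (Python 3.7+).
lines = ['b\n', 'a\n', 'b\n', 'a\n']
unique_in_order = list(dict.fromkeys(lines))
```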
0
import json 
myfile = json.load(open('yourfile', 'r')) 
uniq = set() 
for p in myfile: 
    if p in uniq: 
        print("duplicate : " + p) 
    else: 
        uniq.add(p) 
print(uniq) 
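Note that `json.load` only succeeds if the file really contains JSON; the poster's file is plain text, so reading lines directly seems closer to the goal. A sketch of the same duplicate-detection loop over plain lines (the data is hypothetical):

```python
# Track duplicates while building the set of unique entries.
lines = ['None_None', 'ConfigHandler_56663624', 'ConfigHandler_56663624']
uniq = set()
dupes = []
for p in lines:
    if p in uniq:
        dupes.append(p)  # seen before: record it as a duplicate
    else:
        uniq.add(p)
```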
0

This way you get the same file back that you put in, with the duplicates removed (via a temporary file):

import os 
import uuid 

def _remove_duplicates(filePath): 
    f = open(filePath, 'r') 
    lines = f.readlines() 
    lines_set = set(lines) 
    tmp_file = str(uuid.uuid4()) 
    out = open(tmp_file, 'w') 
    for line in lines_set: 
        out.write(line) 
    out.close() 
    f.close() 
    os.rename(tmp_file, filePath) 
0
def remove_duplicates(infile): 
    storehouse = set() 
    with open('outfile.txt', 'w+') as out: 
        for line in open(infile): 
            if line not in storehouse: 
                out.write(line) 
                storehouse.add(line) 

remove_duplicates('infile.txt') 
+0

While this code may answer the question, providing additional context on why and/or how it answers the question would improve its long-term value. Code-only answers are discouraged. – Ajean 2016-07-29 17:33:44
