2013-04-27 33 views
0

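Python - determining the frequency of strings and further processing

I have some text files containing a variable number of columns, separated by \t (tab). Something like this: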

value1x1 . . . . . . value1xn 
    .  . . . . . . value2xn 
    .  . . . . . .  . 
valuemx1 . . . . . . valuemxn 

I can scan the file and determine the frequency of each value with the following code:

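# Count how often each whitespace-separated value occurs in the input file.
with open("input_raw", 'r') as f:
    list_content = f.read().split()
counts = {}  # renamed from `dict`, which shadows the built-in type
for one_word in list_content:
    counts[one_word] = counts.get(one_word, 0) + 1
# `func` was never defined in the post; sorting by count is an assumption.
a = str(sorted(counts.items(), key=lambda item: item[1], reverse=True))
with open("out_freq.txt", 'w') as f2:
    f2.write(a)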

The output looks like this:

('26047', 13), ('42810', 13), ('61080', 13), ('106395', 13), ('102395', 13)... 

The syntax is ('value', occurrence_number), and it works as expected. What I want to achieve is:

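  1. Convert the output syntax to ('value', occurrence_number, column_number), where column_number is the column of input_raw.txt in which the value occurs.

  2. Use the column number and the occurrence count of each group of values to separate the columns, and write them to different files.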

+4

What about `collections.Counter`? – squiguy 2013-04-27 17:01:35
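
A minimal sketch of the `collections.Counter` approach this comment points to, assuming the values are whitespace-separated as in the question. The helper name `count_values` and the sample string are made up for illustration. `Counter` replaces the two explicit loops:

```python
from collections import Counter

def count_values(text):
    """Return (value, count) pairs, most frequent first."""
    # Counter tallies every whitespace-separated token in a single pass.
    counts = Counter(text.split())
    # most_common() returns the pairs sorted by count, descending.
    return counts.most_common()

# Illustrative tab-separated sample, not data from the question.
sample = "26047\t42810\t26047\n61080\t26047\t42810\n"
print(count_values(sample))
```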

+1

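If you want to keep track of column information, *why* aren't you reading the file line by line, or at least processing the content line by line? Also, what happens if the same key appears more than once in different columns? – Bakuriu 2013-04-27 17:02:22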

+0

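for line in "input_raw" if search_string in line: I use this expression to read line by line, but once search_string is found it stops scanning the current line. This does not work when the same search_string occurs in different columns of input_raw. – y33t 2013-04-27 17:07:18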
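
The problem described in this comment can be avoided by checking every column of a line instead of stopping at the first match. A minimal sketch, assuming tab-separated columns and 0-based column numbers; `find_columns` is a hypothetical helper, not code from this thread:

```python
def find_columns(lines, search_string):
    """Yield (line_number, column_number) for every exact match."""
    for line_number, line in enumerate(lines):
        # Split on tabs and keep scanning after a match, so a value that
        # appears in several columns of one line is reported each time.
        for column_number, value in enumerate(line.rstrip("\n").split("\t")):
            if value == search_string:
                yield line_number, column_number

# Illustrative rows; a file object opened on input_raw would work the same way.
rows = ["10\t11\t10\n", "9\t10\t7\n"]
print(list(find_columns(rows, "10")))
```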

Answers

0

If I understand correctly, you want something like the following:

import itertools as it 
from collections import Counter 

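with open("input_raw", 'r') as fin, open("out_freq.txt", 'w') as fout:
    # enumerate() pairs each value with its column index,
    # so the Counter keys are (column, value) tuples.
    counts = Counter(it.chain.from_iterable(enumerate(line.split())
                                            for line in fin))
    # Most frequent (column, value) pairs first.
    sorted_items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    # Emit (value, occurrence_number, column_number); int() assumes numeric values.
    a = ', '.join(str((int(key[1]), val, key[0])) for key, val in sorted_items)
    fout.write(a)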

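Note that this code uses tuples as keys, so that identical values are counted separately when they appear in different columns. It is not clear from your question whether that can happen, or what should be done in that case.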
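
This answer does not cover the second goal of the question. One possible reading, and it is only an assumption, is that the counts for each column should go to a separate file. A minimal sketch under that assumption; the function name and the `out_col_<n>.txt` naming scheme are invented for illustration:

```python
import os
import tempfile
from collections import Counter, defaultdict

def write_counts_per_column(lines, out_dir):
    """Write one file per column holding (value, count, column) tuples."""
    # One Counter per column index, created on first use.
    per_column = defaultdict(Counter)
    for line in lines:
        for column, value in enumerate(line.split()):
            per_column[column][value] += 1
    paths = []
    for column, counts in sorted(per_column.items()):
        # Hypothetical naming scheme: out_col_0.txt, out_col_1.txt, ...
        path = os.path.join(out_dir, "out_col_%d.txt" % column)
        with open(path, "w") as fout:
            fout.write(", ".join(str((value, count, column))
                                 for value, count in counts.most_common()))
        paths.append(path)
    return paths

# Illustrative rows written to a temporary directory.
rows = ["10 11 12", "10 9 7", "9 8 12"]
print(write_counts_per_column(rows, tempfile.mkdtemp()))
```

Keeping one `Counter` per column means a value that shows up in two columns is tallied independently in each, which matches the tuple-key behaviour described above.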

Usage example:

>>> import itertools as it 
>>> from collections import Counter 
>>> def get_sorted_items(fileobj): 
...  counts = Counter(it.chain.from_iterable(enumerate(line.split()) for line in fileobj)) 
...  return sorted(counts.items(), key=lambda x:x[1], reverse=True) 
... 
>>> data = """ 
... 10 11 12 13 14 
... 10 9 7 6 4 
... 9 8 12 13 0 
... 10 21 33 6 1 
... 9 9 7 13 14 
... 1 21 7 13 0 
... """ 
>>> with open('input.txt', 'wt') as fin: #write data to the input file 
...  fin.write(data) 
... 
>>> with open('input.txt', 'rt') as fin: 
...  print ', '.join(str((int(key[1]), val, key[0])) for key, val in get_sorted_items(fin)) 
... 
(13, 4, 3), (10, 3, 0), (7, 3, 2), (14, 2, 4), (6, 2, 3), (9, 2, 0), (0, 2, 4), (9, 2, 1), (21, 2, 1), (12, 2, 2), (8, 1, 1), (1, 1, 4), (1, 1, 0), (33, 1, 2), (4, 1, 4), (11, 1, 1) 
+0

I get a = ', '.join(str((key[1], val, key[0])) for key, val in sorted_items) TypeError: 'enumerate' object has no attribute '__getitem__' – y33t 2013-04-28 10:14:18

+0

@y33t I forgot to convert it to a tuple. It should work now. – Bakuriu 2013-04-29 10:45:35

+0

`fout.close` is redundant... – 2013-04-29 10:54:34