2012-10-24 44 views

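Strange behavior/bug in Python itertools groupby?

I am using itertools.groupby to parse a short tab-delimited text file. The file has several columns, and all I want to do is group all the entries that have a particular value x in a particular column. The code below does this for a column called name2, looking for the value stored in the variable x. I tried to do this with csv.DictReader and itertools.groupby. In the table, 8 rows match this criterion, so 8 entries should be returned. Instead, groupby returns two sets of entries, one with a single entry and another with 7, which looks like wrong behavior. Below I do the matching manually on the same data and get the correct result: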

import itertools, operator, csv

# f (the input file path) and fieldnames are defined elsewhere (not shown)
col_name = "name2"
x = "ENSMUSG00000002459"
print("looking for entries with value %s in column %s" % (x, col_name))
print("groupby gets it wrong: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
for name, entries in itertools.groupby(data, key=operator.itemgetter(col_name)):
    if name == x:
        wrong_result = list(entries)
        print("wrong result has %d entries" % len(wrong_result))
print("manually grouping entries is correct: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
correct_result = []
for row in data:
    if row[col_name] == x:
        correct_result.append(row)
print("correct result has %d entries" % len(correct_result))

The output I get is:

looking for entries with value ENSMUSG00000002459 in column name2 
groupby gets it wrong: 
wrong result has 7 entries 
wrong result has 1 entries 
manually grouping entries is correct: 
correct result has 8 entries 

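What is going on here? If groupby is really grouping, it seems I should get only one set of entries for x, but it returns two. I cannot figure this out.

EDIT: Ah, got it: the data should be sorted.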


This is a common misconception about how `groupby()` works, but it is documented behavior. I suggest you read the documentation more carefully.

Answers


You will want to change your code so that the data is forced into key order before grouping:

data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames) 
sorted_data = sorted(data, key=operator.itemgetter(col_name)) 
for name, entries in itertools.groupby(sorted_data, key=operator.itemgetter(col_name)):
    pass # whatever 
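
A minimal, self-contained sketch of the sort-then-group approach follows. The question's file `f` and its `fieldnames` are not shown, so the in-memory data, the column names `name1`/`name2`, and the variable names here are assumptions made purely for illustration.

```python
import csv
import io
import itertools
import operator

# Hypothetical tab-delimited data; the two matching rows are deliberately
# separated by a row with a different key.
raw = (
    "a\tENSMUSG00000002459\n"
    "b\tOTHER\n"
    "c\tENSMUSG00000002459\n"
)
fieldnames = ["name1", "name2"]  # assumed column names
get_key = operator.itemgetter("name2")

rows = csv.DictReader(io.StringIO(raw), delimiter="\t", fieldnames=fieldnames)

# Sorting by the grouping key makes equal keys adjacent, so groupby
# yields each key exactly once.
sorted_rows = sorted(rows, key=get_key)
grouped = {
    name: list(entries)
    for name, entries in itertools.groupby(sorted_rows, key=get_key)
}

print(len(grouped["ENSMUSG00000002459"]))
```

Because `sorted` is stable, rows that share a key keep their original relative order inside each group.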

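The main use of groupby, though, is when the dataset is large and the data is already in key order. If you would have to sort anyway, using a defaultdict is more efficient: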

from collections import defaultdict 
name_entries = defaultdict(list) 
for row in data: 
    name_entries[row[col_name]].append(row) 
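
The same idea as a runnable sketch. The `rows` list below is a hypothetical stand-in for what `csv.DictReader` would yield from the question's file, which is not shown.

```python
from collections import defaultdict

# Hypothetical rows; matching keys are not adjacent and no sorting is done.
rows = [
    {"name1": "a", "name2": "ENSMUSG00000002459"},
    {"name1": "b", "name2": "OTHER"},
    {"name1": "c", "name2": "ENSMUSG00000002459"},
]
col_name = "name2"

# One pass over the data: each row is appended to the list for its key.
name_entries = defaultdict(list)
for row in rows:
    name_entries[row[col_name]].append(row)

# .get() avoids creating an empty list for a key that never occurred.
matches = name_entries.get("ENSMUSG00000002459", [])
print(len(matches))
```

This takes a single linear pass instead of an O(n log n) sort, at the cost of holding every group in memory at once.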

According to the documentation, groupby() only groups consecutive occurrences of the same key.
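
To illustrate the consecutive-run behavior, here is a small sketch. The key order below is an assumption chosen to reproduce a 7/1 split like the one in the question; the actual row order of the question's file is not known.

```python
import itertools

# Hypothetical key column: seven matching rows, one unrelated row,
# then one more matching row.
keys = ["x"] * 7 + ["y"] + ["x"]

# groupby starts a new group every time the key changes, so "x" appears twice.
runs = [(key, len(list(group))) for key, group in itertools.groupby(keys)]
print(runs)

# After sorting, equal keys are adjacent and "x" forms a single group of 8.
sorted_runs = [(key, len(list(group))) for key, group in itertools.groupby(sorted(keys))]
print(sorted_runs)
```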