2013-09-27 48 views
0

How to form clusters based on a condition?

Sample input file (the actual input file contains about 50,000 entries):

615 146 
615 180 
615 53 
615 42 
615 52 
615 52 
615 51 
615 45 
615 49 
616 34 
616 44 
616 42 
616 41 
616 42 
617 42 
617 43 
617 42 
685 33 
685 33 
685 33 
686 33 
686 33 
687 47 
687 68 
737 449 
737 41 
737 1138 
738 46 
738 53 

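I have to compare the values in column 0: rows with the same value, such as 615 615 615, must be grouped together, and the cluster must contain their column 1 values (the second column), i.e. 146, 180, ..., 45, 49. The cluster must then break, and another cluster must be formed for the next set of identical values 616 616 616, and so on.

The code I wrote is: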

from __future__ import division 
from sys import exit 
h = 0 
historyjobs = [] 
targetjobs = [] 


def quickzh(zhlistsub,
            targetjobs=targetjobs,num=0,denom=0):

    li = [] ; ji = []
    j = 0
    for i in zhlistsub:
        x1 = targetjobs[j][0]

        x = targetjobs[i][0]

        num += x
        denom += 1
        if x1 >= 0.9 * (num/denom):#to group all items with same value in column 0
            li.append(targetjobs[i][1])
        else:
            break
    return li


def filewr(listli):
    global h
    s = open("newout1","a")
    if(len(listli) != 0):
        h += 1
        s.write("cluster: %d"%h)
        s.write("\n")
        s.write(str(listli))
        s.write("\n\n")
    else:
        print "0"


def new(inputfile,
        historyjobs=historyjobs,targetjobs=targetjobs):
    zhlistsub = [];zhlist = []
    k = 0

    with open(inputfile,'r') as f:
        for line in f:
            job = map(int,line.split())
            targetjobs.append(job)
    while True:
        if len(targetjobs) != 0:

            zhlistsub = [i for i, element in enumerate(targetjobs)]

            if zhlistsub:
                listrun = quickzh(zhlistsub)
                filewr(listrun)
            historyjobs.append(targetjobs.pop(0))
            k += 1
        else:
            break

new('newfinal1') 

The output I get is:

cluster: 1 
[146, 180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53] 

cluster: 2 
[180, 53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53] 

cluster: 3 
[53, 42, 52, 52, 51, 45, 49, 34, 44, 42, 41, 42, 42, 43, 42, 33, 33, 33, 33, 33, 47, 68, 449, 41, 1138, 46, 53] 
... and so on

But I need the output to be:

cluster: 1
[146, 180, 53, 42, 52, 52, 51, 45, 49]

cluster: 2
[34, 44, 42, 41, 42]

cluster: 3
[42, 43, 42]
... and so on

Can anyone suggest what changes I should make to the condition to get the required result? It would be really helpful.

+3

I'm having a really hard time understanding what you need... but typically for grouping, 'itertools.groupby' or 'collections.defaultdict' is the way to go... – mgilson

Answers

1

Try this. groupby takes care of creating the groups, so all that's left to do is build the lists:

import itertools as it 
[[y[1] for y in x[1]] for x in it.groupby(data, key=lambda x:x[0])] 

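The above assumes that data holds your input and that it has already been filtered and sorted by the first column. For the example in the question, it looks like this: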

data = [[615, 146], [615, 180], [615, 53] ... ] 
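
For completeness, a fuller sketch of the same idea follows. It assumes the input is whitespace-separated text with two integer columns, already ordered by the first column. The helper names `parse_rows` and `clusters_by_first_column` are made up for illustration, and the sample is only the first few kinds of rows from the question, written for Python 3.

```python
import itertools as it

def parse_rows(lines):
    """Turn 'key value' text lines into [key, value] integer pairs."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            rows.append([int(fields[0]), int(fields[1])])
    return rows

def clusters_by_first_column(rows):
    """Group consecutive rows that share the same first column."""
    return [[row[1] for row in group]
            for _, group in it.groupby(rows, key=lambda row: row[0])]

# Shortened sample in the same layout as the question's input file.
sample = """615 146
615 180
615 53
616 34
616 44
617 42
""".splitlines()

clusters = clusters_by_first_column(parse_rows(sample))
for number, values in enumerate(clusters, start=1):
    print("cluster: %d" % number)
    print(values)
```

To run it on the real file, the lines could come from `open('newfinal1')` instead of the inline sample, since iterating over a file object yields its lines.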
+0

Can you suggest a condition to use in my if statement, 'if x1 >= 0.9 * (num/denom):', that gives the required result? –

+0

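My answer helps with building the clusters, but it's not clear how that condition is supposed to filter the values. I can only suggest splitting the problem into two parts: first filter the input and build a list like 'data' in my example, then build the clusters with the list comprehension above –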
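
On the condition itself, one possible reading is that the test only needs to check whether column 0 is unchanged. In the posted code, `x1 >= 0.9 * (num/denom)` compares the first key with a running mean of column 0. On the first pass over the sample data that mean never rises above roughly 654, so 0.9 times it stays below 615 and the loop never breaks. The outer loop also pops only one row per pass, which is why each printed cluster starts one row later than the previous one. A minimal sketch under that reading, with hypothetical names `take_run` and `split_into_clusters`, could look like this:

```python
def take_run(rows, start):
    """Collect column-1 values while column 0 equals rows[start][0].

    Returns the collected values and the index of the first row
    that belongs to the next cluster.
    """
    key = rows[start][0]
    values = []
    i = start
    while i < len(rows) and rows[i][0] == key:
        values.append(rows[i][1])
        i += 1
    return values, i

def split_into_clusters(rows):
    """Walk the rows once, jumping past each whole run of equal keys."""
    clusters = []
    position = 0
    while position < len(rows):
        values, position = take_run(rows, position)
        clusters.append(values)
    return clusters

# A few rows in the shape of the 'data' list above.
rows = [[615, 146], [615, 180], [615, 53], [616, 34], [616, 44], [617, 42]]
clusters = split_into_clusters(rows)
print(clusters)
```

The equality test replaces the 0.9 threshold, and skipping past the whole run replaces the single `pop(0)`. If a tolerance on column 0 is really intended, the comparison inside `take_run` is the place to change.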

1

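I haven't tested this answer, but follow this concept:

from collections import defaultdict

inputfile = 'newfinal1'  # input file name taken from the question

cluster = defaultdict(list)

with open(inputfile, 'r') as f:
    for line in f:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        clus, val = int(fields[0]), int(fields[1])
        cluster[clus].append(val)

# sort by key so the clusters print in a predictable order
for clus, val in sorted(cluster.items()):
    print("cluster " + str(clus) + "\n")
    print(str(val) + "\n")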
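
One difference between the two answers may matter on the real 50,000-line file: `itertools.groupby` only merges adjacent rows, while a `defaultdict` collects every row with the same key wherever it appears. On sorted input such as the question's sample they agree. The rows below are made up and deliberately unsorted to show where they diverge.

```python
from collections import defaultdict
from itertools import groupby

# Made-up rows: the key 615 appears twice, but not next to itself.
rows = [(615, 146), (616, 34), (615, 180)]

# groupby starts a new group every time the key changes.
runs = [[value for _, value in group]
        for _, group in groupby(rows, key=lambda row: row[0])]

# defaultdict gathers all values for a key into one list.
buckets = defaultdict(list)
for key, value in rows:
    buckets[key].append(value)

print(runs)           # one list per consecutive run
print(dict(buckets))  # one list per distinct key
```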