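Turning each point into a tuple instead of a list lets us use the point as a dictionary key (keys must be hashable), with the index of its cluster of origin as the value: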
data = [[10,10],[20,20],[5,3],[7,2],[90,78]]
clusters = [[[10,10], [20,20]], [[5,3], [7,2], [90,78]]]
cd = dict((tuple(p), i) for i, cl in enumerate(clusters) for p in cl)
output = [cd[tuple(p)] for p in data] # output = list(map(lambda p: cd[tuple(p)], data))
print(output)
produces
[0, 0, 1, 1, 1]
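As a small illustration of why the conversion is needed, here is a sketch showing that a list is rejected as a dictionary key while the equivalent tuple is accepted. The variable names are made up for this example.

```python
point_as_list = [10, 10]
point_as_tuple = tuple(point_as_list)

# Lists are mutable and therefore unhashable: using one as a key raises TypeError.
try:
    {point_as_list: 0}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# The tuple with the same coordinates works as a key.
lookup = {point_as_tuple: 0}

print(list_is_hashable)   # False
print(lookup[(10, 10)])   # 0
```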
Performance
Out of curiosity, I compared my solution with @Back2Basics' to see how the two scale as the amount of data grows.
The following utility functions generate a data set of N x N points and partition it into clusters of equal size:
def gen_data(n):
    return [[x, y] for x in range(n) for y in range(n)]
def gen_clusters(data, c_size):
    g = [iter(data)]*c_size
    return [list(c) for c in zip(*g)]
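The `[iter(data)]*c_size` line in gen_clusters may look odd at first. A minimal sketch of what it does, using plain integers as stand-ins for points: the list holds several references to the same iterator, so zip pulls consecutive items into each chunk.

```python
data = list(range(7))   # placeholder items standing in for points
c_size = 3

shared = iter(data)
g = [shared] * c_size   # three references to the SAME iterator, not three copies

# zip takes one item from each reference in turn, which advances the single
# underlying iterator, so every tuple holds c_size consecutive items.
chunks = [list(c) for c in zip(*g)]

print(chunks)   # [[0, 1, 2], [3, 4, 5]]
```

Note that zip stops at the first exhausted iterator, so a trailing partial chunk (the 6 here) is dropped. In the timings below the cluster size always divides the number of points, so nothing is lost.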
These are the two solutions:
def ck_pynchia(data, clusters):
    cd = dict((tuple(p), i) for i, cl in enumerate(clusters) for p in cl)
    output = [cd[tuple(p)] for p in data]
    return output
def ck_b2b(data, clusters):
    gen = (count for count, y in enumerate(clusters) for x in data if x in y)
    output = list(gen)
    return output
Now let's time them.
With a data set of 10 * 10 points
python3 -mtimeit 'import clusters as cl; data = cl.gen_data(10); clusters = cl.gen_clusters(data, 10); cl.ck_pynchia(data,clusters)'
10000 loops, best of 3: 73.2 usec per loop
python3 -mtimeit 'import clusters as cl; data = cl.gen_data(10); clusters = cl.gen_clusters(data, 10); cl.ck_b2b(data,clusters)'
1000 loops, best of 3: 330 usec per loop
With 50 * 50 points
python3 -mtimeit 'import clusters as cl; data = cl.gen_data(50); clusters = cl.gen_clusters(data, 50); cl.ck_pynchia(data,clusters)'
1000 loops, best of 3: 1.59 msec per loop
python3 -mtimeit 'import clusters as cl; data = cl.gen_data(50); clusters = cl.gen_clusters(data, 50); cl.ck_b2b(data,clusters)'
10 loops, best of 3: 177 msec per loop
With 100 * 100 points
python3 -mtimeit 'import clusters as cl; data = cl.gen_data(100); clusters = cl.gen_clusters(data, 100); cl.ck_pynchia(data,clusters)'
100 loops, best of 3: 6.37 msec per loop
python3 -mtimeit 'import clusters as cl; data = cl.gen_data(100); clusters = cl.gen_clusters(data, 100); cl.ck_b2b(data,clusters)'
10 loops, best of 3: 2.82 sec per loop
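As we can see, as the data set grows the computational complexity of the two algorithms turns out to be quite different: in these timings the dictionary lookup grows roughly in proportion to the number of points, while the nested scan grows roughly with its square.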
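For anyone who wants to repeat the comparison without a separate clusters module, here is a sketch that does it in-process with the timeit module. The helper name time_both and the chosen sizes are my own for this example. Unlike the command lines above, it times only the lookup, not the data generation, so the numbers will not match exactly.

```python
import timeit

def gen_data(n):
    return [[x, y] for x in range(n) for y in range(n)]

def gen_clusters(data, c_size):
    g = [iter(data)] * c_size
    return [list(c) for c in zip(*g)]

def ck_pynchia(data, clusters):
    # One pass to build the point -> cluster index mapping, one pass to look up.
    cd = {tuple(p): i for i, cl in enumerate(clusters) for p in cl}
    return [cd[tuple(p)] for p in data]

def ck_b2b(data, clusters):
    # For every cluster, scan every point and test membership in that cluster.
    return [count for count, y in enumerate(clusters) for x in data if x in y]

def time_both(n, number=3):
    """Average seconds per call for each solution on an n x n data set."""
    data = gen_data(n)
    clusters = gen_clusters(data, n)
    t_dict = timeit.timeit(lambda: ck_pynchia(data, clusters), number=number) / number
    t_scan = timeit.timeit(lambda: ck_b2b(data, clusters), number=number) / number
    return t_dict, t_scan

for n in (10, 20, 40):
    t_dict, t_scan = time_both(n)
    print(f"{n * n:5d} points: dict {t_dict:.6f}s, scan {t_scan:.6f}s")
```

Absolute times depend on the machine and Python version. The point is how the ratio between the two changes as n grows.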
One last note: if you can store the points as tuples rather than lists, things get slightly simpler and faster:
data = [(10,10),(20,20),(5,3),(7,2),(90,78)]
clusters = [[(10,10), (20,20)], [(5,3), (7,2), (90,78)]]
cd = dict((p,i) for i, cl in enumerate(clusters) for p in cl)
output = [cd[p] for p in data] # output = list(map(lambda p: cd[p], data))
Rewriting 'enumerate' :) Interesting approach :). By the way, I think your solution mimics @Back2Basics' – Pynchia