This is about your point that each column has roughly 20 distinct values, except for one with 400. If memory and load time aren't a concern, I'd suggest building, for every column, a set of rows per value; intersecting those sets then answers a multi-column query quickly.

Here's a script that generates a sample dataset:
```python
#!/usr/bin/python
from random import sample, choice
from cPickle import dump

# Generate a sample dataset
value_ceiling = 1000
dataset_size = 900000
dataset_filename = 'dataset.pkl'

# Number of distinct values per column
col_distrib = [400, 20, 20, 20, 20, 20, 20]
col_values = [sample(xrange(value_ceiling), x) for x in col_distrib]

dataset = []
for _ in xrange(dataset_size):
    dataset.append(tuple(choice(x) for x in col_values))

dump(dataset, open(dataset_filename, 'wb'))
```
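As a quick sanity check (a minimal sketch, assuming the generator script above has already been run), you can reload the pickle and count the distinct values per column; the counts should match `col_distrib`:

```python
#!/usr/bin/python
from cPickle import load

dataset = load(open('dataset.pkl', 'rb'))
print len(dataset)  # 900000
for i in range(len(dataset[0])):
    # Distinct values per column: 400 for column 0, 20 for the rest.
    print i, len(set(row[i] for row in dataset))
```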
And here's a script that loads the dataset, builds a lookup set for every value of every column, and provides the search method plus a helper that generates sample searches:
```python
#!/usr/bin/python
from random import sample, choice
from cPickle import load

dataset_filename = 'dataset.pkl'

class DataSearch(object):
    def __init__(self, filename):
        self.data = load(open(filename, 'rb'))
        # One dict per column, mapping each value to the set of rows
        # that contain it in that column.
        self.col_sets = [dict() for x in self.data[0]]
        self.process_data()

    def process_data(self):
        for row in self.data:
            for i, v in enumerate(row):
                self.col_sets[i].setdefault(v, set()).add(row)

    def search(self, *args):
        # args are integers, sequences of integers, or None in the
        # corresponding column positions; None means "any value".
        results = []
        for i, v in enumerate(args):
            if v is None:
                continue
            elif isinstance(v, int):
                results.append(self.col_sets[i].get(v, set()))
            else:  # sequence of values: union the per-value sets
                r = [self.col_sets[i].get(x, set()) for x in v]
                r = reduce(set.union, r[1:], r[0])
                results.append(r)
        # Intersect starting from the smallest set to minimise the work.
        results.sort(key=len)
        results = reduce(set.intersection, results[1:], results[0])
        return results

    def sample_search(self, *args):
        # Build a random query: each arg is None (unconstrained) or the
        # number of values to sample for that column.
        search = []
        for i, v in enumerate(args):
            if v is None:
                search.append(None)
            else:
                search.append(sample(self.col_sets[i].keys(), v))
        return search

d = DataSearch(dataset_filename)
```
And using it:
```
>>> d.search(*d.sample_search(1,1,1,5))
set([(117, 557, 273, 437, 639, 981, 587), (117, 557, 273, 170, 53, 640, 467), (117, 557, 273, 584, 459, 127, 649)])
>>> d.search(*d.sample_search(1,1,1,1))
set([])
>>> d.search(*d.sample_search(10,None,1,1,1,1))
set([(801, 334, 414, 283, 107, 990, 221)])
>>> d.search(*d.sample_search(10,None,1,1,1,1))
set([])
>>> d.search(*d.sample_search(10,None,1,1,1,1))
set([(193, 307, 547, 549, 901, 940, 343)])
>>> import timeit
>>> timeit.Timer('d.search(*d.sample_search(10,None,1,1,1,1))','from __main__ import d').timeit(100)
1.787431001663208
```
Is 1.8 seconds for 100 searches not fast enough?
It does sound like a database isn't the best way to store this. Have you thought about using [Cython](http://cython.org/)? With static type annotations added, storage becomes more efficient (you can use machine words) and it's considerably faster (for the same reason). – delnan 2011-01-07 09:44:09
When you tried SQLite, was your database in memory or on disk? If it was in a file, try `:memory:`; an in-memory SQLite database is usually much faster for a task like this. And are all of your columns indexed? – 2011-01-07 09:47:52
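For comparison, a minimal sketch of the in-memory SQLite variant this comment suggests, assuming the pickled dataset from above; the table name `data` and the column names `c0`..`c6` are made up for illustration:

```python
#!/usr/bin/python
import sqlite3
from cPickle import load

dataset = load(open('dataset.pkl', 'rb'))

conn = sqlite3.connect(':memory:')  # the whole database lives in RAM, no disk I/O
cols = ['c%d' % i for i in range(7)]
conn.execute('CREATE TABLE data (%s)' % ', '.join('%s INTEGER' % c for c in cols))
conn.executemany('INSERT INTO data VALUES (?, ?, ?, ?, ?, ?, ?)', dataset)
for c in cols:
    # Index every column so single-column lookups stay fast.
    conn.execute('CREATE INDEX idx_%s ON data (%s)' % (c, c))

rows = conn.execute('SELECT * FROM data WHERE c0 = ? AND c3 IN (?, ?)',
                    (117, 437, 584)).fetchall()
```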
I agree that a DB solution isn't really the best option here; I was using it more as a benchmarking tool. I'm doing a C extension as an experiment. I'd still like to know, however, whether there's an efficient way to do this in NumPy, which is unknown territory for me. – c00kiemonster 2011-01-07 09:49:29
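On the NumPy question: one possible approach, sketched here without benchmarks, is to hold the dataset in a 2-D integer array and answer each query with boolean masks, with `np.in1d` covering the sequence case. The `np_search` helper below follows the same argument convention as `DataSearch.search` and is only an illustration:

```python
#!/usr/bin/python
import numpy as np
from cPickle import load

data = np.array(load(open('dataset.pkl', 'rb')))  # shape (900000, 7)

def np_search(data, *args):
    # Each arg is an int, a sequence of ints, or None (unconstrained),
    # in the corresponding column position.
    mask = np.ones(len(data), dtype=bool)
    for i, v in enumerate(args):
        if v is None:
            continue
        elif isinstance(v, int):
            mask &= data[:, i] == v
        else:
            # Rows whose column-i value is any of the given values.
            mask &= np.in1d(data[:, i], list(v))
    return data[mask]
```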