What is the best way to aggregate data from the NDB datastore?

I have a StatisticStore model defined as:

from google.appengine.ext import ndb

class StatisticStore(ndb.Model):
    user = ndb.KeyProperty(kind=User)
    created = ndb.DateTimeProperty(auto_now_add=True)
    kind = ndb.StringProperty()
    properties = ndb.PickleProperty()

    @classmethod
    def top_links(cls, user, start_date, end_date):
        '''
        Returns the user's top links for the given date range,
        e.g.
        {'http://stackoverflow.com': 30,
         'http://google.com': 10,
         'http://yahoo.com': 15}
        '''
        stats = cls.query(
            cls.user == user.key,
            cls.created >= start_date,
            cls.created <= end_date,
            cls.kind == 'link_visited'
        )
        links_dict = {}
        # generate links_dict from stats
        # keys are from the 'properties' property
        return links_dict
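
The placeholder comments in top_links could be filled in roughly as follows. This is only a minimal sketch, not the asker's code: it assumes each record's properties value is a dict holding the visited URL under a hypothetical 'url' key, and it uses a namedtuple in place of datastore entities so that it runs outside App Engine. In NDB the iterable would be the query itself.

```python
from collections import Counter, namedtuple

# Stand-in for a StatisticStore entity (hypothetical, for illustration only).
Stat = namedtuple("Stat", "kind properties")


def count_links(stats):
    """Count visits per URL from an iterable of StatisticStore-like records.

    Assumption: each record has a `properties` dict with a 'url' key.
    """
    links = Counter()
    for stat in stats:
        # Records are consumed one at a time, so memory grows with the
        # number of distinct URLs rather than the number of records.
        links[stat.properties["url"]] += 1
    return dict(links)


sample = [
    Stat("link_visited", {"url": "http://stackoverflow.com"}),
    Stat("link_visited", {"url": "http://google.com"}),
    Stat("link_visited", {"url": "http://stackoverflow.com"}),
]
link_counts = count_links(sample)
```

Because the function only iterates, it never needs the full result set in memory at once, which is relevant to the memory concern raised further down.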

I would like to have an AggregateStatisticStore model that stores a daily aggregate of StatisticStore. It could be generated once per day. Something like:

class AggregateStatisticStore(ndb.Model): 
    user = ndb.KeyProperty(kind=User) 
    date = ndb.DateProperty() 
    kinds_count = ndb.PickleProperty() 
    top_links = ndb.PickleProperty()

So that the following would hold true:

start = datetime.datetime(2013, 8, 22, 0, 0, 0)
end = datetime.datetime(2013, 8, 22, 23, 59, 59)

aug22stats = StatisticStore.query(
    StatisticStore.user == user.key,
    StatisticStore.kind == 'link_visited',
    StatisticStore.created >= start,
    StatisticStore.created <= end
).count()
aug22toplinks = StatisticStore.top_links(user, start, end)

# .get() returns the first matching entity (or None)
aggregated_aug22stats = AggregateStatisticStore.query(
    AggregateStatisticStore.user == user.key,
    AggregateStatisticStore.date == start.date()
).get()

aug22stats == aggregated_aug22stats.kinds_count['link_visited']
aug22toplinks == aggregated_aug22stats.top_links

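I was thinking of just running a cron job with the task queue API. The task would generate each day's AggregateStatisticStore. But I'm worried it might run into memory issues. It looks like StatisticStore could have a lot of records per user.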

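Also, the top_links property complicates things a bit. I'm not sure that having it as a property on the aggregate model is the best approach. Any suggestions about that property would be great.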

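Ultimately I only want to keep StatisticStore records from about the last 30 days. If a record is older than 30 days, it should be aggregated (and then deleted), to save space and improve query times for visualization.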
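
The 30-day roll-up described above could look something like the following. This is a rough sketch under stated assumptions: plain namedtuples stand in for StatisticStore entities, the roll_up_old helper and the user name are hypothetical, and the actual datastore writes and deletes are left out.

```python
import datetime
from collections import Counter, namedtuple

# Stand-in for a StatisticStore entity (hypothetical, for illustration only).
Stat = namedtuple("Stat", "user created kind")

RETENTION = datetime.timedelta(days=30)


def roll_up_old(records, now):
    """Split records at the 30-day cutoff and aggregate the old ones.

    Returns (kept, kinds_count), where kinds_count maps (user, date) to a
    Counter of kinds, mirroring AggregateStatisticStore.kinds_count.
    """
    cutoff = now - RETENTION
    kept = []
    kinds_count = {}
    for rec in records:
        if rec.created >= cutoff:
            kept.append(rec)  # recent enough, keep the raw record
            continue
        key = (rec.user, rec.created.date())
        kinds_count.setdefault(key, Counter())[rec.kind] += 1
    return kept, kinds_count


now = datetime.datetime(2013, 8, 22, 12, 0, 0)
records = [
    Stat("alice", datetime.datetime(2013, 7, 1, 9, 0), "link_visited"),
    Stat("alice", datetime.datetime(2013, 7, 1, 10, 0), "link_visited"),
    Stat("alice", datetime.datetime(2013, 8, 21, 9, 0), "link_visited"),
]
kept, rolled = roll_up_old(records, now)
```

Only the records that fall before the cutoff are folded into per-day counts; those are the ones that could then be deleted.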

EDIT: What if, every time a StatisticStore is recorded, it creates/updates the appropriate AggregateStatisticStore record? That way, all the cron job has to do is clean up. Thoughts?
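
The idea in the edit, updating the daily aggregate whenever a statistic is recorded, might be sketched like this. It is an illustration only: an in-memory dict stands in for AggregateStatisticStore, the record_stat helper, the 'url' key, and the 'page_viewed' kind are hypothetical, and on the datastore the read-modify-write would likely need to run inside a transaction.

```python
import datetime
from collections import Counter, defaultdict

# In-memory stand-in for AggregateStatisticStore, keyed by (user, date).
aggregates = defaultdict(
    lambda: {"kinds_count": Counter(), "top_links": Counter()}
)


def record_stat(user, kind, properties, created):
    """Record one event and update that day's aggregate in the same step."""
    agg = aggregates[(user, created.date())]
    agg["kinds_count"][kind] += 1
    # Assumption: link events carry the URL under a 'url' key.
    if kind == "link_visited":
        agg["top_links"][properties["url"]] += 1
    return agg


when = datetime.datetime(2013, 8, 22, 10, 30)
record_stat("alice", "link_visited", {"url": "http://google.com"}, when)
record_stat("alice", "link_visited", {"url": "http://google.com"}, when)
record_stat("alice", "page_viewed", {}, when)
day = aggregates[("alice", datetime.date(2013, 8, 22))]
```

With this shape, the aggregate for a day is always current, and the scheduled job only has to delete raw records past the retention window.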

2013-08-22 john2x

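Have you looked at the mapreduce API? That said, updating AggregateStatisticStore whenever StatisticStore is updated is probably the better idea. You may want to shard AggregateStatisticStore, though that depends on your performance needs. If StatisticStore isn't updated frequently for a given user, sharding may not be necessary. – dragonx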

Yes, I've looked at mapreduce, but I'm having a hard time getting to grips with it. What do you mean by "sharding" AggregateStatisticStore? – john2x
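
For readers with the same question: "sharding" here presumably refers to the sharded-counter pattern, where one logical count is split across several records so that frequent writes do not all contend for the same entity. The following is a minimal in-memory sketch of that pattern, not dragonx's code. NUM_SHARDS and the helper names are hypothetical, and a dict stands in for the shard entities.

```python
import random

NUM_SHARDS = 5  # hypothetical value; would be tuned to the write rate

# In-memory stand-in: shard index -> partial count for one aggregate.
shards = {i: 0 for i in range(NUM_SHARDS)}


def increment(shards, rng=random):
    """Add one to a randomly chosen shard to spread out write contention."""
    index = rng.randrange(len(shards))
    shards[index] += 1


def total(shards):
    """The logical count is the sum over all shards."""
    return sum(shards.values())


for _ in range(100):
    increment(shards)
count = total(shards)
```

Writes become cheaper because they touch different records, at the cost of reading every shard to get the total.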

@john2x I thought you couldn't filter on multiple properties in a query? Does your query work? – Erevald

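Yes, mapreduce would work well for this. Alternatively, you could run your cron job on a "backend" (now modules) instance. That may alleviate both the memory issue and the job-length issue.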

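Another approach might be to move the aggregation to write time. Since this is per user, you may find that you eliminate a lot of work that way. If AggregateStatisticStore is per day, you may want to use something other than DateProperty for the date. DateProperty will certainly work, but I've found it easier to handle this kind of thing with an IntegerProperty, where the int is just an amount of time "since" some reference point.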
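
One way to read the IntegerProperty suggestion is to store the day as a whole number of days counted from a fixed reference date. The sketch below assumes the Unix epoch as that reference point, which the answer does not specify; the helper names are hypothetical.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)  # assumed reference point


def day_number(d):
    """Convert a date to a whole number of days since EPOCH."""
    return (d - EPOCH).days


def from_day_number(n):
    """Convert a day number back to a date."""
    return EPOCH + datetime.timedelta(days=n)


aug22 = day_number(datetime.date(2013, 8, 22))
aug23 = day_number(datetime.date(2013, 8, 23))
```

Consecutive days map to consecutive integers, so date ranges become plain integer ranges.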

2013-08-22 14:02:46 Jay

By "moving the aggregation to write time", do you mean something like my edit? Thanks, good point about DateProperty. – john2x

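By write time, I mean when the user adds a link, votes on a link, or does whatever makes it a "top link". When they do that, you could create a task to update the aggregate. Again, this would be instead of using cron to do the same thing. – Jay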

To aggregate related data:

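Change StatisticStore and AggregateStatisticStore to have user.key as their parent. This means removing user = ndb.KeyProperty(kind=User) from each model, creating each entity with parent=user.key, and passing ancestor=user.key to query(). NDB is good at aggregating data that shares the same parent.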

2013-08-22 21:00:44

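If the AggregateStatisticStore records are independent of each other, you don't need MapReduce. If you can run a loop over the users, just run one taskqueue process per user and write one record. That is just the "map" phase.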

If you can break it down into more parallel tasks, create more taskqueue processes. "Parallelize" it!
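
The per-user fan-out can be illustrated outside App Engine with a thread pool standing in for the task queue. This is a sketch of the shape of the work only: ThreadPoolExecutor is a swapped-in substitute, the sample data and the 'page_viewed' kind are made up, and on App Engine each aggregate_user call would instead be enqueued as its own task.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stand-in data: user -> list of event kinds recorded for one day.
events_by_user = {
    "alice": ["link_visited", "link_visited", "page_viewed"],
    "bob": ["link_visited"],
}


def aggregate_user(user):
    """Work for one task: read one user's records, produce one aggregate."""
    return user, dict(Counter(events_by_user[user]))


# Each user is independent of the others, so the jobs can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(aggregate_user, events_by_user))
```

No "reduce" step is needed because no result depends on another user's data.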

2013-08-22 21:11:00
