2016-02-02 264 views

Cassandra: finding the partition key

We are currently testing Cassandra with the following table schema:

CREATE TABLE coreglead_v2.stats_by_site_user (
    d_tally text, -- ex.: '2016-01', '2016-02', etc.. 
    site_id int, 
    d_date timestamp, 
    site_user_id int, 
    accepted counter, 
    error counter, 
    impressions_negative counter, 
    impressions_positive counter, 
    rejected counter, 
    revenue counter, 
    reversals_rejected counter, 
    reversals_revenue counter, 
    PRIMARY KEY (d_tally, site_id, d_date, site_user_id) 
) WITH CLUSTERING ORDER BY (site_id ASC, d_date ASC, site_user_id ASC) 
    AND bloom_filter_fp_chance = 0.01 
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} 
    AND comment = '' 
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'} 
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'} 
    AND crc_check_chance = 1.0 
    AND dclocal_read_repair_chance = 0.1 
    AND default_time_to_live = 0 
    AND gc_grace_seconds = 864000 
    AND max_index_interval = 2048 
    AND memtable_flush_period_in_ms = 0 
    AND min_index_interval = 128 
    AND read_repair_chance = 0.0 
    AND speculative_retry = '99PERCENTILE'; 

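For testing purposes, we have written a Python script that randomises data across the 2016 calendar year (12 months in total). We expect our partition key to be the d_tally column, and we therefore expect the number of keys to be 12 (from '2016-01' to '2016-12').

Running nodetool cfstats, however, shows us the following: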
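The expectation of 12 partition keys can be checked with a small sketch. This is not the original test script; the helper names `d_tally_for` and `partition_keys_for_year` are hypothetical, and the sketch assumes d_tally is simply the year and month of each row's date.

```python
from datetime import date, timedelta


def d_tally_for(day):
    """Hypothetical helper: derive the d_tally value ('YYYY-MM') from a date."""
    return day.strftime("%Y-%m")


def partition_keys_for_year(year):
    """Collect the distinct d_tally values produced by every day of a year."""
    keys = set()
    day = date(year, 1, 1)
    while day.year == year:
        keys.add(d_tally_for(day))
        day += timedelta(days=1)
    return keys


keys_2016 = partition_keys_for_year(2016)
print(sorted(keys_2016))
print(len(keys_2016))  # 12
```

Because d_tally is the only partition key column, every row written for 2016 falls into one of these 12 partitions, regardless of site_id, d_date or site_user_id.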

Table: stats_by_site_user 
     SSTable count: 4 
     Space used (live): 131977793 
     Space used (total): 131977793 
     Space used by snapshots (total): 0 
     Off heap memory used (total): 89116 
     SSTable Compression Ratio: 0.18667406304929424 
     Number of keys (estimate): 24 
     Memtable cell count: 120353 
     Memtable data size: 23228804 
     Memtable off heap memory used: 0 
     Memtable switch count: 10 
     Local read count: 169 
     Local read latency: 1.938 ms 
     Local write count: 4912464 
     Local write latency: 0.066 ms 
     Pending flushes: 0 
     Bloom filter false positives: 0 
     Bloom filter false ratio: 0.00000 
     Bloom filter space used: 128 
     Bloom filter off heap memory used: 96 
     Index summary off heap memory used: 76 
     Compression metadata off heap memory used: 88944 
     Compacted partition minimum bytes: 5839589 
     Compacted partition maximum bytes: 43388628 
     Compacted partition mean bytes: 16102786 
     Average live cells per slice (last five minutes): 102.91627247589237 
     Maximum live cells per slice (last five minutes): 103 
     Average tombstones per slice (last five minutes): 1.0 
     Maximum tombstones per slice (last five minutes): 1 

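What confuses us is the "Number of keys (estimate): 24" part. Looking at our schema, and given that our test data (over 5 million writes) consists only of 2016 data, where does the estimate of 24 keys come from?

Here is an example of our data: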

 d_tally | site_id | d_date                   | site_user_id | accepted | error | impressions_negative | impressions_positive | rejected | revenue | reversals_rejected | reversals_revenue
---------+---------+--------------------------+--------------+----------+-------+----------------------+----------------------+----------+---------+--------------------+-------------------
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |       240054 |        1 |  null |                 null |                    1 |     null |     553 |               null |              null
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |      1263968 |        1 |  null |                 null |                    1 |     null |    1093 |               null |              null
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |      1267841 |        1 |  null |                 null |                    1 |     null |     861 |               null |              null
 2016-01 |       1 | 2016-01-01 00:00:00+0000 |      1728725 |        1 |  null |                 null |                    1 |     null |     425 |               null |              null

See also: http://stackoverflow.com/questions/27963951/understanding-number-of-keys-in-nodetool-cfstats

Answer

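The number of keys is an estimate (although it should be very close). Cassandra takes a sketch of the data in each sstable and merges those sketches together to estimate the cardinality (HyperLogLog).

Unfortunately, no equivalent exists for the memtable, so the cardinality of the memtable is added to the sstable estimate. This means that any key present in both the memtables and the sstables gets counted twice. That is why you see 24 instead of 12.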
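
The double counting can be illustrated with a minimal sketch. It is not Cassandra's code: exact Python sets stand in for the HyperLogLog sketches, and the function name `estimate_keys` is hypothetical. It assumes that all 12 monthly keys are present both in the flushed sstables and in the current memtable.

```python
def estimate_keys(sstable_key_sets, memtable_keys):
    """Mimic the estimate described above.

    Exact sets replace the per-sstable HyperLogLog sketches (an assumption
    made for readability); merging them removes duplicates across sstables.
    """
    merged = set()
    for keys in sstable_key_sets:
        merged |= keys  # sketch merge: a key seen in several sstables counts once
    # The memtable has no sketch, so its key count is added rather than merged.
    return len(merged) + len(memtable_keys)


months = {f"2016-{m:02d}" for m in range(1, 13)}

# Four sstables (matching "SSTable count: 4"), each holding all 12 partitions.
sstables = [set(months) for _ in range(4)]
memtable = set(months)

estimate = estimate_keys(sstables, memtable)
print(estimate)  # 24
```

With an empty memtable (for example right after a flush), the same function returns 12, which is the number of partitions actually present.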