2014-03-27

DSE 4.0.1: Hive count differs from Cassandra count

We are running DataStax Enterprise 4.0.1 and hitting a very strange problem when inserting rows into Cassandra and then running a count(1) query against them from Hive.

Setup: DSE 4.0.1, Cassandra 2.0, Hive, brand-new cluster. We insert 10,000 rows into Cassandra and then run:

cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000; 

count 
------- 
10000 

(1 rows) 

cqlsh:pageviews> 
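For reference, the rows were inserted with statements along these lines (the values and literals here are hypothetical, matching the pageviews_v1 schema shown further down):

```sql
-- Hypothetical insert matching the pageviews_v1 schema;
-- actual values used in the test run are not shown in the question.
INSERT INTO pageviews.pageviews_v1
    (website, date, created, browser_id, ip, referer, user_agent)
VALUES
    ('example.com', '2014-03-27', '2014-03-27 23:00:00', 'b-0001',
     '10.0.0.1', 'http://example.org/', 'Mozilla/5.0');
```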

But from Hive:

hive> select count(1) from pageviews_v1 limit 100000; 
Total MapReduce jobs = 1 
Launching Job 1 out of 1 
Number of reduce tasks determined at compile time: 1 
In order to change the average load for a reducer (in bytes): 
    set hive.exec.reducers.bytes.per.reducer=<number> 
In order to limit the maximum number of reducers: 
    set hive.exec.reducers.max=<number> 
In order to set a constant number of reducers: 
    set mapred.reduce.tasks=<number> 
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002 
Kill Command = /usr/bin/dse hadoop job -kill job_201403272330_0002 
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1 
2014-03-27 23:38:22,129 Stage-1 map = 0%, reduce = 0% 
<snip> 
2014-03-27 23:38:49,324 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.31 sec 
MapReduce Total cumulative CPU time: 11 seconds 310 msec 
Ended Job = job_201403272330_0002 
MapReduce Jobs Launched: 
Job 0: Map: 4 Reduce: 1 Cumulative CPU: 11.31 sec HDFS Read: 0 HDFS Write: 0 SUCCESS 
Total MapReduce CPU Time Spent: 11 seconds 310 msec 
OK 
1723 
Time taken: 38.634 seconds, Fetched: 1 row(s) 

Only 1723 rows. I am confused. The CQL3 column family definition is:

CREATE TABLE pageviews_v1 (
    website text, 
    date text, 
    created timestamp, 
    browser_id text, 
    ip text, 
    referer text, 
    user_agent text, 
    PRIMARY KEY ((website, date), created, browser_id) 
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND 
    bloom_filter_fp_chance=0.001000 AND 
    caching='KEYS_ONLY' AND 
    comment='' AND 
    dclocal_read_repair_chance=0.000000 AND 
    gc_grace_seconds=864000 AND 
    index_interval=128 AND 
    read_repair_chance=1.000000 AND 
    replicate_on_write='true' AND 
    populate_io_cache_on_flush='false' AND 
    default_time_to_live=0 AND 
    speculative_retry='NONE' AND 
    memtable_flush_period_in_ms=0 AND 
    compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND 
    compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'}; 

And in Hive:

CREATE EXTERNAL TABLE pageviews_v1(
    website string COMMENT 'from deserializer', 
    date string COMMENT 'from deserializer', 
    created timestamp COMMENT 'from deserializer', 
    browser_id string COMMENT 'from deserializer', 
    ip string COMMENT 'from deserializer', 
    referer string COMMENT 'from deserializer', 
    user_agent string COMMENT 'from deserializer') 
ROW FORMAT SERDE 
    'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe' 
STORED BY 
    'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' 
WITH SERDEPROPERTIES (
    'serialization.format'='1', 
    'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua') 
LOCATION 
    'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1' 
TBLPROPERTIES (
    'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner', 
    'cassandra.ks.name'='pageviews', 
    'cassandra.cf.name'='pageviews_v1', 
    'auto_created'='true') 

Has anyone experienced something similar?

Answers

The problem appears to be related to CLUSTERING ORDER BY. Removing it from the table definition resolves the issue of Hive misreporting the COUNT.
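As a minimal sketch, that means recreating the table with the same primary key but without the clustering order clause (the other table options are left at their defaults here for brevity):

```sql
-- Same schema as pageviews_v1 minus the CLUSTERING ORDER BY clause.
CREATE TABLE pageviews_v1 (
    website text,
    date text,
    created timestamp,
    browser_id text,
    ip text,
    referer text,
    user_agent text,
    PRIMARY KEY ((website, date), created, browser_id)
);
```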


It could be the consistency-level setting on the Hive table, per this document.
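If that is the cause, the consistency level would be adjusted through the Hive table's TBLPROPERTIES; the exact property name below is an assumption based on DSE's Cassandra storage handler and should be verified against the linked document:

```sql
-- Assumption: the DSE storage handler reads a consistency-level
-- property from TBLPROPERTIES; confirm the property name in the docs.
ALTER TABLE pageviews_v1
    SET TBLPROPERTIES ('cassandra.consistency.level' = 'QUORUM');
```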


Change the Hive query to "select count(*) from pageviews_v1;"