Why is inserting data into my Cassandra database so slow?

This is the query I use to check whether the current data ID already exists in the Cassandra database:

row = session.execute("SELECT * FROM articles where id = %s", [id])

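I consume messages from Kafka and then check whether each message already exists in the Cassandra database. If it does not exist, it should be inserted; if it does exist, it should not be inserted again.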

    messages = consumer.get_messages(count=25)

    if len(messages) == 0:
        print 'IDLE'
        sleep(1)
        continue

    for message in messages:
        try:
            message = json.loads(message.message.value)
            data = message['data']
            if data:
                for article in data:
                    source = article['source']
                    id = article['id']
                    title = article['title']
                    thumbnail = article['thumbnail']
                    #url = article['url']
                    text = article['text']
                    print article['created_at'], type(article['created_at'])
                    created_at = parse(article['created_at'])
                    last_crawled = article['last_crawled']
                    channel = article['channel']  # userid
                    category = article['category']
                    #scheduled_for = created_at.replace(minute=created_at.minute + 5, second=0, microsecond=0)
                    scheduled_for = (datetime.utcnow() + timedelta(minutes=5)).replace(second=0, microsecond=0)
                    row = session.execute("SELECT * FROM articles where id = %s", [id])
                    if len(list(row)) == 0:
                        # id parse base62
                        ids = [id[0:2], id[2:9], id[9:16]]
                        idstr = ''
                        for argv in ids:
                            num = int(argv)
                            idstr = idstr + encode(num)
                        url = 'http://weibo.com/%s/%s?type=comment' % (channel, idstr)
                        session.execute(
                            "INSERT INTO articles(source, id, title,thumbnail, url, text, created_at, last_crawled,channel,category) "
                            "VALUES (%s,%s, %s, %s, %s, %s, %s, %s, %s, %s)",
                            (source, id, title, thumbnail, url, text, created_at, scheduled_for, channel, category))
                        session.execute(
                            "INSERT INTO schedules(source,type,scheduled_for,id) VALUES (%s, %s, %s,%s) USING TTL 86400",
                            (source, 'article', scheduled_for, id))
                        log.info('%s %s %s %s %s %s %s %s %s %s' % (source, id, title, thumbnail, url, text, created_at, scheduled_for, channel, category))

        except Exception, e:
            log.exception(e)
            #log.info('error %s %s' % (message['url'],body))
            print e
            continue
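
The `encode` helper called in the loop is not shown in the question. A minimal sketch of what it might look like, assuming it is a plain base-62 encoder over the alphabet `0-9`, `a-z`, `A-Z`. The names `BASE62_ALPHABET`, `base62_encode`, and `encode_id` are hypothetical, and the alphabet order is an assumption:

```python
# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
BASE62_ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(num):
    """Encode a non-negative integer as a base-62 string."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while num > 0:
        num, remainder = divmod(num, 62)
        digits.append(BASE62_ALPHABET[remainder])
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(digits))


def encode_id(article_id):
    """Split a 16-digit id into the same three chunks as the question's code
    and concatenate the base-62 encoding of each chunk."""
    chunks = [article_id[0:2], article_id[2:9], article_id[9:16]]
    return "".join(base62_encode(int(chunk)) for chunk in chunks)
```

With such a helper, the inner `for argv in ids:` loop would reduce to a single `idstr = encode_id(id)` call.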

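Edit:

I want the table to contain only one row per unique ID. As soon as rows with different scheduled_for times are added for the same unique ID, my system crashes. Adding the `if len(list(row)) == 0:` check is the right idea, but after that my system is very slow.

This is my table definition: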

DROP TABLE IF EXISTS schedules; 

CREATE TABLE schedules (
source text, 
type text, 
scheduled_for timestamp, 
id text, 
PRIMARY KEY (source, type, scheduled_for, id) 
);

The scheduled_for value is variable. Here is a concrete example:

Hao article 2016-01-12 02:09:00+0800 3930462206848285 
Hao article 2016-01-12 03:09:00+0801 3930462206848285 
Hao article 2016-01-12 04:09:00+0802 3930462206848285 
Hao article 2016-01-12 05:09:00+0803 3930462206848285
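
The rows above follow from the table definition: `scheduled_for` is part of the PRIMARY KEY, so an insert with a new `scheduled_for` value addresses a different row instead of overwriting the old one. A minimal sketch of that rule in plain Python, using a dict as a stand-in for the table. This is only a model of the upsert behaviour, not the Cassandra driver, and `insert_schedule` is a hypothetical name:

```python
from datetime import datetime

# Stand-in for the schedules table: maps the full primary key to the row.
# It only models the rule "an INSERT is an upsert on the primary key".
schedules = {}


def insert_schedule(source, type_, scheduled_for, id_):
    # Mirrors PRIMARY KEY (source, type, scheduled_for, id).
    key = (source, type_, scheduled_for, id_)
    schedules[key] = {
        "source": source,
        "type": type_,
        "scheduled_for": scheduled_for,
        "id": id_,
    }


article_id = "3930462206848285"
insert_schedule("Hao", "article", datetime(2016, 1, 12, 2, 9), article_id)
# Same primary key: the existing row is overwritten, no new row appears.
insert_schedule("Hao", "article", datetime(2016, 1, 12, 2, 9), article_id)
# Different scheduled_for: a different primary key, so a second row appears.
insert_schedule("Hao", "article", datetime(2016, 1, 12, 3, 9), article_id)

print(len(schedules))  # 2
```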

Thanks for your replies!

Source

2016-01-11 peter

Considering that writes are cheap and reads may not be, I don't think the kind of optimization you are attempting makes sense. – Ralf

@Ralf OK, so what would you suggest? Thanks for your reply! – peter

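Just insert the record again? Or at least don't SELECT * from the table; select only the ID. That saves you some network bandwidth. (I think Cassandra still loads the whole row; maybe someone can comment on that.) Depending on your application, selecting every row before inserting has the additional drawback of diluting Cassandra's cache, which lowers read performance for your users. – Ralf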
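
Both suggestions in this comment can be sketched as follows. This is a minimal sketch, assuming `session` is a cassandra-driver `Session` (or any object with a compatible `execute(query, params)` method) and that `id` alone is the primary key of `articles`, which the question does not show. The function names and the `INSERT_ARTICLE` constant are hypothetical:

```python
INSERT_ARTICLE = (
    "INSERT INTO articles (source, id, title, thumbnail, url, text, "
    "created_at, last_crawled, channel, category) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
)


def article_exists(session, article_id):
    """Existence check that fetches only the id column instead of SELECT *."""
    rows = session.execute(
        "SELECT id FROM articles WHERE id = %s", [article_id]
    )
    # Any returned row means the id is already present.
    for _ in rows:
        return True
    return False


def upsert_article(session, article_row):
    """Write without reading first. If id is the primary key (an assumption),
    a repeated insert overwrites the same row rather than adding a new one."""
    session.execute(INSERT_ARTICLE, article_row)
```

Note that writing blindly only avoids duplicates when the primary key is identical on every write. For the `schedules` table in the question, `scheduled_for` is part of the key, so a blind re-insert with a new `scheduled_for` would still create an extra row.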

Why not use INSERT ... IF NOT EXISTS?

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html
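
A minimal sketch of how this suggestion could replace the SELECT in the question's loop, assuming `session` is a cassandra-driver `Session` and that the returned result exposes the driver's `was_applied` property for conditional statements. The function and constant names are hypothetical:

```python
INSERT_ARTICLE_IF_NEW = (
    "INSERT INTO articles (source, id, title, thumbnail, url, text, "
    "created_at, last_crawled, channel, category) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) IF NOT EXISTS"
)

INSERT_SCHEDULE = (
    "INSERT INTO schedules (source, type, scheduled_for, id) "
    "VALUES (%s, %s, %s, %s) USING TTL 86400"
)


def insert_if_new(session, article_row, schedule_row):
    """Insert the article only if its primary key is not present yet, and
    schedule it only when the article insert was actually applied."""
    result = session.execute(INSERT_ARTICLE_IF_NEW, article_row)
    if not result.was_applied:
        # The article already existed; skip the schedules insert as well.
        return False
    session.execute(INSERT_SCHEDULE, schedule_row)
    return True
```

This removes the separate read from the application code, but the conditional insert is a lightweight transaction, so it has its own cost on the server side.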

Source

2016-01-11 08:59:07

Note that "IF NOT EXISTS" also comes with a performance penalty. – Ralf

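I completely agree, this is a typical "read before write" case, but at least it is easier from the application's point of view, and perhaps better optimized. –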

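I think it is worse than "just" a read before write. For IF NOT EXISTS, Cassandra has to ensure [consistency across the whole cluster](https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_tunable_consistency_c.html). If you have configured your cluster for eventual consistency, you lose all the performance benefits of that setting. But I don't think IF NOT EXISTS disturbs the cache contents. – Ralf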
