Cassandra nodetool status is inconsistent across nodes with too many pending compaction tasks

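I have a Cassandra 2.0.6 cluster with four nodes. The cluster is suffering from an inconsistency problem. I ran nodetool status on each node to check its state, and the results disagree with each other. The status command also runs very slowly. Below is the command output from each node.

The nodes with IPs 192.168.148.181 and 192.168.148.121 are the seed nodes. Repair has never been run on the cluster.

In addition, CPU usage on 181 and 121 is very high, and the logs show that CMS GC runs very frequently on these nodes. I disconnected all clients, so there is no read or write load, yet the inconsistency and the heavy GC persist.

So how can I debug and optimize this cluster?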

[[email protected] apache-cassandra-2.0.16]$ time bin/nodetool status 
Note: Ownership information does not include topology; for complete information, specify a keyspace 
Datacenter: DC1 
=============== 
Status=Up/Down 
|/ State=Normal/Leaving/Joining/Moving 
-- Address   Load  Tokens Owns Host ID        Rack 
UN 192.168.148.121 10.86 GB 1  25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2 
UN 192.168.148.181 10.53 GB 1  25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1 
DN 192.168.148.182 10.95 GB 1  25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4 
UN 192.168.148.221 10.49 GB 1  25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3 

real 8m50.506s 
user 39m48.718s 
sys  76m48.566s 
-------------------------------------------------------------------------------- 
[[email protected] apache-cassandra-2.0.16]$ time bin/nodetool status 
Note: Ownership information does not include topology; for complete information, specify a keyspace 
Datacenter: DC1 
=============== 
Status=Up/Down 
|/ State=Normal/Leaving/Joining/Moving 
-- Address   Load  Tokens Owns Host ID        Rack 
DN 192.168.148.121 10.86 GB 1  25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2 
UN 192.168.148.181 10.53 GB 1  25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1 
DN 192.168.148.182 10.95 GB 1  25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4 
UN 192.168.148.221 10.49 GB 1  25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3 

real 0m15.075s 
user 0m1.606s 
sys  0m0.393s 
---------------------------------------------------------------------- 
[[email protected] apache-cassandra-2.0.16]$ time bin/nodetool status 
Note: Ownership information does not include topology; for complete information, specify a keyspace 
Datacenter: DC1 
=============== 
Status=Up/Down 
|/ State=Normal/Leaving/Joining/Moving 
-- Address   Load  Tokens Owns Host ID        Rack 
DN 192.168.148.121 10.86 GB 1  25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2 
UN 192.168.148.181 10.53 GB 1  25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1 
UN 192.168.148.182 10.95 GB 1  25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4 
UN 192.168.148.221 10.49 GB 1  25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3 

real 0m25.719s 
user 0m2.152s 
sys  0m1.228s 
------------------------------------------------------------------------- 

[[email protected] apache-cassandra-2.0.16]$ time bin/nodetool status 
Note: Ownership information does not include topology; for complete information, specify a keyspace 
Datacenter: DC1 
=============== 
Status=Up/Down 
|/ State=Normal/Leaving/Joining/Moving 
-- Address   Load  Tokens Owns Host ID        Rack 
DN 192.168.148.121 10.86 GB 1  25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2 
DN 192.168.148.181 10.53 GB 1  25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1 
UN 192.168.148.182 10.95 GB 1  25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4 
DN 192.168.148.221 10.49 GB 1  25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3 

real 0m17.581s 
user 0m1.843s 
sys  0m1.632s

I printed the details of the objects on the heap:

num  #instances   #bytes class name 
---------------------------------------------- 
    1:  58584535  1874705120 java.util.concurrent.FutureTask 
    2:  58585802  1406059248 java.util.concurrent.Executors$RunnableAdapter 
    3:  58584601  1406030424 java.util.concurrent.LinkedBlockingQueue$Node 
    4:  58584534  1406028816 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask 
    5:  214682  24087416 [B 
    6:  217294  10430112 java.nio.HeapByteBuffer 
    7:   37591  5977528 [C 
    8:   41843  5676048 <constMethodKlass> 
    9:   41843  5366192 <methodKlass> 
    10:   4126  4606080 <constantPoolKlass> 
    11:  100060  4002400 org.apache.cassandra.io.sstable.IndexHelper$IndexInfo 
    12:   4126  2832176 <instanceKlassKlass> 
    13:   4880  2686216 [J 
    14:   3619  2678784 <constantPoolCacheKlass>
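
As a rough sanity check on these numbers, the top rows of the histogram can be summed with a short Python sketch. The rows are copied from the output above; the names `HISTOGRAM`, `parse_histogram` and `top_bytes` are hypothetical helpers introduced only for illustration.

```python
# First five rows copied from the class histogram above.
HISTOGRAM = """\
1:  58584535  1874705120 java.util.concurrent.FutureTask
2:  58585802  1406059248 java.util.concurrent.Executors$RunnableAdapter
3:  58584601  1406030424 java.util.concurrent.LinkedBlockingQueue$Node
4:  58584534  1406028816 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask
5:  214682  24087416 [B
"""


def parse_histogram(text):
    """Return a list of (instances, bytes, class_name) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: "<rank>: <instances> <bytes> <class name>"
        if len(parts) < 4 or not parts[0].endswith(":"):
            continue
        rows.append((int(parts[1]), int(parts[2]), parts[3]))
    return rows


def top_bytes(rows, n):
    """Total bytes held by the first n rows."""
    return sum(size for _, size, _ in rows[:n])


rows = parse_histogram(HISTOGRAM)
total = top_bytes(rows, 4)
print(f"top 4 classes: {total} bytes ({total / 2**30:.2f} GiB)")
```

The four classes tied to queued background compaction tasks add up to about 5.7 GiB, roughly 104 bytes per queued task, which dwarfs every other entry in the histogram.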

I used nodetool cfstats on one node and found that a large number of compaction tasks had accumulated within 3 days (I restarted the cluster 4 days ago).

[[email protected] apache-cassandra-2.0.16]$ bin/nodetool compactionstats 
pending tasks: 64642341 
Active compaction remaining time :  n/a

I checked compactionhistory. Here is part of the result. It shows many records related to the system keyspace.

Compaction History: 
id          keyspace_name  columnfamily_name   compacted_at    bytes_in  bytes_out  rows_merged 
8e4f8830-b04f-11e5-a211-45b7aa88107c  system    sstable_activity    1451629144115    3342   915   {4:23} 
96a6fcb0-b04b-11e5-a211-45b7aa88107c  system    hints      145162744{1:1} 
7c42c940-adac-11e5-8bd4-45b7aa88107c  system    hints      1451339203540    56969835  56782732  {2:3} 
585b97a0-ad98-11e5-8bd4-45b7aa88107c  system    sstable_activity    1451330553370    3700   956   {4:24} 
aefc3f10-b1b2-11e5-a211-45b7aa88107c  system    sstable_activity    1451781670273    3201   906   {4:23} 
1e76f1b0-b180-11e5-a211-45b7aa88107c  system    sstable_activity    1451759952971    3303   700   {4:23} 
e7b75b70-aec2-11e5-8bd4-45b7aa88107c  system    hints      1451458783911    57690316  57497847  {2:3} 
ad102280-af6d-11e5-b1dc-45b7aa88107c  webtrn_study_log_formallySCORM_STU_COURSE    1451532129448    45671877  41137664  {1:11, 3:1, 4:8} 
60906970-aec7-11e5-8bd4-45b7aa88107c  system    sstable_activity    1451460704647    3751   974   {4:25} 
88aed310-ad91-11e5-8bd4-45b7aa88107c  system    hints      1451327627969    56984347  56765328  {2:3} 
3ad14f00-af6d-11e5-b1dc-45b7aa88107c  webtrn_study_log_formallySCORM_STU_COURSE    1451531937776    46696097  38827028  {1:8, 3:2, 4:9} 
84df8fb0-b00f-11e5-a211-45b7aa88107c  system    hints      1451601640491    18970740  18970740  {1:1} 
657482e0-ad33-11e5-8bd4-45b7aa88107c  system    sstable_activity    1451287196174    3701   931   {4:24} 
9cc8af70-b24a-11e5-a211-45b7aa88107c  system    sstable_activity    1451846923239    3134   773   {4:23} 
dcbe5e30-afd0-11e5-a211-45b7aa88107c  system    sstable_activity    1451574729619    3357   790   {4:23} 
b285ced0-afa0-11e5-84e3-45b7aa88107c  system    hints      1451554042941    43310718  42137761  {1:1, 2:2} 
119770e0-ad4e-11e5-8bd4-45b7aa88107c  system    hints      1451298651886    57397441  57190519  {2:3} 
f1bb37a0-b204-11e5-a211-45b7aa88107c  system    hints      1451817000986    17713746

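I tried to flush the nodes with high GC, but the command failed with a read timeout.

The cluster only receives data to insert. I shut off client writes and restarted the cluster during these 3 days. Compaction tasks are still accumulating.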

2015-12-31 chenatu

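The inconsistency in the nodetool status output is nothing to worry about. It is a consequence of the heavy GC. During a GC pause, the other nodes consider the paused node down via gossip. So when you have a lot of GC, nodes flip quickly between DN and UN.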
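
To see which nodes are flapping, the status views from different runs can be compared. The following is a minimal sketch, assuming the outputs have been captured as text; the helper names and the `run1`/`run2` labels are hypothetical, and the rows are abbreviated from the first two outputs in the question.

```python
import re

# Matches data rows such as "UN 192.168.148.121 10.86 GB ..."
ROW = re.compile(r"^([UD][NLJM])\s+(\d+\.\d+\.\d+\.\d+)\s")


def parse_status(text):
    """Return {address: state} parsed from `nodetool status` output."""
    view = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            view[match.group(2)] = match.group(1)
    return view


def disagreements(views):
    """Given {label: {address: state}}, return addresses seen in more than one state."""
    seen = {}
    for view in views.values():
        for address, state in view.items():
            seen.setdefault(address, set()).add(state)
    return {addr: states for addr, states in seen.items() if len(states) > 1}


RUN1 = """\
UN 192.168.148.121 10.86 GB 1  25.0% 1d9ba597-c404-481f-af2b-436493c57227 RAC2
UN 192.168.148.181 10.53 GB 1  25.0% 5d90300f-2fb4-4065-9819-10ece285223d RAC1
DN 192.168.148.182 10.95 GB 1  25.0% bcb550df-9429-4cae-9fd2-0bfeea9a5649 RAC4
UN 192.168.148.221 10.49 GB 1  25.0% 6867f8b4-1f54-48fc-aaae-da71bc251970 RAC3
"""
RUN2 = RUN1.replace("UN 192.168.148.121", "DN 192.168.148.121")

views = {"run1": parse_status(RUN1), "run2": parse_status(RUN2)}
conflicts = disagreements(views)
print(conflicts)
```

An address that shows up with both UN and DN across runs is one whose liveness the observers disagree on, which fits the GC-pause explanation rather than a real outage.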

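You need to find out what is taking up so much space in the Java heap:

Do you see any StatusLogger entries in the Cassandra logs?
Using nodetool cfstats, do you see any system.hints? Hints are mutations that the coordinator defers and replays later, when load is lower. If your cluster has accumulated a lot of hints, they put pressure on the heap and cause GC.
Are there any compactions? Check with nodetool compactionstats.
Have you flushed all column families to cool down your cluster? Run nodetool flush on all nodes for each keyspace and column family.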
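
The compaction check above can be scripted. This is only a sketch: `run_nodetool` and `pending_tasks` are hypothetical helpers, the relative `bin/nodetool` path is an assumption taken from the prompts in the question, and the sample text is the `compactionstats` output quoted there.

```python
import re
import subprocess


def run_nodetool(*args, nodetool="bin/nodetool"):
    """Run nodetool with the given arguments and return its stdout.

    The default path is an assumption; adjust it to the local install.
    """
    result = subprocess.run(
        [nodetool, *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def pending_tasks(compactionstats_output):
    """Extract the 'pending tasks' count, or None if the line is missing."""
    match = re.search(r"pending tasks:\s*(\d+)", compactionstats_output)
    return int(match.group(1)) if match else None


# Sample output copied from the question; on a live node one could use
# run_nodetool("compactionstats") instead.
SAMPLE = """pending tasks: 64642341
Active compaction remaining time :  n/a
"""

count = pending_tasks(SAMPLE)
print(f"pending compaction tasks: {count}")
```

Recording this count at intervals shows whether the backlog is still growing while client writes are off.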

2015-12-31 11:01:23 DineMartine

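Hi, I have updated the problem description. It shows a large number of outstanding compaction tasks. – chenatu

The first thing I see is the presence of hints in the system keyspace. They may be the cause or a consequence of your trouble. Disable hinted handoff to get a clearer picture: set 'hinted_handoff_enabled: false' in the cassandra.yaml file and try again. If the problem persists, you should look at the compaction parameters. What is the output of 'nodetool getcompactionthroughput'? – DineMartine

Read [this document](https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_configure_compaction_t.html) to help you configure compaction. – DineMartine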
