我該如何調試爲什麼在GC期間停止線程需要這麼長時間？

我運行一個卡桑德拉實例與我該如何調試爲什麼在GC期間停止線程需要這麼長時間？

$ java -version 
java version "1.8.0_121" 
Java(TM) SE Runtime Environment (build 1.8.0_121-b34) 
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b34, mixed mode)

和下面的垃圾收集選項

-Xms20000m 
-Xmx20000m 
-Xmn10000m 
-Xss256k 
-XX:+UseG1GC 
-XX:MaxGCPauseMillis=1000 
-XX:GCTimeRatio=97 
-XX:MaxTenuringThreshold=2 
-XX:+UseTLAB 
-XX:+ResizeTLAB 
-XX:+PerfDisableSharedMem 
-XX:+PrintGCDetails 
-XX:+PrintGCDateStamps 
-XX:+PrintHeapAtGC 
-XX:+PrintGCApplicationStoppedTime 
-XX:+PrintAdaptiveSizePolicy 
-Xloggc:/dev/shm/cassandra-gc.log

在垃圾收集週期，JVM記錄以下

2017-08-15T17:09:08.506+0000: 71712.670: [GC pause (G1 Evacuation Pause) (young) 
Desired survivor size 658505728 bytes, new threshold 2 (max 2) 
- age 1: 26818496 bytes, 26818496 total 
- age 2: 14940744 bytes, 41759240 total 
71712.670: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 222556, predicted base time: 50.67 ms, remaining time: 949.33 ms, target pause time: 1000.00 ms] 
71712.670: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1242 regions, survivors: 8 regions, predicted young region time: 12.00 ms] 
71712.670: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1242 regions, survivors: 8 regions, old: 0 regions, predicted pause time: 62.68 ms, target pause time: 1000.00 ms] 
71712.736: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 1.39 %, threshold: 1.02 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)] 
, 0.0662411 secs] 
    [Parallel Time: 57.5 ms, GC Workers: 33] 
     [GC Worker Start (ms): Min: 71712670.9, Avg: 71712671.2, Max: 71712671.4, Diff: 0.4] 
     [Ext Root Scanning (ms): Min: 6.4, Avg: 7.0, Max: 10.2, Diff: 3.8, Sum: 232.3] 
     [Update RS (ms): Min: 34.0, Avg: 37.0, Max: 38.9, Diff: 4.9, Sum: 1221.2] 
     [Processed Buffers: Min: 27, Avg: 48.5, Max: 66, Diff: 39, Sum: 1599] 
     [Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum: 5.1] 
     [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1] 
     [Object Copy (ms): Min: 11.3, Avg: 12.6, Max: 12.9, Diff: 1.7, Sum: 415.0] 
     [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.4] 
     [Termination Attempts: Min: 1, Avg: 2.1, Max: 4, Diff: 3, Sum: 68] 
     [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 4.5] 
     [GC Worker Total (ms): Min: 56.7, Avg: 56.9, Max: 57.3, Diff: 0.7, Sum: 1878.5] 
     [GC Worker End (ms): Min: 71712728.0, Avg: 71712728.1, Max: 71712728.3, Diff: 0.3] 
    [Code Root Fixup: 0.5 ms] 
    [Code Root Purge: 0.0 ms] 
    [Clear CT: 1.1 ms] 
    [Other: 7.1 ms] 
     [Choose CSet: 0.0 ms] 
     [Ref Proc: 2.7 ms] 
     [Ref Enq: 0.0 ms] 
     [Redirty Cards: 0.8 ms] 
     [Humongous Register: 0.1 ms] 
     [Humongous Reclaim: 0.0 ms] 
     [Free CSet: 1.5 ms] 
    [Eden: 9936.0M(9936.0M)->0.0B(9920.0M) Survivors: 64.0M->80.0M Heap: 18.2G(19.5G)->8718.4M(19.5G)] 
Heap after GC invocations=8876 (full 0): 
garbage-first heap total 20480000K, used 8927594K [0x00000002de000000, 0x00000002de804e20, 0x00000007c0000000) 
    region size 8192K, 10 young (81920K), 10 survivors (81920K) 
Metaspace  used 44399K, capacity 44978K, committed 45184K, reserved 1089536K 
    class space used 4591K, capacity 4729K, committed 4760K, reserved 1048576K 
} 
[Times: user=0.88 sys=0.24, real=0.07 secs] 
2017-08-15T17:09:08.573+0000: 71712.737: Total time for which application threads were stopped: 4.1542217 seconds, Stopping threads took: 4.0814458 seconds

我我會提請注意最後一行

針對應用程序的線程被停止的總時間：4.1542217秒，停止線了：4.0814458秒

我無法從任何的印刷GC細節知道爲什麼停止線程花了這麼長時間。

其他（長ISH）暫停有平凡低線程停止時間

2017-08-15T17:09:26.646+0000: 71730.810: Total time for which application threads were stopped: 0.1165019 seconds, Stopping threads took: 0.0005363 seconds 
2017-08-15T17:10:36.982+0000: 71801.146: Total time for which application threads were stopped: 0.1444364 seconds, Stopping threads took: 0.0006902 seconds 
2017-08-15T17:10:42.091+0000: 71806.255: Total time for which application threads were stopped: 0.2288663 seconds, Stopping threads took: 0.0064196 seconds 
2017-08-15T17:10:52.601+0000: 71816.765: Total time for which application threads were stopped: 0.2781895 seconds, Stopping threads took: 0.0005429 seconds 
2017-08-15T17:10:56.992+0000: 71821.156: Total time for which application threads were stopped: 0.1380046 seconds, Stopping threads took: 0.0006240 seconds

應用程序有至少1000個線程，但它沒有任何意義，我認爲阻止他們將採取GC事件的98％持續時間。

我首先假定它是用了一個線程到達一個還原點的時間，所以我嘗試添加

-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500

但事件記錄沒有與長GC事件一致（除非線程都需要499ms才能達到他們的安全點...）

還有什麼可以延遲線程停止時間？

請注意，無論使用CMS還是G1以及我是否使用write the GC logs to disk (SSD) or to memory，都會出現這些暫停。

來源

2017-08-15 Savior

@ the8472謝謝。這些統計數據顯示了很長的「同步」時間，這讓我認爲我的問題是[this]（http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057003）。 – Savior

您可以添加-XX:+PrintSafepointStatistics–XX:PrintSafepointStatisticsCount=1以啓用對安全點的詳細診斷。

-XX:+PrintJNIGCStalls是另一個診斷option，可能會解決一些問題。

JNIGCStalls不計入SafepointTimeout。

其他原因一般可能超載系統。我建議你檢查「負荷平均值」中的峯值是否與異常的GC相關。儘管「負荷平均」指標本身具有quirks。

如果你在虛擬化環境中，零星的系統凍結可能是一種常態。這種凍結還可以被認爲是「負載平均」OS度量中的尖峯。

來源

2017-08-16 01:29:56

我該如何調試爲什麼在GC期間停止線程需要這麼長時間？

回答

相關問題