2016-03-09 155 views
2

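Java DNS resolution hangs forever

I am using the Curator framework to connect to a ZooKeeper server, but I am running into a strange DNS resolution problem. Here is the affected thread from the jstack dump: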

#21 prio=5 os_prio=0 tid=0x0000000001888800 nid=0x3a46 runnable [0x00007f25e69f3000] 
java.lang.Thread.State: RUNNABLE 
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) 
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928) 
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323) 
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276) 
    at java.net.InetAddress.getAllByName(InetAddress.java:1192) 
    at java.net.InetAddress.getAllByName(InetAddress.java:1126) 
    at org.apache.zookeeper.client.StaticHostProvider.resolveAndShuffle(StaticHostProvider.java:117) 
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:81) 
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1096) 
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1006) 
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:804) 
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:679) 
    at com.netflix.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:72) 
    - locked <0x00000000fd761f40> (a com.netflix.curator.HandleHolder$1) 
    at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:46) 
    at com.netflix.curator.ConnectionState.reset(ConnectionState.java:122) 
    at com.netflix.curator.ConnectionState.start(ConnectionState.java:95) 
    at com.netflix.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:137) 
    at com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:167) 

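The thread appears to be stuck in the native method and never returns. It also happens very randomly, so I have not been able to reproduce it. Any ideas?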
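Because the hang is random, it can help to check for it from inside the JVM rather than waiting for a manual jstack. The following is only a minimal sketch: the class name `StuckLookupDetector` and its methods are hypothetical, and the frame it looks for (`lookupAllHostAddr` in a `java.net` class) is taken from the dump above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Curator or ZooKeeper.
class StuckLookupDetector {

    // True if the stack contains a java.net.*.lookupAllHostAddr frame,
    // i.e. the thread is inside the native DNS lookup seen in the dump.
    static boolean isInDnsLookup(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            if (frame.getClassName().startsWith("java.net.")
                    && frame.getMethodName().equals("lookupAllHostAddr")) {
                return true;
            }
        }
        return false;
    }

    // Names of all live threads that are inside a DNS lookup right now.
    static List<String> threadsInDnsLookup() {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            if (isInDnsLookup(e.getValue())) {
                names.add(e.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Threads in a DNS lookup: " + threadsInDnsLookup());
    }
}
```

A single snapshot cannot tell a hung lookup from one that is merely in progress, so a thread should probably only be treated as stuck if it shows up in several snapshots taken some seconds apart.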

+0

Err, fix your DNS? – EJP

+0

Not sure whether it is a DNS problem. –

+0

Check this one: http://stackoverflow.com/questions/1608503/domain-name-resolution-not-working-in-java-applications-on-ubuntu64-9-04-machine –

Answers

3

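We have been trying to solve this problem as well. It looks like it is caused by a glibc bug: https://bugzilla.kernel.org/show_bug.cgi?id=99671 or a kernel bug: https://bugzilla.redhat.com/show_bug.cgi?id=1209433 depending on who you ask ;)

Also worth reading: https://access.redhat.com/security/cve/cve-2013-7423 and https://alas.aws.amazon.com/ALAS-2015-617.html

To confirm that this is indeed the case, attach gdb to the Java process: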

gdb --pid <JavaProcessPid> 
Then, from within gdb:

info threads 

Find a thread that is sitting in recvmsg and switch to it:

thread <HangingThreadId> 

Then:

backtrace 

If you see something like the following, you know that a glibc/kernel upgrade will help:

#0 0x00007fc726ff27cd in recvmsg() from /lib64/libc.so.6 
#1 0x00007fc727018765 in make_request() from /lib64/libc.so.6 
#2 0x00007fc727018b9a in __check_pf() from /lib64/libc.so.6 
#3 0x00007fc726fdbd57 in getaddrinfo() from /lib64/libc.so.6 
#4 0x00007fc706dd9635 in Java_java_net_Inet6AddressImpl_lookupAllHostAddr() from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-0.b17.el6_7.x86_64/jre/lib/amd64/libnet.so 
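If there are many threads to go through, the check can be automated on saved gdb output. This is a rough sketch only: `NetlinkHangCheck` is a hypothetical name, and the frame sequence it looks for (recvmsg, make_request, __check_pf, getaddrinfo, innermost first) is simply the one shown in the backtrace above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that scans the text of one gdb backtrace.
class NetlinkHangCheck {

    // Frames from the backtrace above, innermost first.
    static final List<String> SIGNATURE =
            Arrays.asList("recvmsg", "make_request", "__check_pf", "getaddrinfo");

    // True if the backtrace contains all signature frames in that order.
    // Other frames in between are tolerated.
    static boolean matches(String backtrace) {
        int next = 0;
        for (String line : backtrace.split("\\R")) {
            // gdb may print "recvmsg ()" or "recvmsg()", so normalise first.
            String normalised = line.replace(" (", "(");
            if (next < SIGNATURE.size()
                    && normalised.contains(" in " + SIGNATURE.get(next) + "(")) {
                next++;
            }
        }
        return next == SIGNATURE.size();
    }

    public static void main(String[] args) {
        String sample =
                "#0 0x00007fc726ff27cd in recvmsg() from /lib64/libc.so.6\n"
              + "#1 0x00007fc727018765 in make_request() from /lib64/libc.so.6\n"
              + "#2 0x00007fc727018b9a in __check_pf() from /lib64/libc.so.6\n"
              + "#3 0x00007fc726fdbd57 in getaddrinfo() from /lib64/libc.so.6\n";
        System.out.println("Looks like the netlink hang: " + matches(sample));
    }
}
```

A match only says that the thread is waiting on the netlink socket inside getaddrinfo at that moment; as with the manual check, the thread has to stay there to count as hung.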

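Update: it looks like the kernel wins. See this thread for details: http://www.gossamer-threads.com/lists/linux/kernel/2264958

There is also a simple program you can use to check whether your system is affected by the kernel issue: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

To run the check: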

curl -o pf_dump.c https://gist.githubusercontent.com/stevenschlansker/6ad46c5ccb22bc4f3473/raw/22cfe72f6708de1e3468c1e0fa3888aafae42db4/pf_dump.c 
gcc pf_dump.c -pthread -o pf_dump 
./pf_dump 

If the output is:

[26170] glibc: check_pf: netlink socket read timeout 
Aborted 

then the system is affected. If the output looks like this:

exit success [7618] exit success [7265] exit success 

then the system is OK. In the AWS context, upgrading to the AMI with the new kernel (2016.3.2) seems to have resolved the problem.
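To run this check across several hosts, the interpretation of the pf_dump output can be scripted. The sketch below is an assumption-laden illustration: `PfDumpCheck` is a hypothetical name, and it keys only on the "netlink socket read timeout" line quoted above, treating any other output as not affected.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.stream.Collectors;

// Hypothetical helper that classifies captured pf_dump output.
class PfDumpCheck {

    // Marker taken from the "affected" output shown above.
    static final String MARKER = "check_pf: netlink socket read timeout";

    // True if the captured output contains the timeout message.
    static boolean isAffected(String output) {
        return output.contains(MARKER);
    }

    public static void main(String[] args) {
        // Reads the pf_dump output from standard input.
        String output = new BufferedReader(new InputStreamReader(System.in))
                .lines()
                .collect(Collectors.joining("\n"));
        System.out.println(isAffected(output)
                ? "System appears to be affected"
                : "No netlink read timeout seen");
    }
}
```

It could be used as `./pf_dump 2>&1 | java PfDumpCheck`, merging stderr into the pipe in case the message is written there.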

+0

Please don't write link-only answers. Post it as a comment, or include the essential parts in the text. –

+0

Yes, the glibc upgrade fixed the problem! I forgot to update this thread. –

+0

Thanks @Jacek Tomaka. I think 'curl -O' should be 'curl -o'. – seanf