
We are running Hadoop on GCE, using HDFS as the default file system and moving data in/out of GCS. The JobTracker shows high memory and native thread usage.

Hadoop version: 1.2.1; connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1

Observed behavior: the JT accumulates threads in a waiting state, eventually leading to an OOM:

2015-02-06 14:15:51,206 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed: 
java.lang.OutOfMemoryError: unable to create new native thread 
     at java.lang.Thread.start0(Native Method) 
     at java.lang.Thread.start(Thread.java:714) 
     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949) 
     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371) 
     at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.initialize(AbstractGoogleAsyncWriteChannel.java:318) 
     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:275) 
     at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.create(CacheSupplementedGoogleCloudStorage.java:145) 
     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.createInternal(GoogleCloudStorageFileSystem.java:184) 
     at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.create(GoogleCloudStorageFileSystem.java:168) 
     at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:77) 
     at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:655) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452) 
     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:444) 
     at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1860) 
     at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:709) 
     at java.security.AccessController.doPrivileged(Native Method) 
     at javax.security.auth.Subject.doAs(Subject.java:415) 
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190) 
     at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:706) 
     at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3890) 
     at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     at java.lang.Thread.run(Thread.java:745) 

After digging through the JT logs, I found these warnings:

2015-02-06 14:30:17,442 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode xx.xxx.xxx.xxx:50010 
java.io.IOException: Call to /xx.xxx.xxx.xxx:50020 failed on local exception: java.io.IOException: Couldn't set up IO streams 
     at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150) 
     at org.apache.hadoop.ipc.Client.call(Client.java:1118) 
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229) 
     at com.sun.proxy.$Proxy10.getProtocolVersion(Unknown Source) 
     at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422) 
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:414) 
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:392) 
     at org.apache.hadoop.hdfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:201) 
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3317) 
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2783) 
     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987) 
Caused by: java.io.IOException: Couldn't set up IO streams 
     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642) 
     at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:205) 
     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1249) 
     at org.apache.hadoop.ipc.Client.call(Client.java:1093) 
     ... 9 more 
Caused by: java.lang.OutOfMemoryError: unable to create new native thread 
     at java.lang.Thread.start0(Native Method) 
     at java.lang.Thread.start(Thread.java:714) 
     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635) 
     ... 12 more 

This looks similar to the Hadoop bug reported here: https://issues.apache.org/jira/browse/MAPREDUCE-5606

I tried the proposed workaround of disabling the saving of job logs to the output path, and it solved the problem at the cost of the missing logs :)

I also ran jstack against the JT, and it showed hundreds of WAITING or TIMED_WAITING threads like this:

"pool-52-thread-1" prio=10 tid=0x00007feaec581000 nid=0x524f in Object.wait() [0x00007fead39b3000] 
    java.lang.Thread.State: TIMED_WAITING (on object monitor) 
     at java.lang.Object.wait(Native Method) 
     - waiting on <0x000000074d86ba60> (a java.io.PipedInputStream) 
     at java.io.PipedInputStream.read(PipedInputStream.java:327) 
     - locked <0x000000074d86ba60> (a java.io.PipedInputStream) 
     at java.io.PipedInputStream.read(PipedInputStream.java:378) 
     - locked <0x000000074d86ba60> (a java.io.PipedInputStream) 
     at com.google.api.client.util.ByteStreams.read(ByteStreams.java:181) 
     at com.google.api.client.googleapis.media.MediaHttpUploader.setContentAndHeadersOnCurrentRequest(MediaHttpUploader.java:629) 
     at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:409) 
     at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336) 
     at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) 
     at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343) 
     at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460) 
     at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.run(AbstractGoogleAsyncWriteChannel.java:354) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     at java.lang.Thread.run(Thread.java:745) 

    Locked ownable synchronizers: 
     - <0x000000074d864918> (a java.util.concurrent.ThreadPoolExecutor$Worker)

It looks like the JT is struggling to keep up with communication to GCS via the GCS connector.

Please advise.

Thanks!


Do you happen to know where you got that gcs-connector-1.3.0-hadoop1.jar from? Could you verify your gcs-connector version with "hadoop fs -stat gs://foo"? It should print something like "15/02/10 18:16:13 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop1". – 2015-02-10 18:18:28


> hadoop fs -stat gs://zulily 2014-07-01 17:19:42 – ichekrygin 2015-02-10 18:26:14


Also, we are using the gcs-connector installed by bdutil: '-rw-r--r-- 1 root root 4451217 Jun 6 2014 gcs-connector-1.2.6-hadoop1.jar' – ichekrygin 2015-02-10 18:27:42

Answer


Currently, every open FSDataOutputStream in the GCS connector for Hadoop consumes a thread until it is closed, because a separate thread is needed to run the "resumable" HttpRequests while the user of the OutputStream writes bytes intermittently. In most cases (such as in individual Hadoop tasks), there is only a single long-lived output stream, plus possibly a few shorter-lived ones for writing small metadata/marker files.
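
That thread-per-open-stream design can be sketched with plain JDK pipes: a pooled worker blocks in PipedInputStream.read until the writer side closes, which is exactly the frame visible in the jstack above. This is a simplified illustration of the pattern only, not the connector's actual code; the class and method names (PipedUploadSketch, open) are invented:

```java
import java.io.*;
import java.util.concurrent.*;

// Illustrative sketch: each "upload channel" hands a PipedInputStream to a
// background worker thread, which blocks in read() until the writer side
// supplies bytes or closes the stream. While the stream stays open, the
// worker stays alive -- one thread per open output stream.
public class PipedUploadSketch {
    static final ExecutorService POOL = Executors.newCachedThreadPool();

    // "Open a channel": returns the writer end; a pooled thread drains the
    // reader end into the given sink (standing in for the GCS upload).
    static PipedOutputStream open(ByteArrayOutputStream sink, CountDownLatch done) throws IOException {
        PipedInputStream in = new PipedInputStream();
        PipedOutputStream out = new PipedOutputStream(in);
        POOL.submit(() -> {
            try (InputStream src = in) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = src.read(buf)) != -1) {  // blocks here while the stream is open
                    sink.write(buf, 0, n);
                }
            } catch (IOException ignored) {
                // a real uploader would surface this to the caller
            } finally {
                done.countDown();
            }
            return null;
        });
        return out;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CountDownLatch done = new CountDownLatch(1);
        PipedOutputStream out = open(sink, done);
        out.write("job history bytes".getBytes("UTF-8"));
        out.close();                      // closing the stream is what lets the worker exit
        done.await(5, TimeUnit.SECONDS);
        System.out.println(sink.toString("UTF-8"));
        POOL.shutdown();
    }
}
```

Closing the stream is what releases the worker, so hundreds of streams left open translate directly into hundreds of parked native threads, which matches the exhaustion seen in the JT.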

In general, there are two possible causes for the OOM you are running into:

  1. You have lots of queued-up jobs; each submitted job holds an unclosed OutputStream and thus consumes a "waiting" thread. However, since you mention you only need to queue up ~10 jobs, this shouldn't be the root cause.
  2. Something is causing a "leak" of the PrintWriter objects originally created in logSubmitted and added to fileManager. Normally, terminal events like logFinished will correctly close() all the PrintWriters before removing them from the map via markCompleted, but in theory there may be bugs here or there which could cause one of the OutputStreams to leak without being close()'d. For example, while I haven't had a chance to verify this claim, it looks like an IOException while trying to do something like logMetaInfo will "removeWriter" without closing it.
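
Cause #2 boils down to a close-before-forget invariant on the writer map. A minimal sketch of the safe vs. leaky variants follows; the class and method names (WriterRegistry, removeWriterLeaky, removeWriterSafely) are invented for illustration and are not JobHistory's real API:

```java
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Sketch of the leak pattern: if an error path removes a writer from the map
// without close(), the underlying OutputStream (and, in the GCS-connector
// case, its upload thread) is orphaned for the life of the process.
public class WriterRegistry {
    private final Map<String, PrintWriter> writers = new HashMap<>();

    void add(String jobId, PrintWriter w) {
        writers.put(jobId, w);
    }

    // Leaky variant: forgets the writer but never closes it.
    PrintWriter removeWriterLeaky(String jobId) {
        return writers.remove(jobId);
    }

    // Safe variant: always close() before forgetting, even on error paths.
    void removeWriterSafely(String jobId) {
        PrintWriter w = writers.remove(jobId);
        if (w != null) {
            w.close();  // releases the underlying OutputStream and its thread
        }
    }

    boolean contains(String jobId) {
        return writers.containsKey(jobId);
    }
}
```

The fix for this class of bug is simply to make every removal path go through the closing variant (or try/finally), so an IOException mid-logging cannot strand an open stream.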

I've verified that, at least under normal circumstances, the OutputStreams seem to get closed correctly, and my sample JobTracker shows a clean jstack after having successfully run a lot of jobs.

TL;DR: there are some working theories as to why some resource may leak and ultimately prevent necessary threads from being created. In the meantime, you should consider changing hadoop.job.history.user.location to some HDFS location, in order to preserve the job logs without placing them on GCS.
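
For reference, that setting would go into mapred-site.xml on the JobTracker; the HDFS path below is illustrative only, assuming a Hadoop 1.x cluster:

```xml
<!-- mapred-site.xml: keep per-job history on HDFS instead of the GCS output path -->
<property>
  <name>hadoop.job.history.user.location</name>
  <!-- illustrative path; any HDFS directory writable by the job user works -->
  <value>hdfs:///user/history/done</value>
</property>
```

Setting the value to `none` disables user-location job history entirely, which is effectively the workaround you already tried, minus the logs.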