
On Dataproc I have set up a PySpark cluster with 1 master node and 2 workers. In a bucket I have a directory containing subdirectories of files. The Dataproc PySpark workers do not have permission to use gsutil.

In a Datalab notebook I run

import subprocess
all_parent_directory = subprocess.Popen("gsutil ls gs://parent-directories", shell=True, stdout=subprocess.PIPE).stdout.read()

which gives me all the subdirectories without any problem.

Then I want to gsutil ls the files in all the subdirectories, so on the master node I define:

def get_sub_dir(path):
    import subprocess
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path, shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()

Running get_sub_dir(sub-directory) gives all the files without any problem.

However,

sub_dir = sc.parallelize([sub-directory]) 
sub_dir.map(get_sub_dir).collect() 

gives me:

Traceback (most recent call last): 
    File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 99, in <module> 
    main() 
    File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 30, in main 
    project, account = bootstrapping.GetActiveProjectAndAccount() 
    File "/usr/lib/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 205, in GetActiveProjectAndAccount 
    project_name = properties.VALUES.core.project.Get(validate=False) 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1373, in Get 
    required) 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1661, in _GetProperty 
    value = _GetPropertyWithoutDefault(prop, properties_file) 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1699, in _GetPropertyWithoutDefault 
    value = callback() 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 222, in GetProject 
    return c_gce.Metadata().Project() 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 203, in Metadata 
    _metadata_lock.lock(function=_CreateMetadata, argument=None) 
    File "/usr/lib/python2.7/mutex.py", line 44, in lock 
    function(argument) 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 202, in _CreateMetadata 
    _metadata = _GCEMetadata() 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 59, in __init__ 
    self.connected = gce_cache.GetOnGCE() 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 141, in GetOnGCE 
    return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age) 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 81, in GetOnGCE 
    self._WriteDisk(on_gce) 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 113, in _WriteDisk 
    with files.OpenForWritingPrivate(gce_cache_path) as gcecache_file: 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 715, in OpenForWritingPrivate 
    MakeDir(full_parent_dir_path, mode=0700) 
    File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 115, in MakeDir 
    (u'Please verify that you have permissions to write to the parent ' 
googlecloudsdk.core.util.files.Error: Could not create directory [/home/.config/gcloud]: Permission denied. 

Please verify that you have permissions to write to the parent directory. 

After checking with whoami on a worker node, it shows yarn.

So the question is: how do I authorize the yarn user to use gsutil, or is there any other way to access the bucket from the Dataproc PySpark worker nodes?


In your SO question, are you writing 'gcloud ls gs://' rather than 'gsutil ls gs://' inside the 'get_sub_dir' function? –


Thanks, that was a typo and has been updated. The problem remains. – Sun

Answer


The CLI looks at the current home directory for where to place a cached credential file when it fetches a token from the metadata service. The relevant code in googlecloudsdk/core/config.py looks like this:

def _GetGlobalConfigDir():
  """Returns the path to the user's global config area.

  Returns:
    str: The path to the user's global config area.
  """
  # Name of the directory that roots a cloud SDK workspace.
  global_config_dir = encoding.GetEncodedValue(os.environ, CLOUDSDK_CONFIG)
  if global_config_dir:
    return global_config_dir
  if platforms.OperatingSystem.Current() != platforms.OperatingSystem.WINDOWS:
    return os.path.join(os.path.expanduser('~'), '.config',
                        _CLOUDSDK_GLOBAL_CONFIG_DIR_NAME)

For things running inside YARN containers, even though they are run as the user yarn, YARN actually propagates yarn.nodemanager.user-home-dir as the container's home directory, and this defaults to /home/. This is true even though simply running sudo su yarn on a Dataproc node shows ~ resolving to /var/lib/hadoop-yarn. For this reason, even though you can sudo -u yarn gsutil ..., it does not behave the same way gsutil does inside a YARN container, and naturally only root is able to create directories under the base /home/ directory.
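The home-directory lookup in the SDK snippet above ultimately goes through os.path.expanduser('~'), which on POSIX systems follows the HOME environment variable. A minimal illustration of why overriding HOME changes where gsutil tries to write its config cache (the /var/lib/hadoop-yarn path is the Dataproc yarn home mentioned above):

```python
import os

# os.path.expanduser('~') on POSIX resolves through the HOME environment
# variable, which is what the SDK's _GetGlobalConfigDir above relies on.
os.environ['HOME'] = '/var/lib/hadoop-yarn'
config_dir = os.path.join(os.path.expanduser('~'), '.config', 'gcloud')
print(config_dir)  # /var/lib/hadoop-yarn/.config/gcloud
```

With HOME pointing at a directory the yarn user owns, the SDK can create its cache directory instead of failing on /home/.config/gcloud as in the traceback.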

Long story short, you have two options:

  1. In your code, add HOME=/var/lib/hadoop-yarn just before your gsutil statement.

     Example:

     p = subprocess.Popen("HOME=/var/lib/hadoop-yarn gsutil ls gs://parent-directories/" + path, shell=True,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  2. When creating your cluster, specify the YARN property.

     Example:

     gcloud dataproc clusters create --properties yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn ...
    

    For an existing cluster you could also manually add the configuration to /etc/hadoop/conf/yarn-site.xml on all your workers and then reboot the worker machines (or just run sudo systemctl restart hadoop-yarn-nodemanager.service), but having to run that manually on every worker node can be a pain.
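As a variant of option 1, the asker's get_sub_dir could set HOME via subprocess's env argument instead of a shell prefix; a sketch, assuming the same /var/lib/hadoop-yarn home directory as above:

```python
import os
import subprocess

def get_sub_dir(path):
    # Copy the worker's environment and point HOME at a directory the
    # yarn user can write to, so gsutil can create its gcloud config cache.
    env = dict(os.environ, HOME="/var/lib/hadoop-yarn")
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path,
                         shell=True, env=env,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()
```

This keeps the gsutil command itself unchanged, which avoids quoting issues if the command grows more complex than a single prefix assignment.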