
Q: How to save a file to Hadoop using Python

I am starting to learn Hadoop, but I need to save a lot of files into it using Python. I can't seem to figure out what I am doing wrong. Can anyone help me with this?

Below is my code. I think hdfs_path is correct, since I did not change it in the settings when I installed Hadoop. pythonfile.txt is on my desktop (and so is the Python code I run from the command line).

Code:

import hadoopy 
import os 
hdfs_path = 'hdfs://localhost:9000/python' 

def main(): 
    hadoopy.writetb(hdfs_path, [('pythonfile.txt',open('pythonfile.txt').read())]) 

main() 

Output: when I run the code above, this is what I get at /python itself:

iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python 

DEPRECATED: Use of this script to execute hdfs command is deprecated. 
Instead use the hdfs command for it. 

14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
-rw-r--r-- 1 Brian supergroup  236 2014-10-28 11:30 /python 

Answers


I have the feeling that you are writing into a file called '/python', while you want it to be the directory in which the file is stored.

What does

hdfs dfs -cat /python 

show you?

If it shows you the contents of the file, all you need to do is edit hdfs_path to include the file name (you should first delete /python with -rm). Otherwise, use pydoop (pip install pydoop) and do this:

import pydoop.hdfs as hdfs 

from_path = '/tmp/infile.txt' 
to_path = 'hdfs://localhost:9000/python/outfile.txt' 
hdfs.put(from_path, to_path) 
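
Alternatively, if you prefer to stay with hadoopy, here is a minimal sketch of the corrected call from the question, with the file name included in hdfs_path (after removing the existing /python file with hdfs dfs -rm /python):

import hadoopy 

# the path now ends in the file name instead of pointing at '/python' itself 
hdfs_path = 'hdfs://localhost:9000/python/pythonfile.txt' 

def main(): 
    # writetb stores the (name, contents) pair in a SequenceFile at hdfs_path 
    hadoopy.writetb(hdfs_path, [('pythonfile.txt', open('pythonfile.txt').read())]) 

main() 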

hdfs dfs -cat /python shows: '14/10/28 21:38:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable*org.apache.hadoop.io.compress.DefaultCodec', followed by binary data containing 'pythonfile.txt'. – user3671459 2014-10-28 20:38:59


Indeed, as I said, you have written the data into a *file* named 'python' in the root directory. When you 'cat' it, you see the contents of the SequenceFile you wrote (writetb writes a SequenceFile, which is binary, unlike a text file). I don't see a way to write plain text files in the hadoopy documentation, so either use pydoop, or keep writing (and reading) SequenceFiles. Either way, you need to add the file name to the path, as mentioned in the answer above. – Legato 2014-10-29 05:43:53
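
For reference, if you do keep writing SequenceFiles with hadoopy, reading them back would look roughly like this (a sketch assuming hadoopy.readtb, which iterates over the stored key/value pairs, and the corrected path from above):

import hadoopy 

hdfs_path = 'hdfs://localhost:9000/python/pythonfile.txt' 

# readtb yields the (key, value) pairs that writetb stored 
for name, contents in hadoopy.readtb(hdfs_path): 
    print(name) 
    print(contents) 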


I found this answer here

import subprocess 

def run_cmd(args_list): 
    """ 
    Run a shell command and return (return_code, stdout, stderr). 
    """ 
    print('Running system command: {0}'.format(' '.join(args_list))) 
    # universal_newlines=True decodes stdout/stderr to str, so out.split('\n') below works on Python 3 
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE, 
                            universal_newlines=True) 
    s_output, s_err = proc.communicate() 
    s_return = proc.returncode 
    return s_return, s_output, s_err 

# Run Hadoop ls command in Python 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', 'hdfs_file_path']) 
lines = out.split('\n') 


# Run Hadoop get command in Python 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-get', 'hdfs_file_path', 'local_path']) 


# Run Hadoop put command in Python 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-put', 'local_file', 'hdfs_file_path']) 


# Run Hadoop copyFromLocal command in Python 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyFromLocal', 'local_file', 'hdfs_file_path']) 

# Run Hadoop copyToLocal command in Python 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-copyToLocal', 'hdfs_file_path', 'local_file']) 


# Run Hadoop remove file command in Python 
# Shell equivalent: hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', 'hdfs_file_path']) 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-skipTrash', 'hdfs_file_path']) 


# rm -r 
# HDFS command to remove the entire directory and all of its content from HDFS. 
# Usage: hdfs dfs -rm -r <path> 

(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', 'hdfs_file_path']) 
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', 'hdfs_file_path']) 




# Check if a file exists in HDFS 
# Usage: hadoop fs -test -[defsz] URI 

# Options: 
# -d: if the path is a directory, return 0. 
# -e: if the path exists, return 0. 
# -f: if the path is a file, return 0. 
# -s: if the path is not empty, return 0. 
# -z: if the file is zero length, return 0. 

# Example: 
# hadoop fs -test -e filename 

hdfs_file_path = '/tmpo' 
cmd = ['hdfs', 'dfs', '-test', '-e', hdfs_file_path] 
ret, out, err = run_cmd(cmd) 
print(ret, out, err) 
if ret: 
    print('file does not exist') 
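
Building on run_cmd above, here is a small sketch of how the -test -e exit code can be wrapped into a reusable check; the helper name hdfs_exists and the paths are just placeholders for illustration:

def hdfs_exists(path): 
    """Return True if path exists in HDFS, based on the exit code of 'hdfs dfs -test -e'.""" 
    ret, out, err = run_cmd(['hdfs', 'dfs', '-test', '-e', path]) 
    return ret == 0 

# Upload only when the destination is not already present (placeholder paths) 
if not hdfs_exists('/python/pythonfile.txt'): 
    run_cmd(['hdfs', 'dfs', '-put', 'pythonfile.txt', '/python/pythonfile.txt']) 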