2017-10-19

Using groupBy to merge the files within each folder while copying from HDFS to S3

I have the following folders in HDFS:

hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/OM/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/Others/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/QA/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/QA/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/SA/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/BOOK/SA/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/AE/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/BH/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/BH/INT/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101 
hdfs://x.x.x.x:8020/Air/SEARCH/IN/INT/20171001/2017100101 

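Each folder contains close to 50 files. My goal is to merge all the files within a folder into a single file while copying from HDFS to S3. The problem I'm running into is the regular expression for the --groupBy option. I tried the following, which does not seem to work: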

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/ --groupBy '.*/(\w+)/(\w+)/(\w+)/.*' --outputCodec lzo 

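The command itself runs, but the files in each folder are not merged into a single file, which leads me to believe the problem is with my regular expression.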
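
One way to check a --groupBy pattern before running the copy is to apply it to a sample path and look at what the parentheses capture. The sketch below is only an approximation: it uses Python's `re` module rather than the Java regex engine S3DistCp runs on (the two should agree for a pattern this simple), it assumes the pattern is matched against the full file path, and the file name `part-00000` is hypothetical, since only the folders are listed above.

```python
import re

# Hypothetical file name; the question lists folders, not the files inside them.
path = "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/part-00000"

# The pattern from the attempted command.
attempted = re.compile(r".*/(\w+)/(\w+)/(\w+)/.*")

def capture_groups(pattern, path):
    """Return the captured groups if the whole path matches, else None."""
    m = pattern.fullmatch(path)
    return m.groups() if m else None

print(capture_groups(attempted, path))
# ('DOM', '20171001', '2017100101')
```

Under these assumptions, the leading greedy `.*` consumes as much of the path as it can, so the three groups land on the last three directory levels (DOM, date, hour) rather than on the three levels directly under Air.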

Answer

I figured this out myself. The correct regular expression is

.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.* 

and the command to merge and copy is:

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/ --groupBy '.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*' --outputCodec lzo 
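
To see how this pattern groups the folders, here is a minimal sketch that simulates the grouping in Python. It is not S3DistCp itself: the file names (`part-00000`, `part-00001`) are hypothetical, the helper `group_paths` is made up for illustration, and it assumes that files whose captured groups concatenate to the same string end up in the same output file, which is how I understand --groupBy to behave.

```python
import re
from collections import defaultdict

# The corrected pattern: anchor on /Air/ and capture the three levels below it.
corrected = re.compile(r".*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*")

# Hypothetical file names under a few of the folders from the question.
paths = [
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/part-00000",
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/part-00001",
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101/part-00000",
    "hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101/part-00000",
]

def group_paths(pattern, paths):
    """Group paths by the concatenation of their captured groups."""
    groups = defaultdict(list)
    for p in paths:
        m = pattern.fullmatch(p)
        if m:  # paths that do not match the pattern are skipped
            groups["".join(m.groups())].append(p)
    return dict(groups)

for key, members in group_paths(corrected, paths).items():
    print(key, len(members))
# BOOKAEDOM 2
# BOOKAEINT 1
# SEARCHINDOM 1
```

Note that the date and hour levels are matched by `.*` and not captured, so under this grouping files from different dates or hours in the same BOOK/AE/DOM style folder would fall into the same group. If one file per hour is wanted, those levels would need their own parentheses.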