Dumping CSV log files from a Windows server to Ubuntu VirtualBox/Hadoop/HDFS

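We get new files from an application every day, stored as CSV on a Windows server, for example in c:/program files(x86)/webapps/apachetomcat/.csv. Each file contains different data. Is there a Hadoop component that can transfer the files from the Windows server to Hadoop HDFS? I came across Flume and Kafka but could not find a proper example. Can anyone shed some light here?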

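Each file has its own name and is up to 10-20 MB in size, and there are more than 200 files per day. As soon as a file is added to the Windows server, Flume/Kafka should be able to put it into Hadoop. The files are then read from HDFS, processed with Spark, and moved to a processed-files folder elsewhere in HDFS.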


2016-11-30 Deno George

Please give more details. What are the file sizes? What do you want to do with this data?

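As per my comment, more details would help narrow down the possibilities. For example, first consider moving the files to the server and simply creating a bash script and scheduling it with cron.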

put 

Usage: hdfs dfs -put <localsrc> ... <dst> 

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system. 

hdfs dfs -put localfile /user/hadoop/hadoopfile 
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir 
hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile 
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile 
Reads the input from stdin. 

Exit Code: 

Returns 0 on success and -1 on error.
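
The bash-script-plus-cron idea above could look roughly like the following. This is only a minimal sketch, assuming the CSV files have already reached a directory on the Linux machine (for example through a shared folder); the staging path, the HDFS target directory, the `uploaded` archive folder and the function name `upload_new_csvs` are all hypothetical and not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch: push new CSV files from a local staging directory into HDFS.
# All paths used in the examples below are hypothetical.

HDFS_CMD="${HDFS_CMD:-hdfs}"   # overridable so the function can be dry-run

upload_new_csvs() {
    local src_dir="$1"
    local hdfs_dir="$2"
    local done_dir="$src_dir/uploaded"
    local f

    mkdir -p "$done_dir"
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue      # the glob matched nothing
        # -put exits with 0 on success; archive the file only in that case
        if "$HDFS_CMD" dfs -put "$f" "$hdfs_dir/"; then
            mv "$f" "$done_dir/"
        else
            echo "upload failed: $f" >&2
        fi
    done
}

# Example call (hypothetical paths):
#   upload_new_csvs /data/incoming /user/hadoop/csv_logs
#
# Example crontab entry running a wrapper script every 10 minutes:
#   */10 * * * * /home/hadoop/bin/upload_csv.sh >> /tmp/upload_csv.log 2>&1
```

Moving each file aside after a successful upload keeps the next cron run from uploading it again; a file whose upload fails stays in place and is retried on the following run.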


2016-11-30 18:21:13

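Flume is the best option. A Flume agent (process) needs to be configured. A Flume agent has 3 parts:

Flume source - the place where Flume looks for new files. c:/program files(x86)/webapps/apachetomcat/.csv in your case.

Flume sink - the place where Flume sends the files. The HDFS location in your case.

Flume channel - the temporary location for the files before they are sent to the sink. You need to use a "file channel" for your case.

For an example, click here.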
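
As an illustration of the three parts, a single-agent configuration might be generated like this. It is a sketch under assumptions, not a tested setup: the agent name `a1`, the component names `r1`/`c1`/`k1`, every directory and the target HDFS path are hypothetical, and the spooling directory source is used here as one possible way to pick up new files.

```shell
#!/usr/bin/env bash
# Sketch: write a single-agent Flume configuration file.
# Agent name, component names and all paths are hypothetical.

write_flume_conf() {
    local conf_file="$1"
    cat > "$conf_file" <<'EOF'
# Agent "a1": spooling-directory source -> file channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: watch a directory for new, completed files
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/incoming
a1.sources.r1.channels = c1

# Channel: durable on-disk buffer between source and sink
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# Sink: write the events to HDFS as plain text
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://nn.example.com/user/hadoop/csv_logs
a1.sinks.k1.hdfs.fileType = DataStream
EOF
}

# Example (hypothetical locations):
#   write_flume_conf /etc/flume/conf/csv-agent.conf
#   flume-ng agent --conf /etc/flume/conf \
#       --conf-file /etc/flume/conf/csv-agent.conf --name a1
```

Note that the spooling directory source expects files to be complete and unchanged once they appear in the watched directory, so the application should write each CSV elsewhere and move it in when finished.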


2016-11-30 21:56:04 AkashNegi

Thanks Akash, so I need Flume on both Windows and Linux? Can you give me a detailed explanation with an example?

Yes, you need to run 2 agents, as shown in https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/installing_flume.html. It would be great if you could somehow get the logs onto a node local to HDFS, but if that is not possible, some workarounds are listed at http://stackoverflow.com/questions/26168820/transferring-files-from-remote-node-to-hdfs-with-flume. – AkashNegi
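
A two-agent layout of this kind is commonly wired with an Avro sink on the sending side and an Avro source on the receiving side. The sketch below writes both configuration files; the agent names `sender` and `collector`, the host name `hadoop-vm.example.com`, port 4545 and all directories are hypothetical placeholders, and the linked pages remain the reference for a real setup.

```shell
#!/usr/bin/env bash
# Sketch: write configs for two chained Flume agents.
# Agent names, host, port and all paths are hypothetical.

write_two_agent_confs() {
    local out_dir="$1"
    mkdir -p "$out_dir"

    # Agent on the Windows side: spooldir source -> file channel -> Avro sink
    cat > "$out_dir/sender.conf" <<'EOF'
sender.sources = r1
sender.channels = c1
sender.sinks = k1
sender.sources.r1.type = spooldir
sender.sources.r1.spoolDir = C:/flume/incoming
sender.sources.r1.channels = c1
sender.channels.c1.type = file
sender.sinks.k1.type = avro
sender.sinks.k1.hostname = hadoop-vm.example.com
sender.sinks.k1.port = 4545
sender.sinks.k1.channel = c1
EOF

    # Agent on the Hadoop side: Avro source -> file channel -> HDFS sink
    cat > "$out_dir/collector.conf" <<'EOF'
collector.sources = r1
collector.channels = c1
collector.sinks = k1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
collector.channels.c1.type = file
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = /user/hadoop/csv_logs
collector.sinks.k1.hdfs.fileType = DataStream
collector.sinks.k1.channel = c1
EOF
}

# Example: write_two_agent_confs ./flume-conf
```

The port configured on the sender's Avro sink has to match the port the collector's Avro source listens on, and the collector should be started first so the sender has something to connect to.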
