Configuring an external data source for Elastic MapReduce

We want to use Amazon Elastic MapReduce on top of our current database (we run Cassandra on EC2). According to the Amazon EMR FAQ, this should be possible: Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon S3?

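However, when creating a new job flow, we can only configure an S3 bucket as the input data source.

Any ideas or samples on how to do this?

Thanks!

P.S.: I have seen the question "How to use external data with Elastic MapReduce", but the answers don't really explain how to do or configure it, only that it is possible.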

Try copying the file to your EMR instance with scp, for example:

my-desktop-box$ scp mylocaldatafile my-emr-node:/path/to/local/file

(or use ftp, wget, curl, or anything else you like)

Then log in to your EMR instance over ssh and load the file into Hadoop:

my-desktop-box$ ssh my-emr-node 
    my-emr-node$ hadoop fs -put /path/to/local/file /path/in/hdfs/file
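
If the data is reachable over HTTP from the EMR node, the copy and the load can be combined into one step. The following is only a minimal sketch, not a tested recipe: the function name stage_to_hdfs, the DRY_RUN switch, and the URL and HDFS path in the usage line are all made up for illustration. It relies on hadoop fs -put reading from stdin when the source is given as "-".

```shell
#!/usr/bin/env bash
# Sketch: run on the EMR node. Streams a file from an external HTTP
# source straight into HDFS, without keeping a local copy.
set -euo pipefail

# Set DRY_RUN=1 to print the pipeline instead of running it (illustrative switch).
DRY_RUN="${DRY_RUN:-0}"

# stage_to_hdfs is a hypothetical helper name, not part of Hadoop or EMR.
stage_to_hdfs() {
  local src_url="$1"    # external source URL (assumption: reachable from the node)
  local hdfs_path="$2"  # destination file path in HDFS
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -fsSL $src_url | hadoop fs -put - $hdfs_path"
  else
    # A source of "-" makes hadoop fs -put read the data from stdin.
    curl -fsSL "$src_url" | hadoop fs -put - "$hdfs_path"
  fi
}

# When invoked with exactly two arguments, run the transfer.
if [ "$#" -eq 2 ]; then
  stage_to_hdfs "$1" "$2"
fi
```

Saved as, say, stage.sh, it could be run on the node as: bash stage.sh http://example.com/mydata.csv /path/in/hdfs/file. This only moves a file; it does not read from Cassandra directly.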

2013-03-27 05:53:33 Christopher

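How are you going to process the data? EMR just manages Hadoop. You still need to write some kind of program.

If you are writing a Hadoop MapReduce job, then you are writing Java, and you can use the Cassandra APIs to access the data.

If you want to use something like Hive, you will need to write a Hive storage handler to use data backed by Cassandra.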

2013-06-24 05:46:22 prestomation
