2017-03-17

We plan to build a real-time monitoring system using Apache Kafka. The overall idea is to push data from multiple data sources into Kafka and perform data quality checks there. I have an architecture that uses Kafka to stream from multiple data sources.

  1. A few questions. What is the best way to stream data from multiple sources — mainly Java applications, Oracle databases, REST APIs, and log files — into Apache Kafka? Note that each client deployment contains each of these data sources, so the number of sources pushing data to Kafka will be (number of clients) × x, where x is the number of source types I listed. Ideally a push approach would fit best, rather than a pull approach: with a pull approach the target system has to be configured with credentials for many different source systems, which is impractical.
  2. How do we handle failures?
  3. How do we perform data quality checks on incoming messages? For example, if a message does not have all the required attributes, it could be dropped and the maintenance team alerted to investigate.
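For the push approach in question 1, one option is for each source to wrap its records in a common envelope before publishing to Kafka. A minimal sketch, assuming the `kafka-python` client and illustrative envelope fields, topic name, and broker address (none of these come from the question):

```python
import json
import time

def build_envelope(source_type, client_id, payload):
    """Wrap a raw record with the metadata that downstream quality
    checks will need. Field names here are illustrative assumptions,
    not a fixed schema."""
    return {
        "source_type": source_type,   # e.g. "java-app", "oracle", "rest-api", "log-file"
        "client_id": client_id,       # which customer deployment produced this record
        "ts": int(time.time() * 1000),
        "payload": payload,
    }

# The actual send is commented out so the sketch runs without a broker:
# from kafka import KafkaProducer
# producer = KafkaProducer(
#     bootstrap_servers="broker:9092",
#     value_serializer=lambda v: json.dumps(v).encode("utf-8"),
# )
# producer.send("monitoring-events", build_envelope("oracle", "client-42", {"rows": 10}))

env = build_envelope("oracle", "client-42", {"rows": 10})
print(env["source_type"], env["client_id"])
```

Tagging every record with `source_type` and `client_id` at the edge is what later makes it possible to tell which of the many client deployments a bad message came from.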
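For question 3, a common pattern is to validate each incoming message against a list of required attributes and route failures to a dead-letter topic that the maintenance team monitors. A sketch of just the routing decision — the topic names and required fields are assumptions:

```python
REQUIRED_FIELDS = ("source_type", "client_id", "ts", "payload")  # assumed schema

def route_message(message, required=REQUIRED_FIELDS):
    """Return (topic, message): valid messages go to the main topic,
    messages missing required attributes go to a dead-letter topic."""
    missing = [f for f in required if f not in message]
    if missing:
        # Attach the reason so the maintenance team can diagnose the failure.
        return "monitoring-dlq", {"error": f"missing fields: {missing}",
                                  "original": message}
    return "monitoring-events", message

topic, _ = route_message({"source_type": "oracle"})
print(topic)  # the incomplete message is routed to the dead-letter topic
```

Routing to a dead-letter topic rather than dropping outright keeps the bad message available for inspection, which matches the "alert the maintenance team to check" requirement.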

Please let me know your expert opinions. Thanks!

Answer


I think the best option here is Kafka Connect: link. But it is a pull-based approach:

Kafka Connect sources are pull-based for a few reasons. First, although connectors should generally run continuously, making them pull-based means that the connector/Kafka Connect decides when data is actually pulled, which allows for things like pausing connectors without losing data, brief periods of unavailability as connectors are moved, etc. Second, in distributed mode the tasks that pull data may need to be rebalanced across workers, which means they won't have a consistent location or address. While in standalone mode you could guarantee a fixed network endpoint to work with (and point other services at), this doesn't work in distributed mode where tasks can be moving around between workers.

– 阿雯
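As a concrete illustration of the pull-based Kafka Connect approach, a source connector for the Oracle case might be configured roughly like this. The property names follow Confluent's JDBC source connector; the connection details, column, and topic prefix are placeholders, not values from the question:

```json
{
  "name": "oracle-source-client-42",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:oracle:thin:@//db-host:1521/ORCL",
    "connection.user": "monitor",
    "connection.password": "********",
    "mode": "incrementing",
    "incrementing.column.name": "ID",
    "topic.prefix": "client-42-",
    "poll.interval.ms": "5000"
  }
}
```

One such config would be registered per client database, which is exactly the per-client management burden the comment below the answer raises.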


I agree with the advantages of the pull-based Kafka Connect approach, but consider that the connectors would need to pull from many sources, with the number depending on the number of clients. Managing things like source credentials configured in the connectors, and the frequent addition and removal of clients' source platforms, seems like a challenge. How do we handle this efficiently? –
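One way to reduce the per-client configuration burden raised here is to generate connector configs from a template and register them through the Kafka Connect REST API (`POST /connectors`), so adding or removing a client becomes one API call. A sketch of the config-building part, with all names hypothetical and credentials assumed to come from a secret store rather than literals:

```python
def connector_config(client_id, db_host, credentials):
    """Build a JDBC source connector config for one client deployment.
    In practice `credentials` would be fetched from a secret store,
    not hard-coded; all names here are illustrative."""
    return {
        "name": f"oracle-source-{client_id}",
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": f"jdbc:oracle:thin:@//{db_host}:1521/ORCL",
            "connection.user": credentials["user"],
            "connection.password": credentials["password"],
            "mode": "incrementing",
            "incrementing.column.name": "ID",
            "topic.prefix": f"{client_id}-",
        },
    }

# Registering a client would then be a single HTTP call, e.g. with `requests`:
# requests.post("http://connect:8083/connectors", json=connector_config(...))
# and removal a DELETE on /connectors/<name>.

cfg = connector_config("client-42", "db42.example.com",
                       {"user": "monitor", "password": "x"})
print(cfg["name"])
```

Driving this from the client-provisioning system keeps the connector inventory in sync as clients are added and removed.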
