How to detect changes in time series data using Spark Streaming

I have a continuous stream of data in Kafka. I want to count the number of times the value of a column changes in the stream.

Which algorithm should I use?

2016-11-14 Firdousi Farozan

Look into stateful transformations such as `mapWithState`, which can aggregate results across micro-batches.

Sure, I'll take a look. Thanks.
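The stateful approach suggested in the comment can be sketched in plain Scala, without any Spark dependency, to show the per-key state that would have to be carried between micro-batches. This is only an illustration under stated assumptions: the names `Reading`, `ChangeState` and `ChangeCounter` are hypothetical, values are treated as strings, records for a key are assumed to arrive in time order, and the first value seen for a key is not counted as a change.

```scala
// Plain-Scala sketch (no Spark dependency) of per-key change counting.
// All names here (Reading, ChangeState, ChangeCounter) are hypothetical.
object ChangeCounter {
  // One record from the stream: a key (for example a sensor id) and the watched column value.
  final case class Reading(key: String, value: String)

  // State kept per key between micro-batches: the last value seen and the changes so far.
  final case class ChangeState(lastValue: String, changes: Long)

  // State update for a single record: bump the counter only when the value differs
  // from the previous one. The first record for a key is not counted as a change.
  def update(prev: Option[ChangeState], value: String): ChangeState = prev match {
    case Some(s) if s.lastValue != value => ChangeState(value, s.changes + 1)
    case Some(s)                         => s
    case None                            => ChangeState(value, 0L)
  }

  // Apply one micro-batch to the state map, assuming records arrive in time order.
  def processBatch(state: Map[String, ChangeState], batch: Seq[Reading]): Map[String, ChangeState] =
    batch.foldLeft(state) { (acc, r) =>
      acc.updated(r.key, update(acc.get(r.key), r.value))
    }

  def main(args: Array[String]): Unit = {
    // Two simulated micro-batches; the state map survives from one to the next.
    val batches = Seq(
      Seq(Reading("a", "1"), Reading("a", "1"), Reading("b", "x")),
      Seq(Reading("a", "2"), Reading("b", "x"), Reading("a", "1"))
    )
    val finalState = batches.foldLeft(Map.empty[String, ChangeState])(processBatch)
    finalState.toSeq.sortBy(_._1).foreach { case (k, s) =>
      println(s"$k changed ${s.changes} time(s)")
    }
  }
}
```

In Spark Streaming, the `update` logic would roughly correspond to the function passed to `StateSpec.function`, with `State.getOption()` supplying the previous state and `State.update()` storing the new one. The ordering assumption matters: if records for a key can arrive out of order, they would need to be sorted by timestamp before the comparison.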

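In Spark 2.0, with Structured Streaming, working with a streaming DataFrame is very close to working with a normal DataFrame. In the following test example, the value counts are printed to the console each time a new batch of data is added.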

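// `words` is assumed to be a streaming Dataset[String], one word per row in column "value".
val wordCounts = words.groupBy("value").count()

// Print the complete, updated counts to the console after every batch.
val query = wordCounts.writeStream
    .outputMode("complete")
    .format("console")
    .start()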

We can also create our own StreamSinkProvider to decide what to do whenever a new batch of data arrives.

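import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class CustomSinkProvider extends StreamSinkProvider {
    def createSink(
        sqlContext: SQLContext,
        parameters: Map[String, String],
        partitionColumns: Seq[String],
        outputMode: OutputMode): Sink = {
        new Sink {
            // Called once for every micro-batch of new data.
            override def addBatch(batchId: Long, data: DataFrame): Unit = {
                // Do something.
            }
        }
    }
}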

Then use the CustomSinkProvider as follows:

val query = wordCounts.writeStream 
    .outputMode("complete") 
    .format(classOf[CustomSinkProvider].getCanonicalName) 
    .start()


2016-11-14 10:02:23

I don't think this answers my question. My question is about how to count the changes in a specific field.
