Spark Structured Streaming - Watermark

Spark Structured Streaming Watermark Concept.

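A watermark is a fundamental concept in stream processing. It decides when the data for a given time range can be treated as complete ("frozen") and is therefore safe to aggregate.
In Spark, the watermark at a given point in processing time is derived from two values: the maximum event time seen so far and a threshold. The threshold is the delay we are willing to wait for late data, so the watermark is the maximum event time minus the threshold. Events older than the watermark are considered too late and may be dropped from the aggregation.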
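
To make the calculation concrete, here is a minimal sketch in plain Python (not Spark code) that models the rule described above. The function name `update_watermark`, the helper `at`, and the sample timestamps are made up for illustration. In Spark itself the event-time column and the threshold are declared with `withWatermark(eventTime, delayThreshold)` on a streaming DataFrame.

```python
from datetime import datetime, timedelta

def update_watermark(current, batch_event_times, threshold):
    """Advance the watermark after a batch; it never moves backwards."""
    if not batch_event_times:
        return current
    # Watermark candidate: max event time in this batch minus the threshold.
    candidate = max(batch_event_times) - threshold
    return candidate if current is None else max(current, candidate)

def at(hour, minute):
    # Hypothetical helper: a timestamp on an arbitrary sample day.
    return datetime(2024, 1, 1, hour, minute)

threshold = timedelta(minutes=10)

# Batch 1: the max event time is 12:14, so the watermark becomes 12:04.
wm1 = update_watermark(None, [at(12, 4), at(12, 14), at(12, 9)], threshold)
# Batch 2: the max event time is 12:21, so the watermark advances to 12:11.
wm2 = update_watermark(wm1, [at(12, 8), at(12, 21)], threshold)
# Batch 3: only a late event (12:05) arrives; the watermark stays at 12:11.
wm3 = update_watermark(wm2, [at(12, 5)], threshold)

print(wm1.time(), wm2.time(), wm3.time())
```

In this model the 12:05 event in the third batch is older than the current watermark (12:11), so it would count as late data.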

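The threshold therefore determines how soon results can be produced: a larger threshold tolerates later data but delays the finalized results and keeps more state, while a smaller threshold produces results sooner but drops more late data.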
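
The effect of the threshold on result timing can be sketched with a small plain-Python simulation. It assumes that a window is finalized once the watermark reaches or passes the window end; the function name `first_closing_batch` and the sample times are hypothetical, and real Spark behavior also depends on the output mode and trigger.

```python
from datetime import datetime, timedelta

def first_closing_batch(batch_max_event_times, window_end, threshold):
    """Return the index of the first batch after which the window is final."""
    max_seen = None
    for index, event_time in enumerate(batch_max_event_times):
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        # Assumption: the window closes when watermark >= window end.
        if max_seen - threshold >= window_end:
            return index
    return None  # the window is still open after all batches

def at(hour, minute):
    # Hypothetical helper: a timestamp on an arbitrary sample day.
    return datetime(2024, 1, 1, hour, minute)

# Max event time observed in each successive micro-batch (sample data).
batch_max = [at(12, 7), at(12, 12), at(12, 16), at(12, 21), at(12, 26)]
window_end = at(12, 10)  # end of a 12:00 to 12:10 window

closing = {
    minutes: first_closing_batch(batch_max, window_end, timedelta(minutes=minutes))
    for minutes in (5, 10, 15)
}
print(closing)
```

With this sample data, each extra five minutes of threshold delays the final result for the 12:00 to 12:10 window by one more batch.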

The following YouTube video explains how watermarks work in Spark stream processing.

https://www.youtube.com/watch?v=XjlKGvUt2dY

Spark's watermark differs from Apache Beam's in two ways.

Spark's watermark is global to the streaming query, while Apache Beam tracks a watermark for each transform. Also, Spark's watermark is based on the maximum event time seen, while Apache Beam's is based on the oldest event time still to be processed.
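
The contrast can be illustrated with a toy model in plain Python. This is a simplification under stated assumptions, not either engine's actual implementation: event times are plain integers (seconds), the Spark-style value is the maximum seen event time minus a threshold, and the Beam-style value is the oldest event time still pending at each transform. All names and numbers are hypothetical.

```python
def spark_style_watermark(seen_event_times, threshold):
    """One global value: max event time seen so far minus the threshold."""
    return max(seen_event_times) - threshold

def beam_style_watermarks(pending_by_transform):
    """One value per transform: the oldest event time still pending there."""
    return {
        name: min(pending)
        for name, pending in pending_by_transform.items()
        if pending  # transforms with nothing pending are skipped in this toy model
    }

# Event times (in seconds) seen by the query so far (sample data).
seen = [100, 130, 115, 160]
global_wm = spark_style_watermark(seen, threshold=30)

# Hypothetical pipeline: each transform still holds some unprocessed events.
pending = {"parse": [160], "aggregate": [115, 160]}
per_transform_wm = beam_style_watermarks(pending)

print(global_wm, per_transform_wm)
```

In this toy model the whole Spark query shares the single value 130, while the two Beam-style transforms sit at different watermarks because they hold different unprocessed events.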
