Publication Type
Conference Paper
Journal Name
Proceedings of the VLDB Endowment
Publication Date
Page Numbers
3135 to 3147
Volume
14
Issue
1
Conference Name
47th International Conference on Very Large Data Bases (VLDB)
Conference Location
Copenhagen, Denmark
Conference Sponsor
VLDB Foundation
Conference Date
-
Abstract
Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams.
First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:
• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.
Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.