Despite the fact that I'm using withWatermark(), I'm getting the following error message when I run my Spark job:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
From what I can see in the programming guide, this exactly matches the intended usage (and the example code). Does anyone know what might be wrong?
Thanks in advance!
Relevant Code (Java 8, Spark 2.2.0):
StructType logSchema = new StructType()
    .add("timestamp", TimestampType)
    .add("key", IntegerType)
    .add("val", IntegerType);

Dataset<Row> kafka = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", brokers)
    .option("subscribe", topics)
    .load();

Dataset<Row> parsed = kafka
    .select(from_json(col("value").cast("string"), logSchema).alias("parsed_value"))
    .select("parsed_value.*");

Dataset<Row> tenSecondCounts = parsed
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        parsed.col("key"),
        window(parsed.col("timestamp"), "1 day"))
    .count();

StreamingQuery query = tenSecondCounts
    .writeStream()
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .outputMode("append")
    .format("console")
    .option("truncate", false)
    .start();
The problem is in parsed.col. Replacing it with col will fix the issue. I would suggest always using the col function instead of Dataset.col.
Dataset.col returns a resolved column, while col returns an unresolved column. parsed.withWatermark("timestamp", "10 minutes") will create a new Dataset with new columns that have the same names. The watermark information is attached to the timestamp column in the new Dataset, not to parsed.col("timestamp"), so the columns in groupBy don't have a watermark.
When you use unresolved columns, Spark will figure out the correct columns for you.
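To make the resolved versus unresolved distinction concrete, here is a toy model in plain Java 8. It is only a sketch of the explanation above, not Spark's implementation: the classes Attr and Ds and the method groupingHasWatermark are hypothetical names invented for illustration. It shows a column bound to the old dataset missing the watermark, while a lookup by name against the new dataset finds it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT Spark's API.
public class WatermarkToy {

    // A column as the planner sees it: a name plus a watermark flag.
    static final class Attr {
        final String name;
        final boolean hasWatermark;

        Attr(String name, boolean hasWatermark) {
            this.name = name;
            this.hasWatermark = hasWatermark;
        }
    }

    // A dataset is modelled as a list of attributes that is never modified.
    static final class Ds {
        final List<Attr> attrs;

        Ds(List<Attr> attrs) {
            this.attrs = attrs;
        }

        // Stands in for Dataset.col: returns an attribute bound to THIS dataset.
        Attr col(String name) {
            for (Attr a : attrs) {
                if (a.name.equals(name)) {
                    return a;
                }
            }
            throw new IllegalArgumentException("no such column: " + name);
        }

        // Stands in for withWatermark: returns a NEW dataset whose event-time
        // column carries the watermark; the receiver is left unchanged.
        Ds withWatermark(String eventTime) {
            List<Attr> out = new ArrayList<>();
            for (Attr a : attrs) {
                out.add(new Attr(a.name, a.hasWatermark || a.name.equals(eventTime)));
            }
            return new Ds(out);
        }

        // Grouping by an already-bound attribute: it is used as-is.
        boolean groupingHasWatermark(Attr resolved) {
            return resolved.hasWatermark;
        }

        // Grouping by a bare name: the name is looked up against this dataset.
        boolean groupingHasWatermark(String unresolvedName) {
            return col(unresolvedName).hasWatermark;
        }
    }

    public static void main(String[] args) {
        List<Attr> cols = new ArrayList<>();
        cols.add(new Attr("timestamp", false));
        cols.add(new Attr("key", false));
        Ds parsed = new Ds(cols);
        Ds watermarked = parsed.withWatermark("timestamp");

        // parsed.col(...) style: bound to the old dataset, prints false.
        System.out.println(watermarked.groupingHasWatermark(parsed.col("timestamp")));
        // col(...) style: resolved by name against the new dataset, prints true.
        System.out.println(watermarked.groupingHasWatermark("timestamp"));
    }
}
```

Applied to the question's code, this corresponds to grouping by col("key") and window(col("timestamp"), "1 day") instead of the parsed.col(...) versions.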