For my PoC, I am using spark-sql 2.4.x with Kafka. I have streaming company data coming in from a Kafka topic, with fields such as "company_id", "created_date", "field1", and "field2". Let's call this newCompanyDataStream.
I also have old company data in a parquet file, i.e. "hdfs://parquet/company". Let's call this oldCompanyDataDf.
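For reference, this is roughly how I set both up (a minimal sketch; the broker address, topic name, and JSON schema are placeholders for my actual setup):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("CompanyPoC").getOrCreate()
import spark.implicits._

// Placeholder schema for the JSON payload on the topic
val companySchema = new StructType()
  .add("company_id", StringType)
  .add("created_date", TimestampType)
  .add("field1", StringType)
  .add("field2", StringType)

// Streaming source: parse each Kafka message value into typed columns
val newCompanyDataStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "company-topic")             // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), companySchema).as("data"))
  .select("data.*")

// Static source: the existing company data in parquet
val oldCompanyDataDf = spark.read.parquet("hdfs://parquet/company")
```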
For each record received on the Kafka stream (newCompanyDataStream), I need to check whether a record with the same company_id already exists in the "hdfs://parquet/company" file (oldCompanyDataDf).
How can I check this?
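My current idea for the existence check is a stream-static inner join on company_id, which as far as I know is supported in structured streaming; a sketch:

```scala
// Distinct company_ids already present in the parquet data
val existingIds = oldCompanyDataDf.select("company_id").distinct()

// Inner stream-static join: keeps only the stream records whose
// company_id already exists in "hdfs://parquet/company"
val alreadyThere = newCompanyDataStream.join(existingIds, Seq("company_id"))
```

Is that the right way to do it, or is there a better approach?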
If the newCompanyDataStream "field1" and the oldCompanyDataDf "field1" are not the same, then perform task 2 (i.e. remove the oldCompanyDataDf record and add the newCompanyDataStream record into oldCompanyDataDf).
Likewise, if the newCompanyDataStream "field2" and the oldCompanyDataDf "field2" are not the same, perform the same task 2 (remove the old record and add the new one).
How can I implement this using spark-sql structured streaming?
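Here is the rough approach I am considering with foreachBatch (a sketch only: it assumes the batch schema matches the parquet schema, and I am not sure overwriting the same path that was just read is safe without a staging directory):

```scala
import org.apache.spark.sql.DataFrame

val parquetPath = "hdfs://parquet/company"

val query = newCompanyDataStream.writeStream
  .option("checkpointLocation", "hdfs://checkpoints/company") // placeholder path
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Re-read the current snapshot so each micro-batch sees the latest data
    val oldDf = batchDf.sparkSession.read.parquet(parquetPath)

    // Drop every old record whose company_id appears in this batch,
    // then append the incoming records: an upsert keyed on company_id
    val upserted = oldDf
      .join(batchDf.select("company_id"), Seq("company_id"), "left_anti")
      .unionByName(batchDf)

    // Materialize before overwriting the path we just read from;
    // in production a write-to-staging-then-swap would be safer
    upserted.cache()
    upserted.count()
    upserted.write.mode("overwrite").parquet(parquetPath)
    upserted.unpersist()
  }
  .start()

query.awaitTermination()
```

Since replacing a record with an identical one is a no-op, I think an upsert keyed only on company_id gives the same end state as comparing "field1"/"field2" explicitly, but I would like confirmation that this is the right pattern.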
Any snippet or advice would be very much appreciated.