How to check streaming data record already there i

2020-01-29 02:38发布

问题:

For my PoC, I am using spark-sql 2.4.x with Kafka. I have a streaming company data coming from Kafka topic. Company data Which has "company_id" ,"created_date" ,"field1" , "field2" and etc as fields. lets say this as newCompanyDataStream.

I have old company data in my parquet file. i.e. "hdfs://parquet/company" , lets say this as oldCompanyDataDf.

I need to check the new data stream from kafka (newCompanyDataStream) , for each received record of given company_id , if the data already there in the "hdfs://parquet/company" file. (oldCompanyDataDf)

How to check this?

If newCompanyDataStream "field1" and oldCompanyDataDf "field1" not same then perform tast2 ( i.e. removed oldCompanyDataDf record and add newCompanyDataStream "field1" record into oldCompanyDataDf )

If newCompanyDataStream "field2" and oldCompanyDataDf "field2" not same then perform tast2 ( i.e. removed oldCompanyDataDf record and add newCompanyDataStream "field2" record into oldCompanyDataDf )

How to implement this using spark-sql structured streaming?

For Any snippet or advice is very much thankful