For my PoC, I am using spark-sql 2.4.x with Kafka. I have streaming company data coming in from a Kafka topic, with fields such as "company_id", "created_date", "field1", and "field2". Let's call this newCompanyDataStream.
I also have old company data in a parquet file, i.e. "hdfs://parquet/company". Let's call this oldCompanyDataDf.
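For reference, this is roughly how I set both up (a minimal sketch; the broker address, topic name, and JSON schema are placeholders for my actual setup):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("CompanyPoC").getOrCreate()
import spark.implicits._

// Placeholder schema for the JSON payload on the topic
val companySchema = new StructType()
  .add("company_id", StringType)
  .add("created_date", TimestampType)
  .add("field1", StringType)
  .add("field2", StringType)

// Streaming source: parse each Kafka message value into typed columns
val newCompanyDataStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "company-topic")             // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), companySchema).as("data"))
  .select("data.*")

// Static source: the existing company data in parquet
val oldCompanyDataDf = spark.read.parquet("hdfs://parquet/company")
```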
For each record received on the Kafka stream (newCompanyDataStream), I need to check whether a record with the same company_id already exists in the "hdfs://parquet/company" file (oldCompanyDataDf).
How can I check this?
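My current idea for the existence check is a stream-static inner join on company_id, which as far as I know is supported in structured streaming; a sketch:

```scala
// Distinct company_ids already present in the parquet data
val existingIds = oldCompanyDataDf.select("company_id").distinct()

// Inner stream-static join: keeps only the stream records whose
// company_id already exists in "hdfs://parquet/company"
val alreadyThere = newCompanyDataStream.join(existingIds, Seq("company_id"))
```

Is that the right way to do it, or is there a better approach?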
If the newCompanyDataStream "field1" and the oldCompanyDataDf "field1" are not the same, then perform task 2 (i.e. remove the oldCompanyDataDf record and add the newCompanyDataStream record into oldCompanyDataDf).
Likewise, if the newCompanyDataStream "field2" and the oldCompanyDataDf "field2" are not the same, perform the same task 2 (remove the old record and add the new one).
How can I implement this using spark-sql structured streaming?
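Here is the rough approach I am considering with foreachBatch (a sketch only: it assumes the batch schema matches the parquet schema, and I am not sure overwriting the same path that was just read is safe without a staging directory):

```scala
import org.apache.spark.sql.DataFrame

val parquetPath = "hdfs://parquet/company"

val query = newCompanyDataStream.writeStream
  .option("checkpointLocation", "hdfs://checkpoints/company") // placeholder path
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Re-read the current snapshot so each micro-batch sees the latest data
    val oldDf = batchDf.sparkSession.read.parquet(parquetPath)

    // Drop every old record whose company_id appears in this batch,
    // then append the incoming records: an upsert keyed on company_id
    val upserted = oldDf
      .join(batchDf.select("company_id"), Seq("company_id"), "left_anti")
      .unionByName(batchDf)

    // Materialize before overwriting the path we just read from;
    // in production a write-to-staging-then-swap would be safer
    upserted.cache()
    upserted.count()
    upserted.write.mode("overwrite").parquet(parquetPath)
    upserted.unpersist()
  }
  .start()

query.awaitTermination()
```

Since replacing a record with an identical one is a no-op, I think an upsert keyed only on company_id gives the same end state as comparing "field1"/"field2" explicitly, but I would like confirmation that this is the right pattern.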
Any snippet or advice would be very much appreciated.