-->

Best way ho to validate ingested data

2019-08-18 04:27发布

问题:

I am ingesting data daily from various external sources like GA, scrapers, Google BQ, etc. I store created CSV file into HDFS, create stage table from it and then append it to historical table in Hadoop. Can you share some best practices how to valide new data with historical one? Like for example compare row count of actual data with average of last 10 days or someting like that. Is there any prepared solution in spark or something?

Thanks for advices.