I am ingesting data from a file source using structured streaming. I have a checkpoint setup and it works correctly as far as I can tell except I don't understand what will happen in a couple situations. If my streaming app runs for a long time will the checkpoint files just continue to become larger forever or is it eventually cleaned up. And does it matter if it is never cleaned up? It seems that eventually it would become large enough that it would take a long time for the program to parse.
My other question is when I manually remove or alter the checkpoint folder, or change to a different checkpoint folder no new files are ingested. The files are recognized and are added to the checkpoint, but the file is not actually ingested. This has me worried that if somehow the checkpoint folder is altered my ingestion will screw up. I haven't been able to find much information on what the correct procedure is in these situations.
After 6 months of running my Structured Streaming app I found some answer I think. The checkpoint files compact together every 10 executions and do continue to grow. Once these compacted files got large ~2gb, there was a noticeable decrease in processing time. So every 10 executions had approximately a 3-5 minute delay. I cleaned up the checkpoint files therefore starting over, and execution time was instantly back to normal.
For my second question, I found that there are essentially two checkpoint locations. The checkpoint folder that is specified and another _spark_metadata in the table directory. Both need to be removed to start over with checkpoint.
Structured Streaming keeps a background thread which is responsible for deleting snapshots and deltas of your state, so you shouldn't be concerned about it unless your state is really large and the amount of space you have is small, in which case you can configure the retrained deltas/snapshots Spark stores.
I'm not really sure what you mean here, but you should only remove checkpointed data in special cases. Structured Streaming allows you to keep state between version upgrades as long as the stored data type is backwards compatible. I don't really see a good reason for altering the location of your checkpoint or deleting the files manually unless something bad happened.