I've got a few clusters that have now been running for a month or so, and I'm finding that temporary storage is being entirely gobbled up by Service Fabric log files. On a fleet of F1 VMs, where there is only 16GB of local storage, I am just about out of space; a few of them are now down to 30MB — yes, megabytes — of free space (and less than 1GB is consumed by my application in all its versions).
Looking at the disk usage on the cluster VMs, I can see clearly that the SvcFab\Log and SvcFab\ReplicatorLog folders are consuming over 90% of available space. Surely Service Fabric can handle this better. Is there something I can toggle or configure to get it to flush some of its data? Or, better yet, move it up to blob or table storage?
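For what it's worth, the cluster's fabricSettings (in the ARM template) do expose knobs that appear to govern both folders: the Diagnostics section's MaxDiskQuotaInMB caps the SvcFab\Log diagnostic traces, and the KtlLogger section's SharedLogSizeInMB sizes the shared ReplicatorLog. A sketch of what that might look like — the 5120/4096 values are placeholders I chose for illustration, not recommendations, so check the documented defaults and minimums before shrinking anything:

```json
"fabricSettings": [
  {
    "name": "Diagnostics",
    "parameters": [
      { "name": "MaxDiskQuotaInMB", "value": "5120" }
    ]
  },
  {
    "name": "KtlLogger",
    "parameters": [
      { "name": "SharedLogSizeInMB", "value": "4096" }
    ]
  }
]
```

Note that resizing the KtlLogger shared log is reportedly disruptive (the log is pre-allocated on each node), so treat this as a cluster-configuration change to roll out carefully, not a live toggle.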
This must be an issue for others. What are others doing? And Service Fabric team, what is best practice for this?
So, no useful help on this one. I resorted to tearing down that cluster and rebuilding it. Fortunately for me the cluster was one of a pair, and I was able to simply redirect all traffic via TrafficMgr to the other cluster while I destroyed the faulty one and created a fresh one.
Pretty disconcerting to me. Had I not had this redundancy, it would have been a rather huge problem. :-(
I am not sure if the below is considered tearing down the cluster! I tested this with a stateless service on a dummy Service Fabric app.
The Service Fabric cluster we deployed on Standard_DS1_v2 was suffering quorum loss, and the health analysis service also failed because of insufficient disk space. Instead of tearing down the cluster, I stopped the VM scale set using ARM PowerShell,
then went to Azure Portal > Resource Groups > Virtual Machine Scale Set > Scaling to bump the SKU to Standard_D1_v2, and started the VM scale set,
then redeployed the Service Fabric app, and it works as expected!
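The stop/resize/start steps above could be scripted end to end with the classic AzureRM module rather than clicking through the portal — a rough sketch, where the resource-group and scale-set names are placeholders you'd substitute with your own:

```powershell
# Placeholder names -- replace with your resource group and the
# scale set backing your Service Fabric node type.
$rg   = "my-sf-resource-group"
$vmss = "my-nodetype-vmss"

# 1. Stop (deallocate) the scale set so the SKU can be changed.
Stop-AzureRmVmss -ResourceGroupName $rg -VMScaleSetName $vmss -Force

# 2. Change the SKU on the scale-set model and push the update.
$set = Get-AzureRmVmss -ResourceGroupName $rg -VMScaleSetName $vmss
$set.Sku.Name = "Standard_D1_v2"
Update-AzureRmVmss -ResourceGroupName $rg -VMScaleSetName $vmss `
    -VirtualMachineScaleSet $set

# 3. Start the scale set back up.
Start-AzureRmVmss -ResourceGroupName $rg -VMScaleSetName $vmss
```

Be aware this takes every node in the node type down at once, so like the portal route it's really only safe on a test cluster or one you can afford to lose quorum on, as here.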
If the replicator log is full, it implies you are using the F1s for data storage ... 16GB is not a lot for data storage, and you may be better off breaking the app into processing and storage services on different node sets.
Not an expert on how SF stores things (I will leave that, and trimming, to others — there is not a lot of info out there), but if it's like similar solutions, then the replicator log holds part of your data and shrinks when it is safe to do so. Also, rather than F1 you may be better off using F2 or F4, since they have 2x or 4x the IO and cores — you're not losing anything, but gaining extra storage. And it means less replication (unless you're doing lots of partitioning).