I have end users who will be uploading a CSV file into a bucket, which will then be loaded into BigQuery. The issue is that the content of the data is unreliable, i.e. it contains free-text fields that may include linefeeds, extra commas, invalid date formats, etc.
I have a Python script that pre-processes the file and writes out a new one with all the errors corrected.
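For context, the clean-up it does locally looks roughly like this (the date formats and the idea of passing in which columns hold dates are simplified stand-ins for what the real script handles):

```python
import csv
import io
from datetime import datetime

def clean_field(value):
    # Flatten embedded linefeeds so free text can't break row boundaries
    return value.replace("\r", " ").replace("\n", " ").strip()

def clean_date(value):
    # Try a few common input formats, normalise to ISO, blank out anything unparseable
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""

def clean_csv(raw_text, date_columns=()):
    reader = csv.reader(io.StringIO(raw_text))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for row_num, row in enumerate(reader):
        cleaned = [clean_field(v) for v in row]
        if row_num > 0:  # leave the header row alone
            for col in date_columns:
                if col < len(cleaned):
                    cleaned[col] = clean_date(cleaned[col])
        writer.writerow(cleaned)
    return out.getvalue()
```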
I need to automate this in the cloud. Since the file is only small, I was thinking I could load its contents into memory, process the records, and then write the result back out to the bucket. I do not want to process the file locally.
Despite extensive searching, I can't find an example of how to read a file in a bucket into memory and then write it back out again.
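In rough terms, the shape of what I'm picturing is something like the snippet below, reusing the `clean_csv` function from above, but I have no idea whether these are the right google-cloud-storage calls or whether this should live in something like a Cloud Function triggered by the upload:

```python
from google.cloud import storage

def process_uploaded_csv(bucket_name, source_name, dest_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Read the raw CSV straight into memory (the file is only small)
    raw_text = bucket.blob(source_name).download_as_text()

    # Run the existing clean-up logic over it
    cleaned_text = clean_csv(raw_text, date_columns=(2,))  # column index is just an example

    # Write the corrected file back to the bucket for BigQuery to load
    bucket.blob(dest_name).upload_from_string(cleaned_text, content_type="text/csv")
```

Is that along the right lines?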
Can anyone help?