I'm looking for a way to read ENTIRE files, so that each file is read in full into a single String. I want to pass a pattern of JSON text files such as gs://my_bucket/*/*.json and have a ParDo then process each file as a whole.
What's the best approach to it?
I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.
I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(<source>). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.
I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:
FileBasedSource#isSplittable(): you will want to override and return false. This will indicate that there is no intra-file splitting.
FileBasedSource#createForSubrangeOfFile(String, long, long): you will override to return a sub-source for just the file specified.
FileBasedSource#createSingleFileReader(): you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).
To implement the reader:
FileBasedReader#startReading(...): you will override to do nothing; the framework will already have opened the file for you, and it will close it.
FileBasedReader#readNextRecord(): you will override to read the entire file as a single element.
[1] One easy special case is when you actually have a small number of files, you can expand them prior to job submission, and they all take the same amount of time to process. Then you can just use Create.of(expand(<glob>)) followed by ParDo(<read a file>).
A much simpler method is to generate the list of filenames and write a function to process each file individually. I'm showing Python, but Java is similar:
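A minimal sketch of that approach, using only the Python standard library. It assumes the files are reachable as local paths (for a gs:// pattern the listing step would need a storage client rather than glob), and the names list_files, read_whole_file, process_file, and process_all are hypothetical helpers chosen for illustration:

```python
import glob
import json


def list_files(pattern):
    """Expand a glob pattern into a sorted list of matching paths."""
    # Assumption: local paths stand in for the bucket contents.
    return sorted(glob.glob(pattern))


def read_whole_file(path):
    """Read an entire file into a single string."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def process_file(path):
    """Parse one whole file as JSON and return (path, parsed document)."""
    return path, json.loads(read_whole_file(path))


def process_all(pattern):
    """Process every file matching the pattern, one whole file at a time."""
    return [process_file(path) for path in list_files(pattern)]
```

In a pipeline, the expanded list of filenames would likely be the input to a Create step, with something like process_file as the per-element function, so each element handed to the function is one complete file.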
I was looking for a similar solution myself. Following Kenn's recommendations and a few other references such as XMLSource.java, I created the following custom source, which seems to be working fine.
I am not a developer, so if anyone has suggestions on how to improve it, please feel free to contribute.