My filename contains information that I need in my pipeline, for example the identifier for my data points is part of the filename and not a field in the data. e.g Every wind turbine generates a file turbine-loc-001-007.csv. e.g And I need the loc data within the pipeline.
相关问题
- Why do Dataflow steps not start?
- Apache beam DataFlow runner throwing setup error
- Apply Side input to BigQueryIO.read operation in A
- Reading BigQuery federated table as source in Data
- CloudDataflow can not use “google.cloud.datastore”
相关文章
- Apply TensorFlow Transform to transform/scale feat
- Kafka to Google Cloud Platform Dataflow ingestion
- How to run dynamic second query in google cloud da
- How do I use MapElements and KV in together in Apa
- Beam/Google Cloud Dataflow ReadFromPubsub Missing
- Cloud Dataflow failure recovery
- Difference between beam.ParDo and beam.Map in the
- KafkaIO checkpoint - how to commit offsets to Kafk
Java (sdk 2.9.0):
Beams TextIO readers do not give access to the filename itself, for these use cases we need to make use of FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The results of a FileIO read is a PCollection the ReadableFile class contains the file name as metadata which can be used along with the contents of the file.
FileIO does have a convenience method readFullyAsUTF8String() which will read the entire file into a String object, this will read the whole file into memory first. If memory is a concern you can work directly with the file with utility classes like FileSystems.
From: Document Link
Python (sdk 2.9.0):
For 2.9.0 for python you will need to collect the list of URI from outside of the Dataflow pipeline and feed it in as a parameter to the pipeline. For example making use of FileSystems to read in the list of files via a Glob pattern and then passing that to a PCollection for processing.
Once fileio see PR https://github.com/apache/beam/pull/7791/ is available, the following code would also be an option for python.