Skipping header rows - is it possible with Cloud D

Posted 2019-02-22 04:04

Question:

I've created a Pipeline, which reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (fields).

Is there any way to programmatically set the "number of header rows to skip", as you can in BQ when loading data?

Answer 1:

This is not currently possible. It sounds like there are two potential requests here:

  • Specifying presence and skip behavior for header lines for a BigQuery import.
  • Specifying that a GCS text source should skip a header line.

Future work on this is tracked in https://issues.apache.org/jira/browse/BEAM-123.

In the meantime, you could add a simple filter to your ParDo code to skip headers. Something like this:

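// MatchIfNonHeader is a function you write yourself: it returns
// true for rows to keep and false for header rows.
PCollection<X> rows = ...;
PCollection<X> nonHeaders =
    rows.apply(Filter.by(new MatchIfNonHeader()));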
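
The snippet above leaves MatchIfNonHeader undefined. One way its logic might look is sketched below, assuming the rows are plain text lines and the exact header text is known in advance (the "id,name" value in the comments is only an example). So that the sketch runs without the SDK on the classpath, it implements java.util.function.Predicate; this is a stand-in, and in a real pipeline the same check would go into the function passed to Filter.by, most likely a SerializableFunction<String, Boolean> with an apply method instead of test.

```java
import java.io.Serializable;
import java.util.function.Predicate;

/**
 * Hypothetical predicate that rejects lines equal to a known header.
 * Predicate is used here only so the sketch runs standalone; it is not
 * the interface that Filter.by expects.
 */
final class MatchIfNonHeader implements Predicate<String>, Serializable {
    private static final long serialVersionUID = 1L;

    // Exact text of the header row, e.g. "id,name" (an assumed example).
    private final String headerLine;

    MatchIfNonHeader(String headerLine) {
        this.headerLine = headerLine;
    }

    // Returns true for rows to keep, false for the header row.
    @Override
    public boolean test(String line) {
        return !headerLine.equals(line);
    }
}
```

The check matches on content rather than position because the elements of a PCollection have no guaranteed order, so "the first line" is not something a filter can rely on. The trade-off is that any data row identical to the header text is dropped as well.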