Is there a way to read an Excel file using Dataflo

2019-07-22 20:33发布

问题:

Is there a way to read an Excel file stored in a GCS bucket using Dataflow?

And I would also like to know if we can access the metadata of an object in GCS using Dataflow. If yes then how?

回答1:

CSV files are often used to read files from excel. These files can be split and read line by line so they are ideal for dataflow. You can use TextIO.Read to pull in each line of the file, then parse them as CSV lines.

If you want to use a different binary excel format, then I believe that you would need to read in the entire file and use a library to parse it. I recommend using CSV files if you can.

As for reading the GCS metadata. I don't think that you can do this with TextIO, but you could call the GCS API directly to access the metadata. If you only do this for a few files at the start of your program then it will work and not be too expensive. If you need to read many files like this, you'll be adding an extra RPC for each file.

Be careful to not read the same file multiple times, I suggest reading each file's metadata once once and then writing the metadata out to a side input. Then in one of your ParDo's you can access the side input for each file.

Useful links: ETL & Parsing CSV files in Cloud Dataflow

https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/TextIO.Read

https://cloud.google.com/dataflow/model/par-do#side-inputs



回答2:

You should be able to read the metadata of a GCS file by using the GCS API. However you would need the filenames. You can do this by doing a ParDo or other transform over a list of PCollection<string> which holds the filenames.

We don't have any default readers for excel files. You can parse from a CSV file by using a text input:(ETL & Parsing CSV files in Cloud Dataflow)

I'm not very knowledgeable on excel, and how the file format is stored. If you want to process one file at a time, you can use a PCollection<string> of files. And then use some library to parse the excel file at a time.

If an excel file can be split into easily-parallelizable parts, I'd suggest you take a look at this doc (https://beam.apache.org/documentation/io/authoring-overview/). (If you are still using Dataflow SDK, it should be similar.) It may be worth splitting into smaller chunks before reading to get more parallelization out of your pipeline. In this case you could use IOChannelFactory to read from the file.



回答3:

private static final int BUFFER_SIZE = 64 * 1024;
  private static void printBlob(com.google.cloud.storage.Storage storage, String bucketName, String blobPath) throws IOException, InvalidFormatException {
        try (ReadChannel reader = ((com.google.cloud.storage.Storage) storage).reader(bucketName, blobPath)) {
            InputStream inputStream = Channels.newInputStream(reader);
            Workbook wb = WorkbookFactory.create(inputStream);
            StringBuffer data = new StringBuffer();
            for(int i=0;i<wb.getNumberOfSheets();i++) {
            String fName = wb.getSheetAt(i).getSheetName();
            File outputFile = new File("D:\\excel\\"+fName+".csv");
            FileOutputStream fos = new FileOutputStream(outputFile);
            XSSFSheet sheet = (XSSFSheet) wb.getSheetAt(i);
            Iterator<Row> rowIterator = sheet.iterator();
            data.delete(0, data.length());
            while (rowIterator.hasNext())
            {
                // Get Each Row
                Row row = rowIterator.next();
                data.append('\n'); 
                // Iterating through Each column of Each Row
                Iterator<Cell> cellIterator = row.cellIterator();

                while (cellIterator.hasNext())
                {
                    Cell cell = cellIterator.next();

                    // Checking the cell format
                    switch (cell.getCellType())
                    {
                    case Cell.CELL_TYPE_NUMERIC:
                        data.append(cell.getNumericCellValue() + ",");
                        break;
                    case Cell.CELL_TYPE_STRING:
                         data.append(cell.getStringCellValue() + ",");
                        break;
                    case Cell.CELL_TYPE_BOOLEAN:
                        data.append(cell.getBooleanCellValue() + ",");
                        break;
                    case Cell.CELL_TYPE_BLANK:
                        data.append("" + ",");
                        break;
                    default:
                        data.append(cell + ",");
                    }
                }

            }
            fos.write(data.toString().getBytes());
            }    

        }
      }