How to batch load custom Avro data generated from another source?

Posted 2019-07-22 01:29

Question:

The Cloud Spanner docs say that Spanner can export/import Avro format. Can this path also be used for batch ingestion of Avro data generated from another source? The docs seem to suggest it can only import Avro data that was also generated by Spanner.

I ran a quick export job and took a look at the generated files. The manifest and schema look pretty straight forward. I figured I would post here in case this rabbit hole is deep.

manifest file

{
  "files": [{
    "name": "people.avro-00000-of-00001",
    "md5": "HsMZeZFnKd06MVkmiG42Ag=="
  }]
}

schema file

{
  "tables": [{
    "name": "people",
    "manifestFile": "people-manifest.json"
  }]
}

data file

    {"type":"record",
    "name":"people",
    "namespace":
    "spannerexport","
    fields":[
{"name":"fullName",
"type":["null","string"],
"sqlType":"STRING(MAX)"},{"name":"memberId",
"type":"long",
"sqlType":"INT64"}
],
    "googleStorage":"CloudSpanner",
    "spannerPrimaryKey":"`memberId` ASC",
    "spannerParent":"",
    "spannerPrimaryKey_0":"`memberId` ASC",
    "googleFormatVersion":"1.0.0"}    

Answer 1:

In response to your question: yes! There are two ways to ingest Avro data into Cloud Spanner.

Method 1

If you place Avro files in a Google Cloud Storage bucket arranged the way a Cloud Spanner export operation would arrange them, and you generate manifest files formatted as Cloud Spanner expects, then the import functionality in the Cloud Spanner web interface will work. There can be a lot of tedious formatting work here, which is why the official documentation states that the "import process supports only Avro files exported from Cloud Spanner".
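
As a rough sketch (with a hypothetical bucket name and export directory), the layout the import expects ties the three files from the question together as shown below; in Spanner's own exports the top-level file is typically named spanner-export.json:

    gs://my-bucket/people-export/
        spanner-export.json            lists the tables and points at each table's manifest
                                       {"tables": [{"name": "people", "manifestFile": "people-manifest.json"}]}
        people-manifest.json           lists the data files for the table with their MD5 checksums
                                       {"files": [{"name": "people.avro-00000-of-00001", "md5": "..."}]}
        people.avro-00000-of-00001     the Avro data file, whose embedded schema carries the sqlType,
                                       spannerPrimaryKey and googleFormatVersion annotations shown above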

Method 2

Instead of running the import/export job through the Cloud Spanner web console and relying on the Avro manifest and data files being perfectly formatted, you can slightly modify the code in either of two public GitHub repositories under the GoogleCloudPlatform organization that provide import/export (or backup/restore, or export/ingest) functionality for moving Avro data into Google Cloud Spanner: (1) Dataflow Templates, especially this file, and (2) Pontem, especially this file.

Both repositories contain Dataflow jobs that let you move data into and out of Cloud Spanner using the Avro format, and each has its own way of parsing an Avro schema on input (i.e., when moving data from Avro into Cloud Spanner). Since your use case is input (ingesting Avro-formatted data into Cloud Spanner), you need to modify the Avro parsing code to fit your specific schema and then launch the Cloud Dataflow job from the command line on your local machine (the job is then uploaded to and run on Google Cloud Platform); a minimal sketch of that idea appears below.

If you are not familiar with Cloud Dataflow, it is a tool for defining and running jobs with large data sets.
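
As a hedged sketch of what such a pipeline boils down to (not the templates' actual code): the example below assumes the people schema from the question, a hypothetical GCS path, and hypothetical instance/database IDs, and uses Apache Beam's AvroIO and SpannerIO to read the Avro records and write them to Spanner as mutations. The real repositories add generic schema parsing and error handling on top of this.

    import com.google.cloud.spanner.Mutation;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;

    public class AvroToSpanner {
      // Reader schema matching the "people" data file shown in the question.
      private static final String SCHEMA_JSON =
          "{\"type\":\"record\",\"name\":\"people\",\"namespace\":\"spannerexport\","
              + "\"fields\":["
              + "{\"name\":\"fullName\",\"type\":[\"null\",\"string\"]},"
              + "{\"name\":\"memberId\",\"type\":\"long\"}]}";

      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // Hypothetical GCS location of the externally generated Avro files.
            .apply("ReadAvro",
                AvroIO.readGenericRecords(SCHEMA_JSON)
                    .from("gs://my-bucket/people-export/people.avro-*"))
            // Turn each Avro record into a Spanner mutation for the "people" table.
            .apply("ToMutation",
                MapElements.via(new SimpleFunction<GenericRecord, Mutation>() {
                  @Override
                  public Mutation apply(GenericRecord record) {
                    Mutation.WriteBuilder m = Mutation.newInsertOrUpdateBuilder("people");
                    Object fullName = record.get("fullName");
                    if (fullName != null) {
                      m.set("fullName").to(fullName.toString());
                    }
                    m.set("memberId").to((Long) record.get("memberId"));
                    return m.build();
                  }
                }))
            // Write the mutations into the target Cloud Spanner database.
            .apply("WriteToSpanner",
                SpannerIO.write()
                    .withInstanceId("my-instance")   // hypothetical instance ID
                    .withDatabaseId("my-database")); // hypothetical database ID

        pipeline.run();
      }
    }

You would launch this with the usual Beam/Dataflow pipeline options, something like --runner=DataflowRunner --project=<your-project> --region=<region> --tempLocation=gs://<bucket>/tmp, which is what running the job from the command line amounts to.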



Answer 2:

As the documentation specifically states that importing only supports Avro files initially exported from Spanner [1], I've raised a feature request for this, which you can track here.

[1] https://cloud.google.com/spanner/docs/import