Redshift COPY using JSONPath for missing array/fie

2019-07-20 17:34发布

问题:

I am using the COPY command to load the JSON dataset from S3 to Redshift table. The data is getting loaded partially but it ignores records which has missing data(key-value/array) i.e. from the below example only the first record will get loaded.

Query:

COPY address from 's3://mybucket/address.json'
credentials 'aws_access_key_id=XXXXXXX;aws_secret_access_key=XXXXXXX' maxerror as 250
json 's3:/mybucket/address_jsonpath.json';

My question is how can I load all the records from address.json even when some records will have missing key/data, similar to the below sample data set.

Sample of JSON

{
  "name": "Sam P",
  "addresses": [
    {
      "zip": "12345",
      "city": "Silver Spring",
      "street_address": "2960 Silver Ave",
      "state": "MD"
    },
    {
      "zip": "99999",
      "city": "Curry",
      "street_address": "2960 Silver Ave",
      "state": "PA"
    }
  ]
}
{
  "name": "Sam Q",
  "addresses": [ ]
}
{
  "name": "Sam R"
}

Is there an alternative to FILLRECORD for JSON dataset?

I am looking for an implementation or a workaround which can load all the above 3 records in the Redshift table.

回答1:

There is no FILLRECORD equivalent for COPY from JSON. It is explicitly not supported in the documentation.

But you have a more fundamental issue - the first record contains an array of multiple addresses. Redshift's COPY from JSON does not allow you to create multiple rows from nested arrays.

The simplest way to resolve this is to define the files to be loaded as an external table and use our nested data syntax to expand the embedded array into full rows. Then use an INSERT INTO to load the data to a final table.

DROP TABLE IF EXISTS spectrum.partial_json;
CREATE EXTERNAL TABLE spectrum.partial_json (
  name       VARCHAR(100),
  addresses  ARRAY<STRUCT<zip:INTEGER
                         ,city:VARCHAR(100)
                         ,street_address:VARCHAR(255)
                         ,state:VARCHAR(2)>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-test-files/partial_json/'
;

INSERT INTO final_table 
SELECT ext.name
     , address.zip
     , address.city
     , address.street_address
     , address.state
FROM spectrum.partial_json ext
LEFT JOIN ext.addresses address ON true
;
--  name  |  zip  |     city      | street_address  | state
-- -------+-------+---------------+-----------------+-------
--  Sam P | 12345 | Silver Spring | 2960 Silver Ave | MD
--  Sam P | 99999 | Curry         | 2960 Silver Ave | PA
--  Sam Q |       |               |                 |
--  Sam R |       |               |                 |

NB: I tweaked your example JSON a little to make this simpler. For instance you had un-keyed objects as the values for name that I made into plain string values.