Copy a datetime with the format rfc822 into redshi

2020-07-20 04:05发布

I have the following redshift table:

DROP TABLE IF EXISTS "logs";
CREATE TABLE "logs" (
  "source" varchar(255) DEFAULT NULL,
  "method" varchar(255) DEFAULT NULL,
  "path" varchar(1023) DEFAULT NULL,
  "format" varchar(255) DEFAULT NULL,
  "controller" varchar(255) DEFAULT NULL,
  "action" varchar(255) DEFAULT NULL,
  "status" integer DEFAULT NULL,
  "duration" float DEFAULT NULL,
  "view" float DEFAULT NULL,
  "db" float DEFAULT NULL,
  "ip" varchar(255)DEFAULT NULL,
  "route" varchar(255) DEFAULT NULL,
  "request_id" varchar(255) DEFAULT NULL,
  "user" INTEGER DEFAULT  NULL,
  "school" varchar(255) DEFAULT NULL,
  "timestamp" datetime DEFAULT NULL
);

So far so good.

The only problem is that the datetime in my source file on s3 is the following: "2015-01-13T11:13:08.869941+00:00". This looks like rfc822 (or rfc3339 or rfc2822).

A few timeformats are supported by the COPY command (see doc: http://docs.aws.amazon.com/redshift/latest/dg/r_DATEFORMAT_and_TIMEFORMAT_strings.html). But not my rfc822 format.

I've tried the following:

TRUNCATE logs;
COPY "logs" FROM 's3://path/to/logstash_logfile.gz'
CREDENTIALS 'aws_access_key_id=THE_KEY;aws_secret_access_key=THE_SECRET'
TIMEFORMAT AS 'MM-DD-YYYYTHH:MI:SS'
JSON 's3://path/to/jsonpath.json' GZIP;

But I'm getting:

SELECT * FROM stl_load_errors;

Invalid timestamp format or value [MM-DD-YYYYTHH:MI:SS]

2条回答
家丑人穷心不美
2楼-- · 2020-07-20 04:34

Use TIMEFORMAT 'auto' instead.

It's able to import

2015-01-13T11:13:08.869941+00:00

as

2015-01-13 11:13:08.869941.

I assume this method just discards the timezone information, but at least you can get the data in this way.

If you have various timezones in the data, maybe you need to do some preprocessing to convert everything into UTC, for example.

Unfortunately I think the COPY with a provided time format is rather strict and doesn't support timezone parts.

查看更多
冷血范
3楼-- · 2020-07-20 04:39

We had exactly the same issue and found a workaround:

CREATE TABLE final_table ("ts_as_timestamptz" TIMESTAMPTZ);
CREATE TEMP TABLE helper_table ("ts_as_varchar" VARCHAR(64));

COPY "helper_table" FROM 's3://path/to/file.csv.gz'
CREDENTIALS 'aws_access_key_id=THE_KEY;aws_secret_access_key=THE_SECRET'
CSV
GZIP;

INSERT INTO final_table (ts_as_timestamptz)
SELECT ts_as_varchar::TIMESTAMPTZ FROM helper_table;

Or, alternatively:

CREATE TABLE final_table ("ts_as_timestamp" TIMESTAMP);
CREATE TEMP TABLE helper_table ("ts_as_varchar" VARCHAR(64));

COPY "helper_table" FROM 's3://path/to/file.csv.gz'
CREDENTIALS 'aws_access_key_id=THE_KEY;aws_secret_access_key=THE_SECRET'
CSV
GZIP;

INSERT INTO final_table (ts_as_timestamp)
SELECT ts_as_varchar::TIMESTAMPTZ FROM helper_table;

You can quickly test with this:

DROP TABLE IF EXISTS helper_table;
CREATE TEMP TABLE helper_table ("ts_as_varchar" VARCHAR(64));
INSERT INTO helper_table (ts_as_varchar) VALUES 
    ('2015-01-13T11:13:08.869941+00:00'),
    ('2015-01-13T12:13:08.869941+01:00'),
    ('2015-01-13T13:13:08.869+02:00'), 
    ('2015-01-13T14:13:08+03:00'),
    ('2015-01-13T11:13:08'),
    ('2015-01-13 11:13:08.869941+00:00'),
    ('2015-01-13 12:13:08.869941+01:00'),
    ('2015-01-13 13:13:08.869+02:00'), 
    ('2015-01-13 14:13:08+03:00'),
    ('2015-01-13 11:13:08')
;

DROP TABLE IF EXISTS final_table;
CREATE TEMP TABLE final_table (
    "ts_as_varchar" VARCHAR(64),
    "ts_as_timestamptz" TIMESTAMPTZ,
    "ts_as_timestamp" TIMESTAMP
    );
INSERT INTO final_table (ts_as_varchar, ts_as_timestamptz, ts_as_timestamp)
SELECT ts_as_varchar, ts_as_varchar::TIMESTAMPTZ, ts_as_varchar::TIMESTAMPTZ
FROM helper_table;

-- The following depends on the time zone of your SQL client, so the results may vary. It is also vulnerable to the SQL client removing the sub-second parts.
-- SELECT * FROM final_table;
-- The following may (?) work better even if your SQL client is not in UTC
SELECT ts_as_varchar, ts_as_timestamptz::VARCHAR, ts_as_timestamp::VARCHAR
FROM final_table;

Which gives these results:

ts_as_varchar                       ts_as_timestamptz                ts_as_timestamp
2015-01-13T11:13:08                 2015-01-13 11:13:08+00           2015-01-13 11:13:08
2015-01-13T11:13:08.869941+00:00    2015-01-13 11:13:08.869941+00    2015-01-13 11:13:08.869941
2015-01-13T12:13:08.869941+01:00    2015-01-13 11:13:08.869941+00    2015-01-13 11:13:08.869941
2015-01-13T13:13:08.869+02:00       2015-01-13 11:13:08.869+00       2015-01-13 11:13:08.869
2015-01-13T14:13:08+03:00           2015-01-13 11:13:08+00           2015-01-13 11:13:08
2015-01-13 11:13:08                 2015-01-13 11:13:08+00           2015-01-13 11:13:08
2015-01-13 11:13:08.869941+00:00    2015-01-13 11:13:08.869941+00    2015-01-13 11:13:08.869941
2015-01-13 12:13:08.869941+01:00    2015-01-13 11:13:08.869941+00    2015-01-13 11:13:08.869941
2015-01-13 13:13:08.869+02:00       2015-01-13 11:13:08.869+00       2015-01-13 11:13:08.869
2015-01-13 14:13:08+03:00           2015-01-13 11:13:08+00           2015-01-13 11:13:08

Tested using Redshift 1.0.2610 Note that your SQL client or driver may do some timezone conversions that may be misleading, so the best is to test this using UTC as the time zone of your computer / driver / SQL client. Also, some SQL clients remove the sub-second part of the timestamps.

查看更多
登录 后发表回答