Importing a CSV file (with empty strings and duplicates) into Amazon DynamoDB

Posted 2019-05-10 21:44

I have a CSV file that I'm trying to import to Amazon DynamoDB. So I upload it to S3, set up an EMR cluster, and create an external table like this:

hive> CREATE EXTERNAL TABLE s3_table_myitems (colA BIGINT, colB STRING, colC STRING, colD DOUBLE, colE DOUBLE, colF STRING, colG STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('serialization.null.format'='""')
    STORED AS TEXTFILE
    LOCATION 's3://bucketname/dirname/'
    TBLPROPERTIES ('skip.header.line.count'='1');

Any of the columns in the CSV may be empty, but DynamoDB can't deal with empty strings ("com.amazonaws.AmazonServiceException: One or more parameter values were invalid: An AttributeValue may not contain an empty string").

This is what Amazon says:

We will consider this optional "ignore empty string" behavior in a future release. … As a workaround, you could … transform empty attribute values into NULLs. For example, you can … use a more complex SELECT expression to turn empty strings into something else, including setting them to NULL.

So this is what I came up with, but it looks ugly:

hive> INSERT INTO TABLE ddb_tbl_myitems
    SELECT
    regexp_replace(colA, '^$', 'NULL'),
    regexp_replace(colB, '^$', 'NULL'),
    regexp_replace(colC, '^$', 'NULL'),
    regexp_replace(colD, '^$', 'NULL'),
    regexp_replace(colE, '^$', 'NULL'),
    regexp_replace(colF, '^$', 'NULL'),
    regexp_replace(colG, '^$', 'NULL')
    FROM s3_table_myitems;

Is there a better solution to the overall problem (short of pre-processing the CSV), or at least a better SELECT syntax?
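(For comparison, the same placeholder substitution can be written with Hive's IF conditional instead of a regex. This is a minimal sketch, assuming the same implicit string coercion the regexp_replace version relies on; NULLIF would be terser still, but it only exists from Hive 2.3.0 onward and may not be available on older EMR releases:)

hive> INSERT INTO TABLE ddb_tbl_myitems
    SELECT
    IF(colA = '', 'NULL', colA),  -- '' never matches a populated value, so non-empty columns pass through unchanged
    IF(colB = '', 'NULL', colB),
    IF(colC = '', 'NULL', colC),
    IF(colD = '', 'NULL', colD),
    IF(colE = '', 'NULL', colE),
    IF(colF = '', 'NULL', colF),
    IF(colG = '', 'NULL', colG)
    FROM s3_table_myitems;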


Edit: I ended up having to deal with duplicates as well ("com.amazonaws.AmazonServiceException: Provided list of item keys contains duplicates").

For posterity, here's my complete flow. I'd love to hear of a better way of doing this, both for aesthetics and for performance. The task is seemingly simple ("importing a CSV file into DynamoDB") but coming up with this has taken hours so far :P

# source
hive> CREATE EXTERNAL TABLE s3_table_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('serialization.null.format'='""')
    STORED AS TEXTFILE
    LOCATION 's3://bucketname/dirname/'
    TBLPROPERTIES ('skip.header.line.count'='1');

# destination
hive> CREATE EXTERNAL TABLE ddb_tbl_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES ("dynamodb.table.name" = "myitems",
        "dynamodb.column.mapping" = "colA:colA,colB:colB,colC:colC,colD:colD,colE:colE,colF:colF");

# remove dupes - http://stackoverflow.com/a/34165762/594211
hive> CREATE TABLE tbl_myitems_deduped AS
    SELECT colA, min(colB) AS colB, min(colC) AS colC, min(colD) AS colD, min(colE) AS colE, min(colF) AS colF
    FROM (SELECT colA, colB, colC, colD, colE, colF, rank() OVER
        (PARTITION BY colA ORDER BY colB, colC, colD, colE, colF)
        AS col_rank FROM s3_table_myitems) t
    WHERE t.col_rank = 1
    GROUP BY colA;
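(An aside on the dedupe step: rank() assigns tied rows the same rank, which is why the min()/GROUP BY wrapper is still needed on top of it. Here is a sketch of an alternative using row_number() — available since Hive 0.11 — which numbers ties distinctly, so a single pass suffices:)

hive> CREATE TABLE tbl_myitems_deduped AS
    SELECT colA, colB, colC, colD, colE, colF
    FROM (SELECT colA, colB, colC, colD, colE, colF,
        row_number() OVER (PARTITION BY colA ORDER BY colB, colC, colD, colE, colF)
        AS row_num FROM s3_table_myitems) t
    WHERE t.row_num = 1;  -- exactly one row per colA survives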

# replace empty strings with placeholder 'NULL'
hive> CREATE TABLE tbl_myitems_noempty AS
    SELECT colA,
    regexp_replace(colB, '^$', 'NULL') AS colB,
    regexp_replace(colC, '^$', 'NULL') AS colC,
    regexp_replace(colD, '^$', 'NULL') AS colD,
    regexp_replace(colE, '^$', 'NULL') AS colE,
    regexp_replace(colF, '^$', 'NULL') AS colF
    FROM tbl_myitems_deduped
    WHERE LENGTH(colA) > 0;

# ...other preprocessing here...

# insert to DB
hive> INSERT INTO TABLE ddb_tbl_myitems
    SELECT * FROM tbl_myitems_noempty;

Note: colA is the partition key.
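(As a quick sanity check after the insert, the DynamoDB-backed table can be read back through the same storage handler — though note that COUNT(*) scans the entire DynamoDB table and consumes read throughput:)

hive> SELECT COUNT(*) FROM ddb_tbl_myitems;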

1 Answer
Answered 2019-05-10 22:25

You can add an additional table property to your CREATE TABLE statement that will treat any specified string as a null value:

TBLPROPERTIES('serialization.null.format'='');
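Applied to the source table above, that would look something like this (a sketch following the answer's TBLPROPERTIES placement; the question's version put the property in SERDEPROPERTIES with the value '""', and it is worth verifying that OpenCSVSerde honors this property at all, since serialization.null.format is traditionally a LazySimpleSerDe setting):

hive> CREATE EXTERNAL TABLE s3_table_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    STORED AS TEXTFILE
    LOCATION 's3://bucketname/dirname/'
    TBLPROPERTIES ('skip.header.line.count'='1',
        'serialization.null.format'='');

If empty fields are mapped to true NULLs at the SerDe level, the regexp_replace placeholder step should become unnecessary, since the DynamoDB storage handler omits attributes whose value is NULL.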