Hex string to integer conversion in Amazon Redshif

2020-03-14 18:10发布

Amazon Redshift is based on ParAccel which is based on Postgres. From my research it seems that the preferred way to perform hexadecimal string to integer conversion in Postgres is via a bit field, as outlined in this answer.

In the case of bigint, this would be:

select ('x'||lpad('123456789abcdef',16,'0'))::bit(64)::bigint

Unfortunately, this fails on Redshift with:

ERROR: cannot cast type text to bit [SQL State=42846] 

What other ways are there to perform this conversion in Postgres 8.1ish (that's close to the Redshift level of compatibility)? UDFs are not supported in Redshift and neither are array, regex functions or set generating functions...

3条回答
劫难
2楼-- · 2020-03-14 18:22

A likely explanation is that the cast from text to bit(n) relies on undocumented behavior, I repeat the quote from Tom Lane:

This is relying on some undocumented behavior of the bit-type input converter, but I see no reason to expect that would break. A possibly bigger issue is that it requires PG >= 8.3 since there wasn't a text to bit cast before that.

And Amazon derivate is obviously not allowing this undocumented feature. Not surprising, since it is based off of Postgres 8.1 where there was no cast at all.

Previously quoted in this closely related answer:
Convert hex in text representation to decimal number

查看更多
够拽才男人
3楼-- · 2020-03-14 18:27

It looks like they added a function for this at some point: STRTOL

Syntax

STRTOL(num_string, base)

Return type

BIGINT. If num_string is null, returns NULL.

For example

SELECT strtol('deadbeef', 16);

Returns: 3735928559

查看更多
孤傲高冷的网名
4楼-- · 2020-03-14 18:36

Assuming that you want a simple digit-by-digit ordinal position conversion (i.e. you're not worried about two's compliment negatives, etc) I think this should work on an 8.1-equivalent DB:

CREATE OR REPLACE FUNCTION hex2dec(text) RETURNS bigint AS $$
SELECT sum(CASE WHEN v >= ascii('a') THEN v - ascii('a') + 10 ELSE v - ascii('0') END * 16^ordpos)::bigint
FROM (
    SELECT n-1, ascii(substring(reverse($1), n, 1))
    FROM generate_series(1, length($1)) n
) AS x(ordpos, v);
$$ LANGUAGE sql IMMUTABLE;

The function form is optional, it just makes it easier to avoid repeating the argument a bunch of times. It should get inlined anyway. Efficiency will probably be awful, but most of the tools available to do this smarter don't seem to be available on versions that old, and this at least works:

regress=> CREATE TABLE t AS VALUES ('c13b'), ('a'), ('f');
regress=> SELECT hex2dec(column1) FROM t;
 hex2dec 
---------
   49467
      10
      15
(3 rows)

If you can use regexp_split_to_array and generate_subscripts it might be faster. Or slower. I haven't tried. Another possible trick is to use a digit mapping array instead of the CASE, like:

'[48:102]={0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,11,12,13,14,15}'::integer[]

which you can use with:

CREATE OR REPLACE FUNCTION hex2dec(text) RETURNS bigint AS $$
SELECT sum(
  ('[48:102]={0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,11,12,13,14,15}'::integer[])[ v ]
  * 16^ordpos
)::bigint
FROM (
    SELECT n-1, ascii(substring(reverse($1), n, 1))
    FROM generate_series(1, length($1)) n
) AS x(ordpos, v);
$$ LANGUAGE sql IMMUTABLE;

Personally, I'd do it client-side instead, rather than wrangling the limited capabilities of an old PostgreSQL fork, especially one you can't load your own sensible user-defined C functions on, or use PL/Perl, etc.


In real PostgreSQL I'd just use this:

hex2dec.c:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "errno.h"
#include "limits.h"
#include <stdlib.h>

PG_MODULE_MAGIC;

Datum from_hex(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(hex2dec);

Datum
hex2dec(PG_FUNCTION_ARGS)
{
    char *endpos;
    const char *hexstr = text_to_cstring(PG_GETARG_TEXT_PP(0));
    long decval = strtol(hexstr, &endpos, 16);
    if (endpos[0] != '\0')
    {
        ereport(ERROR, (ERRCODE_INVALID_PARAMETER_VALUE, errmsg("Could not decode input string %s as hex", hexstr)));
    }
    if (decval == LONG_MAX && errno == ERANGE)
    {
        ereport(ERROR, (ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE, errmsg("Input hex string %s overflows int64", hexstr)));
    }
    PG_RETURN_INT64(decval);
}

Makefile:

MODULES = hex2dec
DATA = hex2dec--1.0.sql
EXTENSION = hex2dec

PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

hex2dec.control:

comment = 'Utility function to convert hex strings to decimal'
default_version = '1.0'
module_pathname = '$libdir/hex2dec'
relocatable = true

hex2dec--1.0.sql:

CREATE OR REPLACE FUNCTION hex2dec(hexstr text) RETURNS bigint
        AS 'hex2dec','hex2dec'
        LANGUAGE c IMMUTABLE STRICT;

COMMENT ON FUNCTION hex2dec(hexstr text)
IS 'Decode the hex string passed, which may optionally have a leading 0x, as a bigint. Does not attempt to consider negative hex values.';

Usage:

CREATE EXTENSION hex2dec;

postgres=# SELECT hex2dec('7fffffffffffffff');
       hex2dec       
---------------------
 9223372036854775807
(1 row)

postgres=# SELECT hex2dec('deadbeef');
  hex2dec   
------------
 3735928559
(1 row)

postgres=# SELECT hex2dec('12345');
 hex2dec 
---------
   74565
(1 row)

postgres=# select hex2dec(to_hex(-1));
  hex2dec   
------------
 4294967295
(1 row)

postgres=# SELECT hex2dec('8fffffffffffffff');
ERROR:  Input hex string 8fffffffffffffff overflows int64

postgres=# SELECT hex2dec('0x7abcz123');
ERROR:  Could not decode input string 0x7abcz123 as hex

The performance difference is ... noteworthy. Given sample data:

CREATE TABLE randhex AS 
SELECT '0x'||to_hex( abs(random() * (10^((random()-.5)*10)) * 10000000)::bigint) AS h
FROM generate_series(1,1000000);

conversion from hex to decimal takes about 1.3 from a warm cache using the C extension, which isn't great for a million rows. Reading them without any transformation takes 0.95s. It took 36 seconds for the SQL based hex2dec approach to process the same rows. Frankly I'm really impressed that the SQL approach was as fast as that, and surprised the C ext was that slow.

查看更多
登录 后发表回答