How to find out bad data causing this insert to fa

2019-08-23 01:00发布

问题:

I have a database (Postgres 9.3.5) of 80 millions records, the insert query below fails with:

ERROR:  invalid input syntax for integer: ""

INSERT INTO DISCOGS.TRACK_DURATION
     SELECT
        track_id,
        duration,
        hours_as_seconds + minutes_as_seconds + seconds as total_seconds
    FROM (
            select
            track_id,
            duration,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 1))::bigint * 60 * 60
                ELSE 0
            END  as hours_as_seconds,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 2))::bigint * 60
                WHEN duration like '%:%'  THEN  (split_part(duration, ':', 1))::bigint * 60
                ELSE 0
            END as minutes_as_seconds,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 3))::bigint
                WHEN duration like '%:%'   THEN (split_part(duration, ':', 2))::bigint
                ELSE 0
            END as seconds
            from discogs.track t1
            where release_id < 10000000
            and t1.duration!='' and t1.duration is not null
            and t1.position!=''
    ) as s1

I can use the where release_id to limit the number of records checked and with lower values its fine, so its bad data , but with so many records how do I find the problem data. Note I'm already filtering out values where duration is empty string and I also found a few records with bad data (such as %%%%) that I have changed but it is still failing.

回答1:

I would search for malformed durations using a regular expression, as in:

create table duration (
  d varchar(20)
);

insert into duration (d) values ('12:34:56');
insert into duration (d) values ('34:56');
insert into duration (d) values ('15::'); -- bad one
insert into duration (d) values (':34:56'); -- bad one
insert into duration (d) values (':34:'); -- bad one
insert into duration (d) values ('12:34:'); -- bad one
insert into duration (d) values ('34:'); -- bad one
insert into duration (d) values (':56'); -- bad one

select *
  from duration 
  where d not similar to '([0-9]+:)?[0-9]+:[0-9]+'

Result:

d                     
------
15::                  
:34:56                
:34:                  
12:34:                
34:                   
:56 

In your case the query should look like:

select track_id, duration 
  from discogs.track
  where duration not similar to '([0-9]+:)?[0-9]+:[0-9]+';