MS SQL — How to Extract Phone Number From Long Inc

2019-07-26 08:43发布

问题:

I have a task to complete where I have to import around 970,000 users from someone else's database into ours. In the source DB, there is a phone number field that contains multiple phone numbers concatenated into one hideous string.

Here are some data examples:

|Home: 555-555-5555 Office: (555)-555-5555 Work: 5555555555|
|Home: Office: 555\555-5555 Work: 555-555-5555|
|Office: 555-555-5555 Home: (555)555-5555 some Comment here|

The problems that I'm running into is

  1. The order of the numbers is inconsistent
  2. There are some free text comments strewn about
  3. Some of the phone numbers have different formatting.

I would really prefer to do this through SQL if possible and am pretty stumped on doing this in an efficient manner with minimal Manual adjustment.

In my DB, we have separate columns for each phone number type, so I basically need to split those strings into their appropriate columns.

Please tell me if I left anything out.

回答1:

In this code, you will need 3 extra columns to store the new phone numbers

This is the logic in the code

  1. splitting the phone numbers

  2. cutting off text before first 3 last 3 numerics in the split result

  3. deleting the alien characters used in phone number(only those used in sample)

  4. inserting replacement separators '-' at position 7 and 4

  5. grouping up data

  6. updating table

Sample data:

DECLARE @t table
  (phone varchar(500), home varchar(50), work varchar(50), office varchar(50))
INSERT @t(phone) values
('Home: 555-555-5551 Office: (555)-555-5555 Work: 5555555552|'),
('|Home: Office: 555\555-5555 Work: 555-555-5555|'),
('|Office: 555-555-5555 Home: (555)555-5555 some Comment here|')

Update:

;WITH CTE as
(
     SELECT
       nid,work, home, office, 
       t.c.value('.', 'VARCHAR(2000)') phone
     FROM (
         SELECT
           row_number() over(order by (select 1)) nid, work, home,office, 
           x = CAST('<t>' + 
               REPLACE(REPLACE(REPLACE(phone, 'Work', '</t><t>work')
               ,'Office', '</t><t>Office'), 'Home', '</t><t>Home')
                + '</t>' AS XML)
        FROM @t -- replace @t with your table
     ) a
     CROSS APPLY x.nodes('/t') t(c)
     WHERE t.c.value('.', 'VARCHAR(2000)') like '%[0-9][0-9][0-9]%'
), CTE2 as
(
SELECT 
work,max(case when phone like '%work%' then z end) over(partition by nid)nwork,
home,max(case when phone like '%home%' then z end) over(partition by nid)nhome,
office,max(case when phone like '%office%' then z end) over(partition by nid)noffice
FROM cte t
CROSS APPLY(SELECT REVERSE(SUBSTRING(phone,PATINDEX('%[0-9][0-9][0-9]%', phone), 20))x)y
CROSS APPLY(SELECT STUFF(STUFF(REPLACE(REPLACE(REPLACE(REVERSE(
  SUBSTRING(x, PATINDEX('%[0-9][0-9][0-9]%', x), 20)), ')', ''), '\', ''),
  '-', ''),7,0, '-'),4,0,'-')z)v    )
UPDATE CTE2
SET work = nwork, home = nhome, office = noffice

SELECT home,work,office FROM @t

Result:

home          work          office
555-555-5551  555-555-5552  555-555-5555
NULL          555-555-5555  555-555-5555
555-555-5555  NULL          555-555-5555


回答2:

Using T-SQL for this task would not be the best choice. The closest solution would be to create a CLR assembly which would utilise the RegEx functionality available in .NET.

However, you may also look into Data Quality Services. It is a SQL Server component which was created for this particular type of task - manual entry cleansing, unification, de-duplication, etc. It requires BI or Enterprise edtion of SQL Server, however.