I have a task to complete where I have to import around 970,000 users from someone else's database into ours. In the source DB, there is a phone number field that contains multiple phone numbers concatenated into one hideous string.
Here are some data examples:
|Home: 555-555-5555 Office: (555)-555-5555 Work: 5555555555|
|Home: Office: 555\555-5555 Work: 555-555-5555|
|Office: 555-555-5555 Home: (555)555-5555 some Comment here|
The problems that I'm running into is
- The order of the numbers is inconsistent
- There are some free text comments strewn about
- Some of the phone numbers have different formatting.
I would really prefer to do this through SQL if possible and am pretty stumped on doing this in an efficient manner with minimal Manual adjustment.
In my DB, we have separate columns for each phone number type, so I basically need to split those strings into their appropriate columns.
Please tell me if I left anything out.
In this code, you will need 3 extra columns to store the new phone numbers
This is the logic in the code
splitting the phone numbers
cutting off text before first 3 last 3 numerics in the split result
deleting the alien characters used in phone number(only those used in sample)
inserting replacement separators '-' at position 7 and 4
grouping up data
updating table
Sample data:
DECLARE @t table
(phone varchar(500), home varchar(50), work varchar(50), office varchar(50))
INSERT @t(phone) values
('Home: 555-555-5551 Office: (555)-555-5555 Work: 5555555552|'),
('|Home: Office: 555\555-5555 Work: 555-555-5555|'),
('|Office: 555-555-5555 Home: (555)555-5555 some Comment here|')
Update:
;WITH CTE as
(
SELECT
nid,work, home, office,
t.c.value('.', 'VARCHAR(2000)') phone
FROM (
SELECT
row_number() over(order by (select 1)) nid, work, home,office,
x = CAST('<t>' +
REPLACE(REPLACE(REPLACE(phone, 'Work', '</t><t>work')
,'Office', '</t><t>Office'), 'Home', '</t><t>Home')
+ '</t>' AS XML)
FROM @t -- replace @t with your table
) a
CROSS APPLY x.nodes('/t') t(c)
WHERE t.c.value('.', 'VARCHAR(2000)') like '%[0-9][0-9][0-9]%'
), CTE2 as
(
SELECT
work,max(case when phone like '%work%' then z end) over(partition by nid)nwork,
home,max(case when phone like '%home%' then z end) over(partition by nid)nhome,
office,max(case when phone like '%office%' then z end) over(partition by nid)noffice
FROM cte t
CROSS APPLY(SELECT REVERSE(SUBSTRING(phone,PATINDEX('%[0-9][0-9][0-9]%', phone), 20))x)y
CROSS APPLY(SELECT STUFF(STUFF(REPLACE(REPLACE(REPLACE(REVERSE(
SUBSTRING(x, PATINDEX('%[0-9][0-9][0-9]%', x), 20)), ')', ''), '\', ''),
'-', ''),7,0, '-'),4,0,'-')z)v )
UPDATE CTE2
SET work = nwork, home = nhome, office = noffice
SELECT home,work,office FROM @t
Result:
home work office
555-555-5551 555-555-5552 555-555-5555
NULL 555-555-5555 555-555-5555
555-555-5555 NULL 555-555-5555
Using T-SQL for this task would not be the best choice. The closest solution would be to create a CLR assembly which would utilise the RegEx functionality available in .NET.
However, you may also look into Data Quality Services. It is a SQL Server component which was created for this particular type of task - manual entry cleansing, unification, de-duplication, etc. It requires BI or Enterprise edtion of SQL Server, however.