Have two sets of data (two tables) for patient records, one 1999-2003, the other 2004-2009. Each has >100 columns; Table_A has ~8 unique columns, Table_B ~ 25 unique columns (Compared to each other). My goal is:
- A single table with all data from 1999-2009
- For rows in one table that are not in the other, simply have a value of NULL for that column. e.g. if Table A has Diagnostic_Category_12 but Table_B does not, the value will be the original value in Table A, but NULL in table B
I've seen a method for doing this manually:
Unioning Two Tables With Different Number Of Columns
However there are far too many columns in this data set to type each one in - I'd like to just auto-create columns and insert NULL values as needed.
I am using SQL Server 2008R2.
Work smarter, not harder.
I'd recommend that you build up some SQL by querying your schema... this way you don't miss anything by writing things by hand. You can generate the script like so (just replace @tableName1
and @tableName2
values with the appropriate table names):
declare
@tableName1 sysname = 'myfirsttablename'
,@tableName2 sysname = 'mysecondtablename'
,@select varchar(max) = 'select';
declare @columns table
(
Id int identity(1,1)
,ColumName nvarchar(128)
,ExistsInTable1 bit
,ExistsInTable2 bit
);
-- Get a column listing with flags for their existence in each table
insert @columns
select distinct
quotename(c.Column_Name)
,iif(c2.Table_Name is null, 0, 1)
,iif(c3.Table_Name is null, 0, 1)
from Information_Schema.Columns as c
left join Information_Schema.Columns as c2
on c2.Column_Name = c.Column_Name
and c2.Table_Name = @tableName1
left join Information_Schema.Columns as c3
on c3.Column_Name = c.Column_Name
and c3.Table_Name = @tableName2
where c.Table_Name in (@tableName1, @tableName2);
-- Build the select statement for the 1sttable (using null where the column is absent)
select
@select += char(10) + iif(c.Id = 1, ' ', ',')
+ iif(c.ExistsInTable1 = 1, c.ColumName, 'null') + ' as ' + c.ColumName
from @columns as c
order by c.Id;
set @select += '
from ' + quotename(@tableName1) + '
union all
select';
-- Build the select statement for the 2ndtable (using null where the column is absent)
select
@select += char(10) + iif(c.Id = 1, ' ', ',')
+ iif(c.ExistsInTable2 = 1, c.ColumName, 'null') + ' as ' + c.ColumName
from @columns as c
order by c.Id;
set @select += '
from ' + quotename(@tableName2);
-- Print or execute your sql.
print(@select); -- or exec(@select);
Once you've generated your SQL, I'd recommend that you:
- Verify your results and tweak your query as necessary.
- Place the finalized SQL in a stored procedure rather than generating it on the fly for each request.
Even if you think
there are far too many columns in this
data set to type each one in
it's the right thing to do. Any other solution will be basically a hack.
It's easy to do, and I do it often with wider tables (150ish fields).
In SSMS, right click the larger of the two tables, Script Table As
-> Select To
-> New Query Editor Window
. This will output to a new window a select script listing every field in that table, and every field will be on its own row so it's easy to manage.
This is really going to be about 5 minutes of work. Just do it right the first time.
The quick and dirty way would be to add NULL columns with the name of the other tables' unique columns to each table. E.g.:
ALTER TABLE TableA ADD tableBUniqueColumn1 INT SPARSE NULL, tableBUniqueColumn2 INT SPARSE NULL, ...
ALTER TABLE TableB ADD tableAUniqueColumn1 INT SPARSE NULL, tableAUniqueColumn2 INT SPARSE NULL, ...
Now the tables will have the same schema and you can easily perform unions on them.
This is a very hacky workaround. Tables that contain SPARSE NULL columns are usually a warning sign that you are not creating relations and are instead trying to put all data into one table. This is usually a mistake and will make it harder to maintain the data.
If you are trying to normalize your data, it is faster in the long run to create a new schema and populate it with your existing data, not hack your existing tables. It may seem like a lot of work to do that, but you will only have to do it once. If you decide to do hacky workarounds, your work will never end.
Wherever you want the NULL
values you still need to mention them in the query. It's probably going to be quite ugly...
Could you not use the year as a common column and then just LEFT OUTER JOIN
the other two tables to it? For example:
WITH Y AS (
SELECT 1999 YearId UNION SELECT 2000 UNION SELECT 2001 -- and so on...
)
SELECT Y.YearId, Table_A.*, Table_B.*
FROM Y LEFT OUTER JOIN
Table_A ON Y.YearId = Table_A.YearId LEFT OUTER JOIN
Table_B ON Y.YearId = Table_B.YearId
;
Over that 100 columns? Doesn't worth automation. I think manual approach is faster in this very case.
Anyway, there are many ways:
- Use information schema views or catalog views to access columns metadata to create a dynamic insert statement (union statement)
- Use SMO (server management objects) to write a program to merge tables into a third table
- Exporting both tables to Excel, merging them there together, then importing the result into a third table back.
And some other ways.